The hardware behind bugzilla.mozilla.org

I recently did up a diagram of how our Bugzilla site is set up, mostly for the benefit of other sysadmins trying to find the various pieces of it.  Several folks suggested sharing it with the community, just to show an example of how we're set up.  So I cleaned it up a little, and here it is:

[Diagram: Bugzilla physical layout — click the image for a full-size (readable) version]

At first glance it looks somewhat excessive just for a Bugzilla, but the Mozilla Project lives and dies by the content of this site: pretty much all work stops if it goes down, so it's one of our highest-priority sites to keep operating at all times for developer support.  The actual hardware required to run the site at full capacity for the number of users hitting it is a little less than half of what's shown in the diagram.

We have the entire site set up in two different datacenters (SJC1 is our San Jose datacenter, PHX1 is our Phoenix datacenter).  Thanks to the load balancers taking care of the cross-datacenter connections for the master databases, it's actually possible to run it from both sites concurrently to split the load.  But because of the amount of traffic Bugzilla sends to the master databases, and the latency of connection setup over that distance, it's noticeably slower from whichever datacenter isn't currently hosting the master, so we've been keeping DNS pointed at just one of them to keep it speedy.
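The general idea, sketched here in HAProxy-style config purely as an illustration (our actual load balancers, names, and addresses differ), is that each datacenter exposes a local virtual server for the master database, preferring the local master and falling back to the remote one:

    # Illustration only: names and addresses are made up.
    # A local TCP virtual server for "the master DB" in one datacenter.
    listen bugzilla-master-db
        bind 10.1.0.5:3306
        mode tcp
        server phx1-master 10.1.0.10:3306 check
        server sjc1-master 10.2.0.10:3306 check backup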

This still works great as a hot failover, though, which got tested in action this last Sunday when we had a system board failure on the master database server in Phoenix.  Failing the entire site over to San Jose took only minutes, and the tech from HP showed up to swap the system board 4 hours later.  The fun part was that I had only finished setting up this hot failover about a week prior, so the timing of that system board failure couldn't have been better.  If it had happened any sooner, we might have been down for a long time waiting for the server to get fixed.

When everything is operational, we try to keep the site primarily hosted in Phoenix.  As you can see in the diagram, the database servers in Phoenix use solid-state disks for the database storage.  The speed improvement on large queries from using these instead of traditional spinning disks is just amazing.  I haven't done any actual timing to get hard numbers on that, but the difference is large enough that you can easily notice it just from using the site.

Android application wish list: monitor phone status via a web page served from the phone

I have an HTC Evo running Android 2.2 (Froyo).  I absolutely love the thing.  The only downside is the short battery life.  One really cool feature it has is the WiFi HotSpot utility, which turns the phone into a WiFi access point, sharing its 3G or 4G internet connection over WiFi.  It turns out that the Evo has a really good WiFi antenna in it, too.  It has better range as an access point than my Linksys WRT54G at home (even without the fancy antennas sticking out of the back).

This really comes in handy when I'm at my parents' house up north in the boonies, where there's no broadband internet available and the cell coverage is spotty at best.  I have to wander the house in search of a spot that can hold a signal long enough to do anything useful with it.  Gone are the days of holding your laptop in some weird spot so the data card could get enough signal.  Now, once I find a spot where the phone can keep a signal, I can just leave the phone there and go wherever I want with the laptop, sitting in comfort and using the connection via the WiFi hotspot.

Now, since I'm out in the boonies, even the best spot in the house to leave the phone still has a spotty connection that comes and goes.  Staying within visual range of the phone so I can watch the signal strength display (and going to wake the phone back up whenever it puts the screen to sleep) is a pain.  I'd love it if there were an app that would serve up the phone's current signal strength and data status over HTTP on its WiFi interface (even on some random port, since you can't use port 80 without rooting the phone).  Given the variety of stuff in the marketplace, it would surprise me if such a thing didn't already exist.  But it's also hard enough to describe that any keywords I can think of to search with give me several dozen unrelated hits.  So, Lazyweb, anyone know of such a thing?
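To make it easier to recognize, here's roughly the behavior I'm imagining, sketched in Python with faked readings (a real app would be an Android service pulling live values from the telephony APIs, and the port number is arbitrary):

    # Conceptual sketch only: readings are faked, and on the phone this
    # would be an Android service reading TelephonyManager, not Python.
    import BaseHTTPServer

    PORT = 8080  # any unprivileged port; port 80 would need a rooted phone

    def current_status():
        # A real app would pull these from the telephony/battery APIs.
        return "signal: -87 dBm\ndata: 3G connected\nbattery: 62%\n"

    class StatusHandler(BaseHTTPServer.BaseHTTPRequestHandler):
        def do_GET(self):
            body = current_status()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    BaseHTTPServer.HTTPServer(("", PORT), StatusHandler).serve_forever()

Then I could just point the laptop's browser at http://<phone's wifi address>:8080/ and hit refresh whenever I wonder how the signal is doing.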

Thanks!

Upgrading bugzilla.mozilla.org to version 3.4.3

We’re finally at the point where I can say we’re ready to upgrade Bugzilla @ Mozilla this weekend.  We’re aiming for Sunday evening (probably 6pm PST).  I’ll post again when I know how long it’ll be down for (and that’ll be included in the eventual downtime notice on the IT blog as well).

There’s a staging copy set up at https://bugzilla-stage-tip.mozilla.org/ and I would appreciate people playing around with it and finding anything that might be broken before it goes to production.  Before filing bugs, check the detailed status linked from the red box at the top of every page to make sure the problem isn't already listed (you can also follow my progress on cosmetic issues and so forth there).

The staging copy will be down for a while at some point tonight while I reload it with an up-to-date snapshot of the production server (that'll also be my test of how long the production upgrade will take).  I'm super excited because this has been a long time coming. 🙂

Upgrading from RHEL4 to RHEL5 without a CD or a PXE server

At Mozilla, as our server farm has grown, we've gotten reloading and upgrading machines down to a science. We have a PXE server in each colo, and installations and upgrades run over the network. It's a great system until you hit one of the following two situations:

  1. The machine you need to upgrade is in someone else’s colo, with no local PXE server you have control over and no one on-site to do CD flipping for you. -or-
  2. The machine you need to upgrade *is* the PXE server.

We have situation #2 coming up soon, but I had a machine with situation #1 that I recently experimented on to get this all working without PXE or a CD.

We had a major push to get the entire RHEL4 portion of our infrastructure either reloaded or upgraded to RHEL5 about a year ago. Several machines didn't get upgraded at the time, for a few reasons: some were due to be replaced soon anyway; some ran the RHN Proxies (RHN Proxy didn't run on RHEL5 yet); some ran software that didn't yet work on RHEL5 (Zimbra with clustering support, in our case); and some were in remote locations with no PXE and nobody to flip CDs.

RHN has a remote kickstart capability. I've experimented with it before, but never had much success getting it working; I probably just didn't play with it enough. But going on the general concept of how it works, I discovered it's quite easy to boot into the Anaconda installer from grub: just copy the isolinux directory off the CD/DVD into /boot, and set up an entry for it in grub.conf that looks very similar to the one in pxelinux.cfg on the PXE server.
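Something along these lines (grub legacy syntax, which is what RHEL4/5 use; the hostname and paths are placeholders for your own kickstart setup):

    # One-shot grub.conf entry to boot the RHEL5 installer.
    # Hostname and kickstart path are placeholders.
    title RHEL5 upgrade (Anaconda)
        root (hd0,0)
        kernel /isolinux/vmlinuz ks=http://kickstart.example.com/ks/rhel5-upgrade.cfg ksdevice=eth0
        initrd /isolinux/initrd.img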

The big problem is where to have it locate the install media. Conventional wisdom says I'd get the most bang for the time spent by putting it right on that machine's hard drive. The only problem: Anaconda doesn't know how to talk LVM until after it loads stage2.img, and it needs access to the install media in order to load that. Almost all of these machines use LVM for the main partitions. In the end, I put an "allow from" line in the Apache config on our main kickstart server so the install tree could be reached from this machine's IP address over the Internet, and just loaded everything over the Internet. That slowed things down quite a bit, but it worked… almost.
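The Apache side was just a stanza like this (Apache 2.2 syntax; the directory path and the remote IP are placeholders):

    # On the kickstart server: let one outside machine reach the install
    # tree over the Internet.
    <Directory "/var/www/html/kickstart/rhel5">
        Order allow,deny
        # our internal networks
        Allow from 10.0.0.0/8
        # the remote machine's public IP (placeholder)
        Allow from 203.0.113.57
    </Directory>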

Once it got to the point of locating an existing installation to upgrade, it died with "Upgrading between major versions is not supported." Say what? Last year we upgraded a good couple dozen RHEL4 boxes to RHEL5 in place and it worked just fine (with a few caveats – a few things break, but they're easy to clean up in %post in your kickstart). Well, RHEL is currently at 5.3; it was at 5.1 when we did those earlier upgrades, and I'd long since discarded the 5.1 images from the kickstart server. So I downloaded the 5.1 DVD again, staged it on the kickstart server, adjusted the kickstart file to install 5.1 instead of 5.3, and set it off to do its thing again. This time, success. The machine is now upgraded to RHEL 5.1, and I just have to use yum to update it the rest of the way to RHEL 5.3: yum and glibc need to be upgraded first, then everything else.
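That last step is simple enough (assuming your yum repos or RHN channels already point at 5.3):

    # Update the updater and the core C library first, then everything else.
    yum -y update yum glibc
    yum -y update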

Since I know someone will try to point it out (someone always does): yes, it is generally better to cleanly install RHEL5 from scratch than to upgrade from RHEL4 in place (which is probably why Red Hat disabled it in the newer installers). Some situations make that impractical, like this one, where I have no PXE available and nobody local to the machine to flip CDs for me, not to mention the lengthy downtime if I had to restore a few dozen GB of backups remotely over the Internet after a clean reload. This is another one of those times that makes me wonder why Red Hat can't be as easy to upgrade as Ubuntu, which not only has a way to upgrade from one release to the next in place without even needing to boot directly into an installer, but fully supports and encourages that upgrade method. Red Hat has some learning to do here.
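For comparison, Ubuntu's entire supported in-place path is one command run from the live system:

    # Ubuntu's in-place release upgrade:
    sudo do-release-upgrade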

We only have 6 machines left on RHEL4 now, and one of those is a build machine that can just go away once we no longer have to support RHEL4.  Almost there. 🙂

Help from the community ensures a smooth download experience for Firefox 3.0.8

Most people reading this probably already know that Mozilla uses a network of volunteers hosting downloads of Firefox (and other Mozilla products).  All of these volunteer sites are listed in a database, each with a numeric weight that shows, relative to the other sites, how much traffic it can handle.  When you click the download link for Firefox on the mozilla.com site, you get sent to download.mozilla.org, which picks one of the sites from that list at random (the chance of a given site being picked is its weight divided by the sum of the weights of all of the available sites) and redirects you to that site to download the file.
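As a sketch, the selection rule looks like this (just an illustration of the idea, not the actual download.mozilla.org code):

    # Weighted random mirror selection: a site's chance of being picked is
    # its weight divided by the sum of all available sites' weights.
    import random

    def pick_mirror(mirrors):
        """mirrors: list of (url, weight) pairs for sites that are up
        and currently have the requested file."""
        total = sum(weight for _url, weight in mirrors)
        ticket = random.uniform(0, total)
        for url, weight in mirrors:
            ticket -= weight
            if ticket <= 0:
                return url
        return mirrors[-1][0]  # guard against floating-point edge cases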

When you think about the sheer number of people using Firefox these days (these 9-month-old stats say we have 60 million active daily users, and I'm sure it's grown since then), and Firefox's built-in application update functionality that notifies users a new version is available and installs it for them, a security update means at least 60 million downloads (mostly via the automatic update service) within the first 24 hours after release.  The amount of bandwidth required to host that many downloads in one day is staggering.  We don't have that much bandwidth available in Mozilla's datacenters, which is why we rely on our network of mirror sites for the downloads.  Each one of these sites may only be able to handle a small number of downloads, but when you add them all together, there's a lot more capacity than our datacenters have. 🙂
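For a rough sense of scale: assuming an average payload of around 8 MB per download (a guess on my part; partial updates are smaller, full installers bigger), 60 million downloads is roughly 480 TB in a day, which averages out to something like 44 Gbit/s sustained around the clock, and release-day traffic is anything but evenly spread.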

Recently, as the number of Firefox users has continued to grow, even our network of download mirror sites was starting to feel the pinch from the sheer volume of downloads during our security releases.  During both the Firefox 3.0.6 and 3.0.7 releases, we had to enact a throttling mechanism on the update service to slow the rate of download requests to a point where we weren't completely burying our volunteer download sites, many of which also host downloads for other open source projects besides Firefox.  When Firefox checked to see if there was an update available, a percentage of users were told there wasn't one, even though there really was.  If you manually picked "Check for Updates" from the Help menu, you always got it, though; only the automatic checks were throttled.  This mechanism is always our last resort.  When we have a security update, we want it in the end user's hands as quickly as possible.  Delaying it for a day for a percentage of users is completely counter to that goal, so we try everything to avoid having to use it.
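Conceptually the throttle is as simple as something like this on the update-server side (an illustration only; the real percentage is whatever we dial in, and the actual update service code certainly looks different):

    # Update throttling: only background checks are throttled; a manual
    # "Check for Updates" always gets the real answer.
    import random

    THROTTLE = 0.25  # fraction of background checks deferred (made-up value)

    def update_response(update_available, is_manual_check):
        if (update_available and not is_manual_check
                and random.random() < THROTTLE):
            return "no updates"  # deferred; the client will check again later
        return "update available" if update_available else "no updates"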

On Wednesday this last week, when the firedrill started that became the Firefox 3.0.8 release, I sent out an email to all of our download mirror admins warning them that Firefox 3.0.8 was imminent.  I also pointed out how we had needed to throttle updates during the 3.0.6 and 3.0.7 releases, and said I still didn't think we had enough capacity on the download mirror network to handle the release.  Paraphrased: "If you know anyone who's not mirroring Mozilla yet, and would like to, get them in touch with me."

The community came through with flying colors. In the 48 hours following that email, we increased the capacity of the download mirror network by more than half.  We came out of the peak traffic period of the first full day of Firefox 3.0.8 downloads about 6 to 8 hours ago, and I'm quite happy to report that we never had to throttle updates at all for the Firefox 3.0.8 release.  Every Firefox browser that checked in to see if there was an update available got its update notification.  Not only that, but I had mirror admins telling me on IRC during the peak traffic hours, "hey, my site can still handle more traffic, go ahead and bump my weight up some."  Quite a welcome change from all the reports of dead servers during the last two releases. 🙂

Now, to be fair, we did have one other thing going for us.  This release happened in the afternoon on a Friday.  This effectively splits most of the download traffic between the home users on Saturday and the business users this coming Monday.  Given the way we performed today, though, I’m pretty confident we could have handled the release happening on another weekday anyway.

But the weekend lull brings up one other point, raised on IRC by Mike Beltzner yesterday…  As I mentioned above, when we have a security update, the goal is to get it into the hands of the end user as fast as possible.  Mozilla's QA signed off on the release in the early morning hours on Friday, and the bits were pushed out to the download mirror sites shortly afterwards.  Enough of those mirrors had picked up the files by late morning to handle the normal release traffic, but we waited until mid-afternoon to release because of a long-standing tradition, born of capacity planning, of scheduling releases at the time of day that causes the fewest simultaneous downloads, to avoid overloading the mirror network.  Quite simply: the Pacific Ocean covers a lot of timezones.  Enough of them that there's a general lull in Internet traffic when it's daytime over the Pacific.  If we schedule the release to happen as the west coast of North America is going offline for the day, we start picking up traffic one timezone at a time as Asia and then eastern Europe come online for the new day.

In the interest of getting the update out to the end users as quickly as possible, wouldn’t it be great if we had enough capacity on our download mirror network that we didn’t have to wait for the lull in Internet traffic caused by the Pacific Ocean to do a release?  If we released in the early morning hours in the US, we’d have a large number of timezones online for the day at release time and would get a much larger percentage of the users in the first few hours after release.

So, how about it, community? You guys pulled off some awesomeness this last week. Let’s see if we can do a little bit more, and be able to handle a release without regard for the time of day it happens!  If you know anyone who might be willing to host downloads for us, have them check out our Mirroring Instructions Page.  All the info about how to set up and get included in the download pool is listed there.

Here are some numbers.  For a normal Firefox security update, I've been saying we need an availability rating of 35000 or higher to handle the release traffic; that number is the sum of the weights of the available mirrors that currently have the release files.  There's a loose perception that the number corresponds to the available download bandwidth in Mbit, but that's not really accurate, and it depends on a lot of factors.  During both the Firefox 3.0.6 and 3.0.7 releases, that number hovered around 26000 most of the time.  For most of the Firefox 3.0.8 release so far, it's been somewhere between 45000 and 55000.  I'm betting we could handle the traffic generated by a morning release if we kept that number above 65000 consistently.