I recently did up a diagram of how our Bugzilla site was set up, mostly for the benefit of other sysadmins trying to find the various pieces of it. Several folks expressed interest in sharing it with the community just to show an example of how we were set up. So I cleaned it up a little, and here it is:
At first glance it looks somewhat excessive just for a Bugzilla, but the Mozilla Project lives and dies by the content of this site: pretty much all work stops if it goes down, so it’s one of our highest-priority sites to keep operating at all times for developer support. The actual hardware required to run the site at full capacity for the number of users hitting it is a little less than half of what’s shown in the diagram.
We have the entire site set up in two different datacenters (SJC1 is our San Jose datacenter, PHX1 is our Phoenix datacenter). Thanks to the load balancers taking care of the cross-datacenter connections for the master databases, it’s actually possible to run it from both sites concurrently to split the load. But because of the amount of traffic Bugzilla does to the master databases, and the latency in connection setup over that distance, it’s a little bit slow from whichever datacenter isn’t currently hosting the master, so we’ve been trying to keep DNS pointed at just one of them to keep it speedy.
This still works great as a hot failover, though, which got tested in action this last Sunday when we had a system board failure on the master database server in Phoenix. Failing the entire site over to San Jose took only minutes, and the tech from HP showed up to swap the system board 4 hours later. The fun part was that I had only finished setting up this hot failover about a week prior, so the timing couldn’t have been better. If it had happened any sooner, we might have been down for a long time waiting for the server to get fixed.
When everything is operational, we try to keep it primarily hosted in Phoenix. As you can see in the diagram, the database servers in Phoenix use solid-state disks for the database storage. The speed improvement these give over traditional spinning disks when running large queries is just amazing. I haven’t done any actual timing to get hard numbers, but the difference is large enough that you can easily notice it just from using the site.
Most people reading this probably already know that Mozilla utilizes a network of volunteers hosting downloads of Firefox (and other Mozilla products). All of these volunteer sites are listed in a database with a numeric weight that shows, relative to the other sites in the database, how much traffic they can handle. When you click the download link for Firefox on the mozilla.com site, you get sent to download.mozilla.org, which picks one of the sites out of that list at random (the chance of a given site getting picked is its weight divided by the sum of the weights of all of the available sites) and redirects you to that site to download the file.
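The selection logic described above is simple weighted random choice. Here’s a minimal sketch of it in Python; the mirror names and weights are made up for illustration, and the real download.mozilla.org service obviously isn’t implemented exactly like this:

```python
import random

# Hypothetical mirror list: (hostname, weight) pairs.
# These names and numbers are invented for illustration.
mirrors = [("mirror-a.example.org", 100),
           ("mirror-b.example.org", 50),
           ("mirror-c.example.org", 25)]

def pick_mirror(mirrors):
    """Pick a mirror at random, weighted by its capacity rating.

    A mirror's chance of being chosen is its weight divided by
    the sum of the weights of all available mirrors.
    """
    total = sum(weight for _, weight in mirrors)
    point = random.uniform(0, total)
    for name, weight in mirrors:
        point -= weight
        if point <= 0:
            return name
    return mirrors[-1][0]  # guard against floating-point edge cases
```

Mirrors that are down or that haven’t picked up the files yet would simply be filtered out of the list before the pick, which is what keeps the weighting self-adjusting as sites come and go.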
Think about the sheer number of people using Firefox these days (these 9-month-old stats say we have 60 million active daily users, and I’m sure it’s grown since then), and add Firefox’s built-in application update functionality, which notifies users that a new version is available and installs it for them. That means when we release a security update, we’re going to see at least 60 million downloads of it (mostly via the automatic update service) within the first 24 hours after release. The amount of bandwidth required to host that many downloads in one day is staggering. We don’t have that much bandwidth available in Mozilla’s datacenters, which is why we rely on our network of mirror sites for the downloads. Each one of these sites may only be able to handle a small number of downloads, but when you add them all together, there’s a lot more capacity than our datacenters have.
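To put a rough number on “staggering”, here’s a back-of-envelope calculation. The 60 million figure comes from the stats above; the 10 MB average download size is purely my assumption for the sake of the math (full installers are bigger, partial updates smaller):

```python
# Back-of-envelope: sustained bandwidth for 60 million downloads in a day.
daily_downloads = 60_000_000
avg_download_mb = 10          # assumed average size in megabytes
seconds_per_day = 24 * 60 * 60

total_bits = daily_downloads * avg_download_mb * 8 * 1_000_000
avg_gbit_per_sec = total_bits / seconds_per_day / 1_000_000_000
print(f"average sustained bandwidth: {avg_gbit_per_sec:.1f} Gbit/s")
```

Under those assumptions that works out to more than 55 Gbit/s sustained for the whole day, and the real peak would be far higher since the downloads aren’t spread evenly.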
Recently, as the number of Firefox users continues to grow, even our network of download mirror sites was starting to feel the pinch from the sheer volume of downloads during our security releases. During both the Firefox 3.0.6 and 3.0.7 releases, we ended up having to enact a throttling mechanism on the update service to slow the rate of download requests to a point where we weren’t completely burying all of our volunteer download sites, many of which also host downloads for other open source projects besides Firefox. When Firefox checked to see if there was an update available, a percentage of users were told there wasn’t one, even though there really was. If you manually picked “Check for Updates” from the Help menu, you always got it; it was only the automatic checks that were throttled. This mechanism is always our last resort. When we have a security update, we want it in the end user’s hands as quickly as possible. Delaying it for a day for a percentage of users is completely counter to that goal, so we try everything to avoid having to use it.
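A throttling scheme like the one described can be sketched in a few lines. The function name and the 25% rate here are purely illustrative, not the actual service code:

```python
import random

THROTTLE_RATE = 0.25  # assumed fraction of automatic checks to defer

def update_available(update_exists, manual_check, throttle_rate=THROTTLE_RATE):
    """Decide what to tell a client checking for updates.

    Manual 'Check for Updates' requests always see the real answer;
    only the background automatic checks are throttled.
    """
    if not update_exists:
        return False
    if manual_check:
        return True
    # Defer a fraction of automatic checks to spread load on the mirrors.
    return random.random() >= throttle_rate
```

Since clients check again periodically, a deferred client just picks up the update on a later automatic check, which is how this spreads the load over time instead of refusing it outright.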
On Wednesday this last week, when the fire drill that became the Firefox 3.0.8 release started, I sent an email to all of our download mirror admins warning them that Firefox 3.0.8 was imminent. I also pointed out how we had ended up needing to throttle updates during the 3.0.6 and 3.0.7 releases, and said I still didn’t think we had enough capacity on the download mirror network to handle the release. Paraphrased: “If you know anyone who’s not mirroring Mozilla yet, and would like to, get them in touch with me.”
The community came through with flying colors. In the 48 hours following that email, we increased the capacity of the download mirror network by more than half. We left the peak traffic period of the first full day of Firefox 3.0.8 downloads about 6 to 8 hours ago, and I’m quite happy to report that we never had to throttle the updates at all for the Firefox 3.0.8 release. Every Firefox browser that checked in to see if there was an update available got its update notification. Not only that, but I had mirror admins telling me on IRC during the peak traffic hours, “hey, my site can still handle more traffic, go ahead and bump my weight up some.” Quite a welcome change from all the reports of dead servers during the last two releases.
Now, to be fair, we did have one other thing going for us. This release happened in the afternoon on a Friday. This effectively splits most of the download traffic between the home users on Saturday and the business users this coming Monday. Given the way we performed today, though, I’m pretty confident we could have handled the release happening on another weekday anyway.
But the weekend lull brings up one other point that was raised on IRC by Mike Beltzner yesterday… As I mentioned above, when we have a security update, the goal is to get it into the hands of the end user as fast as possible. Mozilla’s QA signed off on the release in the early morning hours on Friday, and the bits were pushed out to the download mirror sites shortly afterwards. Enough of those mirrors had picked up the files by late morning to handle the normal release traffic, but we still had to wait until mid-afternoon to release, because of a long-standing capacity-planning tradition: schedule the release at the time of day that causes the fewest simultaneous downloads, to avoid overloading the download mirror network. Quite simply, the Pacific Ocean covers a lot of timezones. Enough of them that there’s a general lull in internet traffic when it’s daytime over the Pacific. If we schedule the release to happen as the west coast of North America is going offline for the day, then we start picking up traffic one timezone at a time as Asia and then eastern Europe come online for the new day.
In the interest of getting the update out to the end users as quickly as possible, wouldn’t it be great if we had enough capacity on our download mirror network that we didn’t have to wait for the lull in Internet traffic caused by the Pacific Ocean to do a release? If we released in the early morning hours in the US, we’d have a large number of timezones online for the day at release time and would get a much larger percentage of the users in the first few hours after release.
So, how about it, community? You guys pulled off some awesomeness this last week. Let’s see if we can do a little bit more, and be able to handle a release without regard for the time of day it happens! If you know anyone who might be willing to host downloads for us, have them check out our Mirroring Instructions Page. All the info about how to set up and get included in the download pool is listed there.
Here are some numbers: for a normal Firefox security update, I’ve been saying we need an availability rating of 35000 or higher to handle the release traffic. That number is the sum of the weights of the available mirrors that currently have the release files. There’s a loose perception that it maps to the amount of download bandwidth available in Mbit, but that’s not really accurate and depends on a lot of factors. During both the Firefox 3.0.6 and 3.0.7 releases, that number was hovering around 26000 most of the time. For most of the Firefox 3.0.8 release so far, it’s been somewhere between 45000 and 55000. I’m betting we could probably handle the traffic generated by a morning release if we got that above 65000 consistently.
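In other words, the availability rating is just a filtered sum over the mirror list: count a mirror’s weight only if it’s up and has synced the release files. A toy illustration, with entirely made-up mirror entries (the 35000 threshold is the real figure from above):

```python
RELEASE_THRESHOLD = 35000  # rating needed for a normal security update

# (weight, is_up, has_release_files) - hypothetical entries
mirrors = [
    (20000, True, True),
    (15000, True, True),
    (10000, True, False),   # hasn't synced the new files yet
    (12000, False, True),   # currently down
]

availability = sum(w for w, up, has_files in mirrors if up and has_files)
print(availability, "vs threshold", RELEASE_THRESHOLD)
```

This is also why the number swings around during a release: mirrors drop out of the sum the moment they go down or fall behind on syncing, and come back in as soon as they recover.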
I’m very much a power user. I use my web browser constantly, for both work and play, and get extensive use out of tabbed browsing. I keep web pages that are related to the same task in tabs in the same window, and open a new window when I’m shifting gears to work on another task. And I often go back and forth between tasks as things come up or I need a break from the routine, or whatever, so eventually I wind up with a situation like right now where I have 12 windows open, and about half of them have 5 or 10 tabs in them.
Now I’m looking for a specific tab, and it was sort of a one-off thing, and I don’t remember which window it’s in. And it’s not the frontmost tab in the window it’s in, so I can’t just look in the Window menu to find it.
Now I’m thinking it would be really cool if the Window menu had submenus for each window that had multiple tabs in it, which listed the tabs in that window. Then I could just mouse over the windows in the Window menu and glance through the submenus looking for it. Bug 405933 filed.