The hardware behind

I recently did up a diagram of how our Bugzilla site was set up, mostly for the benefit of other sysadmins trying to find the various pieces of it.  Several folks expressed interest in sharing it with the community just to show an example of how we were set up.  So I cleaned it up a little, and here it is:

Bugzilla Physical Layout Diagram
Click the image for a full-size (readable) version

At first glance it looks somewhat excessive just for a Bugzilla, but the Mozilla Project lives and dies by the content of this site: pretty much all work stops if it goes down.  That makes it one of our highest-priority sites to keep operating at all times for developer support.  The actual hardware required to run the site at full capacity for the number of users we get hitting it is a little less than half of what's shown in the diagram.

We have the entire site set up in two different datacenters (SJC1 is our San Jose datacenter, PHX1 is our Phoenix datacenter).  Thanks to the load balancers taking care of the cross-datacenter connections for the master databases, it’s actually possible to run it from both sites concurrently to split the load.  But because of the amount of traffic Bugzilla does to the master databases, and the latency in connection setup over that distance, it’s a little bit slow from whichever datacenter isn’t currently hosting the master, so we’ve been trying to keep DNS pointed at just one of them to keep it speedy.

This still works great as a hot failover, though, which got tested in action this last Sunday when we had a system board failure on the master database server in Phoenix.  Failing the entire site over to San Jose took only minutes, and the tech from HP showed up to swap the system board 4 hours later.  The fun part was that I had only finished setting up this hot failover about a week prior, so the timing couldn't have been any better.  If it had happened any sooner we might have been down for a long time waiting for the server to get fixed.

When everything is operational, we're trying to keep it primarily hosted in Phoenix.  As you can see in the diagram, the database servers in Phoenix are using solid-state disks for the database storage.  The speed improvement gained by using these instead of traditional spinning disks when running large queries is just amazing.  I haven't done any actual timing to get hard facts on that, but the difference is large enough that you can easily notice it just from using the site.

Android application wish list: monitor phone status via a web page served from the phone

I have an HTC Evo running Android 2.2 (Froyo).  I absolutely love the thing.  The only downside is the short battery life.  One really cool feature it has is the WiFi HotSpot utility, which lets you turn the phone into a WiFi access point, sharing the 3G or 4G internet connection to the wifi.  It turns out that the Evo has a really good WiFi antenna in it, too.  It has better range as an access point than my Linksys WRT54G at home (even without the fancy antennas sticking out of the back).  This really comes in handy when I’m at my parents’ house up north in the boonies, where there is no broadband internet available, and the cell coverage is spotty at best.  I have to wander the house in search of a spot that manages to maintain a signal long enough to do anything useful with it.  Gone are the days of trying to hold your laptop in some weird spot for the data card to get enough signal.  Now once I find a spot where the phone can keep a signal, I can just leave the phone there, and go wherever I want with the laptop to sit in comfort and use it via the WiFi hotspot.

Now, being that I’m out in the boonies, even the best spot in the house to leave the phone at still has a spotty connection that comes and goes.  Staying somewhere within visual range of the phone so I can watch the signal strength display (and going and waking the phone back up when it puts the screen to sleep) is a pain.  I’d love it if there was an app I could run which would serve up the current phone signal strength and data status over http on the phone’s internal wifi interface (even if it’s on some random port, since you can’t use port 80 without rooting the phone).  For the variety of stuff that’s in the marketplace, it would surprise me if such a thing didn’t already exist.  But it’s also hard enough to describe that any keywords I can think of to search with give me several dozen unrelated hits.  So, Lazyweb, anyone know of such a thing?


Upgrading to version 3.4.3

We’re finally at the point where I can say we’re ready to upgrade Bugzilla @ Mozilla this weekend.  We’re aiming for Sunday evening (probably 6pm PST).  I’ll post again when I know how long it’ll be down for (and that’ll be included in the eventual downtime notice on the IT blog as well).

There’s a staging copy set up at and I would appreciate people playing around with it and finding anything that might be broken before we get it to production.  Before filing bugs, make sure to check the detailed status linked from the red box at the top of every page to make sure it’s not already listed (and you can also see my progress on cosmetic issues and so forth, there).

It will be down for a while at some point tonight when I reload it with an up-to-date snapshot of the production server (and that’ll be my test to find out how long it’ll take to upgrade it, too).  I’m super excited because this has been a long time coming. 🙂

Upgrading from RHEL4 to RHEL5 without a CD or a PXE server

At Mozilla, since our server farm has gotten so big, we’ve gotten reloading and upgrading machines kind of down to a science. We have a PXE server in each colo, and installations and upgrades run over the network. It’s a great system until you get to one of the following two situations:

  1. The machine you need to upgrade is in someone else’s colo, with no local PXE server you have control over and no one on-site to do CD flipping for you. -or-
  2. The machine you need to upgrade *is* the PXE server.

We have situation #2 coming up soon, but I had a machine with situation #1 that I recently experimented on to get this all working without PXE or a CD.

We had a major push to get the entire RHEL4 portion of our infrastructure either reloaded or upgraded to RHEL5 about a year ago. Several machines didn't get upgraded at the time, for one of a few reasons: they were due to be replaced soon anyway, they ran the RHN Proxies (RHN Proxy didn't yet run on RHEL5), they ran software that didn't yet work on RHEL5 (Zimbra with clustering support, in our case), or they were in remote locations with no PXE and nobody to flip CDs.

RHN has a remote kickstart capability. I’ve experimented with it before, but never had too much success getting it working. I probably just didn’t play enough. But going on the general concept of how it worked, I discovered it was quite easy to get booted into the Anaconda installer from grub… Just copy the isolinux directory off the CD/DVD into /boot, and set up an entry in grub.conf for it that looked very similar to the one in pxelinux.cfg on the PXE server.
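
The gist of it, as a sketch (the device names, paths, and kickstart URL here are placeholders, not our actual setup):

```
# Copy the installer kernel/initrd off the mounted RHEL media:
#   cp -r /mnt/cdrom/isolinux /boot/isolinux
#
# Then add a stanza like this to /boot/grub/grub.conf, modeled on the
# corresponding entry in pxelinux.cfg on a PXE server:
title RHEL 5 installer
        root (hd0,0)
        kernel /isolinux/vmlinuz ks=http://kickstart.example.com/rhel5-ks.cfg ksdevice=eth0
        initrd /isolinux/initrd.img
```

Reboot, pick that entry from the grub menu, and you're in Anaconda with no CD and no PXE involved.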

The big problem comes with where to have it locate the install media. Conventional wisdom says I’ll get the most bang for the time spent if I put it right on that machine’s hard drive. Only problem: Anaconda doesn’t know how to talk LVM until after it gets stage2.img loaded, and it needs to already have access to the install media to load that. Almost all of these machines have LVM for the main partitions. In the end, I ended up putting an “allow from” line in the apache config on our main kickstart server to allow it to be accessed from this machine’s IP address via the Internet, and just loaded everything over the Internet. Slowed it down quite a bit, but it worked… almost.
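
The Apache change was just a scoped exception on the kickstart server, something along these lines (Apache 2.2 syntax; the directory path and addresses are placeholders):

```
<Directory "/var/www/html/kickstart/rhel5">
    Order deny,allow
    # Normally only the internal networks can reach the install tree...
    Deny from all
    Allow from 10.0.0.0/8
    # ...plus a temporary exception for the remote machine's public IP:
    Allow from 203.0.113.42
</Directory>
```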

Once it got to the point of locating an existing installation to upgrade, it died with “Upgrading between major versions is not supported.” Say what? Last year we upgraded a good couple dozen RHEL4 boxes to RHEL5 in place and it worked just fine (with a few caveats – some things break, but they're easy to clean up in %post in your kickstart). Well, RHEL is currently at 5.3; it was 5.1 when we did it before. I'd long since discarded the 5.1 images off the kickstart server. So I downloaded the 5.1 DVD again, staged it on the kickstart server, adjusted the kickstart file to install 5.1 instead of 5.3, and set it off to do its thing again. This time, success. The machine is now upgraded to RHEL 5.1. Now I just have to use yum to update it the rest of the way to RHEL 5.3: yum and glibc need to be upgraded first, then everything else.
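
The remaining 5.1-to-5.3 hop is just a staged yum run, roughly:

```
# Update the updater itself and glibc first, then everything else:
yum update -y yum glibc
yum update -y
```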

Since I know someone will try to point it out (someone always does), yes, it is generally better to cleanly install RHEL5 from scratch than to try to upgrade from RHEL4 in place (this is probably why Red Hat disabled it on the newer installers). Some situations make that not easy (like this one, where I have no PXE available and nobody local to the machine to flip CDs for me, not to mention the lengthy amount of time the machine would be down if I had to restore a few dozen GB of backups remotely over the Internet after a clean reload). This is another one of those times that makes me wonder why Red Hat can't be as easy to upgrade as Ubuntu, which not only has a way to upgrade from one release to the next in place without even needing to boot directly into an installer, but fully supports and encourages that upgrade method. Red Hat has some learning to do here.

We only have 6 machines left on RHEL4 now, and one of those is a build machine that can just go away once we’re no longer having to support RHEL4.  Almost there. 🙂

Help from the community ensures a smooth download experience for Firefox 3.0.8

Most people reading this probably already know that Mozilla utilizes a network of volunteers hosting downloads of Firefox (and other Mozilla products).  All of these volunteer sites are listed in a database with a numeric weight that shows, relative to the other sites in the database, how much traffic they can handle.  When you click the download link for Firefox on the site, you get sent to, which picks one of the sites out of that list at random (the chance of a given site getting picked is its weight divided by the sum of the weights of all of the available sites) and redirects you to that site to download the file.
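
That weighted random pick is simple to sketch in code; the mirror names and weights below are made up for illustration:

```python
import random

def pick_mirror(mirrors):
    """Pick a mirror at random, where each mirror's chance of being
    chosen is its weight divided by the sum of all the weights."""
    total = sum(weight for _, weight in mirrors)
    point = random.uniform(0, total)
    for name, weight in mirrors:
        if point <= weight:
            return name
        point -= weight
    return mirrors[-1][0]  # guard against floating-point edge cases

# Hypothetical mirror list: (hostname, weight)
mirrors = [("mirror-a.example.org", 100),
           ("mirror-b.example.org", 300),
           ("mirror-c.example.org", 600)]

# mirror-c should be chosen about 60% of the time (600 / 1000)
print(pick_mirror(mirrors))
```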

When you think about the sheer number of people using Firefox these days (These 9-month-old stats say we have 60 million active daily users – I’m sure it’s probably grown since then), and Firefox’s built-in application update functionality that notifies users that a new version is available and installs it for them, that means when we release a security update, we’re going to have at least 60 million downloads of it (mostly via the automatic update service) within that first 24 hours after release.  The amount of bandwidth required to host that many downloads in one day is staggering.  We don’t have that much bandwidth available in Mozilla’s datacenters, which is why we rely on our network of mirror sites for the downloads.  Each one of these sites may only be able to handle a small number of downloads, but when you add them all together, there’s a lot more capacity than our datacenters have. 🙂

Recently, as the number of Firefox users continues to grow, even our network of download mirror sites was starting to feel the pinch from the sheer volume of downloads during our security releases.  During both the Firefox 3.0.6 and 3.0.7 releases, we ended up having to enact a throttling mechanism on the update service in order to slow down the number of downloads being requested to a point where we weren’t completely burying all of our volunteer download sites, many of whom also host downloads for other open source projects besides Firefox.  When a Firefox application would check to see if there’s an update available, a percentage of users were told there wasn’t one, even though there really was.  If you manually picked “Check for Updates” from the Help menu, you always got it, though.  It was only the automatic checks that were throttled.  This mechanism is always our last resort.  When we have a security update, we want it in the end-user’s hands as quickly as possible. Delaying it for a day for a percentage of users is completely counter to that goal, and so we try everything to avoid having to use it.
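
Mechanically, the throttle is just a biased coin flip applied only to automatic checks. Here's a minimal sketch (the function name and rate are illustrative, not our actual implementation):

```python
import random

def update_available(is_manual_check, throttle_rate=0.4):
    """Decide whether to tell this client an update exists.
    Manual "Check for Updates" requests always get the truth;
    a throttle_rate fraction of automatic checks are told "no"
    to spread the download load out over time."""
    if is_manual_check:
        return True
    return random.random() >= throttle_rate

# Manual checks are never throttled:
print(update_available(True))
```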

On Wednesday this last week, when the firedrill started that became the Firefox 3.0.8 release, I sent out an email to all of our download mirror admins, warning them that Firefox 3.0.8 was imminent.  I also pointed out how we had ended up needing to throttle updates during the 3.0.6 and 3.0.7 releases, and saying I still didn’t think we had enough capacity on the download mirror network to handle the release.  Paraphrased, “If you know anyone who’s not mirroring Mozilla yet, and would like to, get them in touch with me.”

The community came through with flying colors. In the 48 hours following that email, we increased the capacity of the download mirror network by more than half.  We left the peak traffic period of the first full day of Firefox 3.0.8 downloads about 6 to 8 hours ago, and I'm quite happy to report that we never had to throttle the updates at all for the Firefox 3.0.8 release.  Every Firefox browser that checked in to see if there was an update available got its update notification.  Not only that, but I had mirror admins telling me on IRC during the peak traffic hours “hey, my site can still handle more traffic, go ahead and bump my weight up some.”  Quite a welcome change from all the reports of dead servers during the last two releases. 🙂

Now, to be fair, we did have one other thing going for us.  This release happened in the afternoon on a Friday.  This effectively splits most of the download traffic between the home users on Saturday and the business users this coming Monday.  Given the way we performed today, though, I’m pretty confident we could have handled the release happening on another weekday anyway.

But, the weekend lull brings up one other point that was raised on IRC by Mike Beltzner yesterday…  As I mentioned above, when we have a security update, the goal is to get it into the hands of the end user as fast as possible.  Mozilla’s QA signed off on the release in the early morning hours on Friday.  The bits were released out to the download mirror sites shortly afterwards.  Enough of those mirror sites had picked up the files by late morning to handle the normal release traffic, but we had to wait until mid-afternoon to release because of a long-standing tradition based on capacity planning to schedule the release at a time of day that will cause the fewest simultaneous downloads to avoid overloading the download mirror network.  Quite simply: the Pacific Ocean covers a lot of timezones.  Enough of them that there’s a general lull in internet traffic when it’s daytime hours over the Pacific Ocean.  If we schedule the release to happen as the west coast of North America is going offline for the day, then we start picking up traffic one timezone at a time as Asia and then eastern Europe start coming online for the new day.

In the interest of getting the update out to the end users as quickly as possible, wouldn’t it be great if we had enough capacity on our download mirror network that we didn’t have to wait for the lull in Internet traffic caused by the Pacific Ocean to do a release?  If we released in the early morning hours in the US, we’d have a large number of timezones online for the day at release time and would get a much larger percentage of the users in the first few hours after release.

So, how about it, community? You guys pulled off some awesomeness this last week. Let’s see if we can do a little bit more, and be able to handle a release without regard for the time of day it happens!  If you know anyone who might be willing to host downloads for us, have them check out our Mirroring Instructions Page.  All the info about how to set up and get included in the download pool is listed there.

Here’s some numbers:  For a normal Firefox security update, I’ve been saying that we need an availability rating of 35000 or higher to handle the release traffic.  That number is the sum of the weights of the available mirrors that currently have the release files.  There’s a loose perception of that number being tied to the amount of available Mbit of download bandwidth, but that’s not really accurate, and depends on a lot of factors.  During both of the Firefox 3.0.6 and 3.0.7 releases, that number was hovering around 26000 most of the time.  For most of the Firefox 3.0.8 release so far, it’s been somewhere between 45000 and 55000.  I’m betting we could probably handle the traffic that would be generated by a morning release if we got that above 65000 consistently.

Replacing Google Groups for Mozilla Newsgroups?

This article is a repost of a previous article, but since I didn’t get very many responses, I figured I’d try again with a more attention-grabbing headline. 🙂

Does anyone know of any decent web interfaces for NNTP out there? Preferably open source that we could host ourselves. There appear to be a LOT of them, so what I’m really asking is, which of all of those are actually any good, and would work for what we need? 🙂 (or be close enough that we could modify it to get the rest of the way there easily)

Currently, Mozilla’s newsgroups are gatewayed to Google Groups, so we can use that as the web interface. Unfortunately, we’ve had continuous problems with spam originating via Google Groups, and there’s very little we can do about it. Google’s policies prevent messages from being deleted unless there’s legal violations (i.e. DMCA notices), so we can’t clean up after it, and as much as they try to fight the spam from happening in the first place, they’re a big target. For the sanity of our newsgroups, we really need to move elsewhere, and hosting it ourselves would really make our lives a lot easier.

Related: bug 425122

Web interfaces for NNTP

Does anyone know of any decent web interfaces for NNTP out there?  Preferably open source that we could host ourselves.  There appear to be a LOT of them, so what I’m really asking is, which of all of those are actually any good, and would work for what we need? 🙂  (or be close enough that we could modify it to get the rest of the way there easily)

Currently, Mozilla’s newsgroups are gatewayed to Google Groups, so we can use that as the web interface.  Unfortunately, we’ve had continuous problems with spam originating via Google Groups, and there’s very little we can do about it.  Google’s policies prevent messages from being deleted unless there’s legal violations (i.e. DMCA notices), so we can’t clean up after it, and as much as they try to fight the spam from happening in the first place, they’re a big target.  For the sanity of our newsgroups, we really need to move elsewhere, and hosting it ourselves would really make our lives a lot easier.

Related: bug 425122

Seven Things

So yeah, I got tagged for this by both Eric Shepherd and Sean Alamares.

Ground rules:
1. Link to your original tagger(s) and list these rules in your post.
2. Share seven facts about yourself in the post.
3. Tag seven people at the end of your post by leaving their names and the links to their blogs.
4. Let them know they’ve been tagged.

On to the seven things you may or may not have known about me:

1. I grew up as the son of a United Methodist pastor.  So yeah, that makes me a PK.  Somehow I managed to avoid falling into either stereotype of that situation (I knew several people who fit one or the other of them though).  United Methodist pastors typically get moved around between churches every few years.  Most of the time, my dad managed to stay put longer than most, so I only ever moved twice with my family before moving out on my own, once in the middle of Kindergarten, and the other time in the middle of 8th grade.  I would never recommend moving your kids in the middle of a school year.  Just don't.  But I survived. 🙂

2. With the exception of two exchange programs that I participated in, I've never lived outside of the state of Michigan.  It's a great place to live, when the economy doesn't suck.  Michigan is currently the only state in the U.S. with a double-digit unemployment rate (10.6% for December 2008).  We can thank the failing auto industry for that.  I count myself very fortunate right now that I work for a company that's still doing well despite the recession.

3. One of the above-mentioned exchange programs was a pastoral exchange when I was 12 years old.  My dad traded churches (and parsonages) with a pastor in Fleetwood, Lancashire, in the UK, for 6 weeks. We went and lived in his house, and he came and lived in ours. It was a pretty fun experience, and the first (and last) time I’ve ever been to a salt water beach.  The tide pools and miles of sand when the tide was out were quite fascinating.

4. The other of the above-mentioned exchange programs was a student exchange just after I graduated from high school.  I went and stayed with a host family in Concepción, Chile, for 2 months.  Yes, that was also on the ocean.  No, I never went to the beach while I was there.  It was the middle of the winter and too cold. 🙂  I had a tremendous amount of fun while I was there, and I didn’t want to come back.  The exchange organization that I had gone through for the exchange also had 6-month and 1-year programs in addition to the 2-month program, and I almost managed to get it extended to 6 months.  The only thing that stopped it from happening is my parents had already paid my tuition for the fall semester at college.  I was tremendously shy as a kid, and never had very many friends, mostly because I was too shy to make them.  I wholeheartedly credit this trip with bringing me out of my shell. 🙂

5. I met my wife while hiding in a dormitory basement with 80 other people during a tornado warning 3 weeks into that fall semester my freshman year at college.  I guess it’s a good thing I did come back from Chile when I did. 😉  It was about a year later before we were seriously dating though, and several months after that before we decided to get married.  We held off until after she graduated to get married.  A few months from now we’ll have been married for 15 years, and I love her now more than ever.  We have 2 children, who are now in 1st and 4th grades, and are absolute joys to be parents of… most of the time. 🙂

6. I never graduated from college.  I was working toward a Computer Science degree, and the computer area at Adrian College was pretty much falling apart around my Junior year, for both a lack of qualified faculty and limited number of participating students.  All of the computer classes there were considered part of the Math department at the time.  This wasn't exactly a good fit.  They had three professors there who actually knew what they were doing with computers.  One of them was the chairman of the Political Science department (and thus only taught one or two computer classes).  One of them was the chairman of the Chemistry department (and thus only taught one or two computer classes).  The third was actually full time in the computer department, but was a native of India, and didn't have a very good command of the English language, so you couldn't understand anything he lectured about.  The remaining professors were all math teachers, and didn't really understand computers well.  I understand that they split computers off to its own department and had a huge push on modernizing it with equipment and qualified faculty not long after I left, but it was already too late for me.  Also adding to the mix, I had gone in with a friend on an off-campus apartment, hoping to get cheaper housing.  My roommate ended up backing out on it after the lease had been signed, so I got stuck with the apartment by myself (which was no longer cheaper as a result).  This meant I had to go get a real job (rather than just a student job on campus) to pay for the rent, and homework of course suffered, and eventually there was no point in continuing school.  After we got married, Lori moved into that apartment with me.  But as strange as it seems, that lowly job working in the hardware department at the local Meijer store did actually lead to a career working with computers.  It took seven years to get there, working a little way up and down the chain within Meijer, but it did.
There’s enough meat there for a whole other blog post (or you can just go read my bio on the About page linked at the top, most of it’s in there 🙂 )

7. I’m a huge fan of Asian media, mostly anime.  My taste is mostly in high school dramas, fantasy, scifi, magical girls, and slice-of-life stuff.  I tend to avoid mecha (which is what most people think of when they think of anime for some reason) and Naruto-style stuff.  My current favorites (minus a few) are listed over on the right on my blog.  You can find more (and some of the older stuff) if you dig around in the Anime category on my blog.  The Anime industry is in the middle of a huge upheaval right now, with many of the publishers starting to catch on to online distribution.  Personally I think it’s a great time to be a fan…  having more and more places to go to get good shows right from the publishers.

Tag, you’re it!

Actually, after looking around a little, I can't find anyone with a blog who hasn't already been tagged for this, so I guess it's time to let it die.  It tends to get out-of-hand if you let it grow loosely anyway.  I've seen this meme going around Facebook listing both 16 and 25 as the number of things and people to list.  Consider yourself fortunate that the Mozilla community managed to keep it at 7. 🙂

Serving AppleShare from RHEL5 with Netatalk 2.0.3

So I was recently trying to set up a fileshare in one of our offices and trying to get it visible to the filesharing stuff in Mac OS X, since several people in the office have Mac laptops.  The original thought (since it’s supposedly better-supported on Linux) was to set up Samba, but our authentication in the office is all LDAP based, and I gave up trying to get Samba to work with our LDAP server after a few days.  Samba seems to want complete control over your LDAP server, and won’t deal with a read-only one that just happens to have all the Samba auth info in it already.  This seems wrong, and I’m sure there’s a way to do it, but I sure couldn’t find any documentation to tell me how.

So then I thought maybe I’d try Netatalk.  None of the usual packaging repos seemed to carry a netatalk RPM, but I did find one for Netatalk 2.0.3 in Fedora 8.  I took the SRPM from that and rebuilt it on my RHEL5 server.  Then I went about trying to configure it.  Turns out the documentation for Netatalk SUCKS ROCKS.  Everything I could find was written in 1998 and last touched in 2002 or so, and there’s been several new versions of Netatalk since then.  When all was said and done, the configuration part turned out to be really easy, you just couldn’t figure it out from the docs.
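
For the curious, a working Netatalk 2.0 setup boils down to two small files.  This is an illustrative sketch, not our actual config; the share path, group name, and option choices are placeholders:

```
# /etc/netatalk/afpd.conf -- one line of server options:
- -transall -uamlist uams_clrtxt.so,uams_dhx.so -nosavepassword

# /etc/netatalk/AppleVolumes.default -- one share per line:
# path          display name       options
/srv/office     "Office Share"     allow:@staff cnidscheme:cdb
~               "Home Directory"
```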

I did find a tutorial for setting up Netatalk for TimeMachine on Ubuntu, which turned out to be incredibly helpful.  So my main reason for blogging about this is to help that tutorial get some more pagerank, since it wasn’t nearly high enough in the search results on Google. 🙂

So without further ado, here’s the Netatalk How-to for Ubuntu that I found. update

On Friday, I pushed a small update to that fixed bug 452799, where users who didn’t have ‘canconfirm’ privs in Bugzilla were posting bugs that had a status of NEW rather than UNCONFIRMED.

This morning, I pushed an update to containing a plethora of additional fixes to address concerns raised since the Bugzilla upgrade.  This morning, we’ve picked up fixes for:

  • Bug 452793: (The other half of the issue which was fixed Friday) The default status selected when you file a new bug and do have ‘canconfirm’ privs is now NEW instead of UNCONFIRMED.
  • Bug 452810: The wording surrounding the checkbox to add yourself to the CC now says “Add me to the CC list” when you aren’t on it, instead of just “myself.”
  • Bug 452734: The keyword chooser has been replaced with keyword autocomplete.  NOTE: If you installed the greasemonkey script to remove the keyword chooser, you’ll probably have to remove that script to get the autocomplete, since it hooks on the same event listener.
  • Bug 452798: The CC list is now visible again by default, and as a bonus, it’s now searchable via Firefox’s find-as-you-type feature.
  • Bug 452733: The [Classification] is no longer shown in front of the bug summary.
  • Bug 452746: The link to the bug in the header no longer contains an extra space.
  • Bug 452891: The “visually jarring” dashed border next to the line numbers in the Diff Viewer has been removed.
  • Bug 452749: The midair page once again specifies who you midaired with.
  • Bug 344559: Add a Commit button near the form fields at the top of the show_bug page so you don’t have to scroll to the bottom of the comments if you’re only changing a field at the top.

Fixes for admins:

  • Bug 452898: Milestones can once again be marked inactive.
  • Bug 452914: Multiple problems were fixed in the flag editor related to the “fixed in version” field not being dealt with correctly on a product change.

Hopefully this fixes up some of the more major concerns people had.  There’s still more to come.  At this point I’m planning on daily pushes to production as the fixes become available.

UPDATE: Some people are reporting broken CSS and things looking strange…  hold the Shift key and hit Reload if that’s you.  Your browser is probably caching the old CSS.