We’re hoping to get some Bugzilla folks hanging out at the Mozilla booth at OSCON this year. Mozilla is blocking off a number of two-hour slots throughout Wednesday and Thursday, during which they’ll be advertising specific topics to be discussed at the booth. We’ll be doing at least one of those, and would like as many Bugzilla folks as possible there during that time slot (probably 3:30 – 5:30pm on Thursday, but yet to be determined). It would also be nice to generally have at least one Bugzilla person there throughout both days (it doesn’t have to be the same person the entire time 😉 ). If you’re planning to be at OSCON and are willing to help staff the booth, let me know.
So yesterday afternoon, Alex Faaborg blogged about some new features in Firefox 3. No big deal until it got posted on digg.com. The blog server could take it: it sits in a load-balancing cluster behind a caching proxy server, which doesn’t even notice this kind of traffic. But Alex had posted his images in his personal space on people.mozilla.com, a single server that isn’t really considered production critical on IT’s priority list. Even so, it’s not exactly sucky hardware; the machine should have been more than capable of handling a slashdotting and getting dugg at the same time. So we were all pretty surprised when it fell over.
Apache kept dying, spitting out errors about failing to setuid to the apache user. After much banging of heads, Justin Dolske found a relevant post in one of the Gentoo forums, of all places, which pointed the finger at per-user process limits and suggested using ulimit in the initscript to override them. Using ulimit turned out not to be necessary, but it did get me looking in the right places.
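For reference, the Gentoo forum workaround amounts to something like this hypothetical initscript excerpt (the script path, limit value, and httpd invocation here are illustrative assumptions, not what we actually deployed):

```shell
#!/bin/sh
# Hypothetical excerpt from an Apache initscript: raise the per-user
# process limit in the shell that launches httpd, so the daemon isn't
# capped by the pam_limits defaults. 4096 is an arbitrary example value.
ulimit -u 4096
exec /usr/sbin/httpd -k start
```

Since the initscript runs as root, the raised limit is inherited by the httpd children it spawns, which is why this masks the problem even without touching the PAM configuration.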
Mozilla employees get shell accounts on people.mozilla.com (it makes it easier for them to manage their webspace there, and several folks use it to run irssi in screen to keep a session to irc.mozilla.org open). To keep users from bogging down the machine, we had used pam_limits to limit logins to 100 processes per user in /etc/security/limits.conf. Well, it turns out that this limit applies to root and apache as well. So when apache spawned its 100th process to handle that many concurrent connections, it hit the limit and died. Root itself is immune to process limits; however, a limit set for root still applies to any setuid process spawned by root, if that limit is lower than the limit of the user being setuid to. So setting a specific (higher) limit for apache in limits.conf wasn’t enough; we had to bump it up for root as well.
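Concretely, the resulting limits.conf entries would look something like the following sketch. Only the 100-process default comes from the setup described above; the 4096 value is an assumption for illustration:

```
# /etc/security/limits.conf
# Default: ordinary users capped at 100 processes (the original policy).
*        soft    nproc    100
# apache needs headroom for many concurrent connections...
apache   soft    nproc    4096
# ...and so does root, because httpd is spawned by root and the setuid
# children stay bound by root's limit when it is lower than apache's.
root     soft    nproc    4096
```

Explicit user entries in limits.conf take precedence over the `*` wildcard, so the default for everyone else stays at 100.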
But that did the job. The site was back up in no time, happily serving all the images any Digg user could want to go with Alex’s blog, while still keeping a 0.03 load average. Next time someone’s images get posted to Digg or Slashdot, people.mozilla.com will be ready.