So yesterday afternoon, Alex Faaborg blogged about some new features in Firefox 3. No big deal until it got posted on digg.com. The blog server could take it. It’s in the load balancing cluster behind a caching proxy server, which doesn’t even notice this kind of traffic. But Alex had posted his images in his personal space on people.mozilla.com, a single server that isn’t really considered production critical on IT’s priority list. Now, even though this is a single server, it’s not exactly sucky hardware. The machine should have been more than capable of handling a slashdotting and getting dugg at the same time. So we were all pretty surprised when it fell over.
Apache kept dying and spitting out errors about failing to setuid to the apache user. After much banging of heads, Justin Dolske found a relevant post in one of the Gentoo forums, of all places, which pointed the finger at per-user process limits and suggested using ulimit in the initscript to override them. Using ulimit turned out not to be necessary, but it did get me looking in the right places.
Mozilla employees get shell accounts on people.mozilla.com (it makes it easier for them to manage their webspace there, and several folks use it to run irssi in screen to keep a session to irc.mozilla.org open). To keep users from bogging down the machine, we had used pam_limits to cap logins at 100 processes per user in /etc/security/limits.conf. Well, it turns out that this limit applied to root and apache as well. So when Apache spawned its 100th process to handle that many concurrent connections, it hit the limit and died. Now, root itself is immune to process limits; however, limits set for root still apply to any setuid processes spawned by root, if root’s limit is lower than that of the user being setuid to. So setting a specific (higher) limit for apache in limits.conf wasn’t enough. I had to bump it up for root as well.
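To make the override logic concrete, here is a rough Python sketch of how a per-user entry beats the wildcard entry when looking up the nproc limit. It is a simplified illustration, not how pam_limits is actually implemented: it ignores the soft/hard distinction and group entries. The sample file contents are assumptions too. Only the 100-process wildcard comes from the story above; the 1024 values for apache and root are made up for the example. The last function just reports the limit the current process inherited, using Python's standard resource module.

```python
import resource

# Hypothetical limits.conf contents. Only the wildcard value of 100 comes
# from the post; the 1024 values are illustrative assumptions.
SAMPLE_LIMITS = """\
# <domain>  <type>  <item>  <value>
*           hard    nproc   100
apache      hard    nproc   1024
root        hard    nproc   1024
"""


def parse_value(text):
    """Convert a limits.conf value to an int, or None for 'unlimited'."""
    if text in ("unlimited", "infinity", "-1"):
        return None
    return int(text)


def nproc_limit(conf_text, user):
    """Return the nproc value for `user` in a limits.conf-style text.

    Simplified: an explicit entry for the user wins over the '*' wildcard,
    and the type column (soft/hard) and group entries are ignored.
    Returns the string 'no entry' if nothing matches.
    """
    wildcard = "no entry"
    explicit = "no entry"
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        fields = line.split()
        if len(fields) != 4:
            continue
        domain, _limit_type, item, value = fields
        if item != "nproc":
            continue
        if domain == user:
            explicit = parse_value(value)
        elif domain == "*":
            wildcard = parse_value(value)
    return explicit if explicit != "no entry" else wildcard


def inherited_nproc_limit():
    """Return the (soft, hard) process limit the current process inherited."""
    return resource.getrlimit(resource.RLIMIT_NPROC)
```

With the sample above, an ordinary shell user still falls under the wildcard cap of 100, while apache and root each pick up their own higher entries, which mirrors the fix of raising the limit for both.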
But that did the job. The site was back up in no time, happily serving all the images any Digg user could want to go with Alex’s blog, and still keeping a 0.03 load average. Next time someone’s images get posted to Digg or Slashdot, people.mozilla.com will be ready.