Fun with proxy servers

This last week saw the Firefox 1.0.4 security update fire drill. When the exploit in question was leaked, we noticed that it abused the default extension install whitelist, whose default entries included the site. So we decided to redirect all traffic for that site to another domain outside the whitelist in order to short-circuit the exploit. We also fiddled with the MIME types on the FTP servers so that clicking a link to an extension on the addons site would trigger a download of the extension file instead of automatically installing it.

Once Firefox 1.0.4 was out, with the security holes fixed, we could undo all of that and put the site back how it was. With one exception: the security hole still affects users of Firefox 1.0.3 and older. So we now sniff the UserAgent and redirect anyone using 1.0.3 or older to a page telling them they need to upgrade. Yes, we know UserAgents can be spoofed. We also figure that anyone who is enough of a power user to spoof their UserAgent is probably enough of a power user to know they need to upgrade on their own, and this still covers the default case of your mom running Firefox unaltered.
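To make the UserAgent check concrete, here is a minimal Python sketch of the decision involved. It is not the configuration that runs on the proxy servers; the function names, the regular expression, and the sample UserAgent strings are all assumptions made for illustration, and it assumes the UserAgent carries a "Firefox/<version>" token.

```python
import re

# Assumption: the UserAgent contains a "Firefox/<version>" token.
_FIREFOX_TOKEN = re.compile(r"Firefox/(\d+(?:\.\d+)*)")

# First release with the security holes fixed, per the post.
FIXED_VERSION = (1, 0, 4)


def firefox_version(user_agent):
    """Return the Firefox version as a tuple of ints, or None if not Firefox."""
    match = _FIREFOX_TOKEN.search(user_agent)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def needs_upgrade_redirect(user_agent):
    """True if this looks like Firefox 1.0.3 or older."""
    version = firefox_version(user_agent)
    if version is None:
        return False  # not Firefox (or spoofed): let the request through
    # Pad short versions so that "1.0" compares as 1.0.0.
    padded = version + (0,) * (len(FIXED_VERSION) - len(version))
    return padded < FIXED_VERSION


# Illustrative UserAgent strings (hypothetical examples, not taken from logs).
OLD_UA = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.7) Gecko/20050414 Firefox/1.0.3"
NEW_UA = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.8) Gecko/20050511 Firefox/1.0.4"
```

Comparing versions as tuples of integers rather than as strings avoids surprises such as "1.0.10" sorting before "1.0.4".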

Squid (which we’ve been using for our proxy servers for the addons site) can’t do redirects based on a UserAgent. So we built an RPM of Apache 2.1.3 for RHEL 4 and installed it on two of the new servers, using mod_proxy and mod_cache. We got lots of help setting it up from Paul Querna (a developer on the Apache httpd project).

I must say, the new proxy and caching features in Apache are pretty freaking sweet. You get a heck of a lot more control over the way the content is proxied, you can have multiple back-end servers split up by subdirectory under the same domain name, and you can even serve content locally in addition to proxying. You can efficiently issue 302 and 301 redirects from the proxy server itself instead of having to have hundreds of threads from a rewrite engine running in the background or having to pass them through to the back-end server. Combining the power of mod_rewrite with the power of mod_proxy and mod_cache is a beauty to behold. My initial reaction to all of these features was “gee, it must have a performance cost compared to Squid”, but it seems to be keeping up with our traffic just fine so far.

One Hour of Terror

“One hour of terror” is how we’ve jokingly started referring to the first hour of every month (measured in GMT), because of the bug in the 1.0 version of Firefox which causes it to only check for updates between the first of the month and the first Sunday of the month. Firefox checks with itself once per hour to see whether it’s been long enough since the last time it checked the server for updates to check again. So any copy of Firefox 1.0 that happens to be running at midnight GMT is going to have that check fire within the first hour after the clock ticks over past midnight. This absolutely SLAMS our servers, with every known copy of Firefox 1.0 (shame on people for not upgrading) checking in during that first hour instead of being spread across the entire day the way they usually are.
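For a feel of how narrow that window is, here is a small Python sketch that computes it. This is only a model of the behavior described above, not Firefox’s actual update code; the function names are made up for illustration, and treating the first Sunday as still inside the window is an assumption.

```python
import calendar
from datetime import date


def first_sunday(year, month):
    """Return the date of the first Sunday in the given month."""
    first_weekday = date(year, month, 1).weekday()  # Monday=0 ... Sunday=6
    offset = (calendar.SUNDAY - first_weekday) % 7
    return date(year, month, 1 + offset)


def in_update_window(day):
    """True if `day` falls between the 1st and the first Sunday (inclusive)."""
    return day <= first_sunday(day.year, day.month)
```

When the 1st of the month is itself a Sunday, the window in this model shrinks to a single day, so every check lands on that one date.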

Today is May 1st. Last night, at midnight GMT, was that hour. See the bandwidth graphs. We’d been hoping to have our new hardware set up by now, but it just arrived this last week, and we haven’t had time to configure it yet. Sky (which has been handling the application update service all by itself until now) took a beating right at midnight (17:00 on the charts I linked above). Within a few minutes, I managed to clone the webserver configuration onto Star (a machine about to be deployed for the Talkback services, but which the Talkback folks haven’t actually set anything up on yet) and added Star to the rotation so the requests were split between Sky and Star.

The scary part? When you look at those graphs, you’ll notice Star’s bandwidth skyrocket when it was added to the rotation, but Sky’s bandwidth didn’t go down at all. This means Star didn’t ease any load off of Sky; it just picked up load that hadn’t been making it through to begin with. Ooof.
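The reasoning there can be shown with a toy saturation model in Python. Everything in it is hypothetical: the function name, the even split across the rotation, and the numbers used in the usage note are assumptions chosen only to show the shape of the effect, not measurements.

```python
def served_per_server(offered, capacities):
    """Split `offered` load evenly across servers.

    Each server serves at most its own capacity; anything above that
    is simply dropped. Units are arbitrary (say, requests per minute).
    """
    share = offered / len(capacities)
    return [min(share, cap) for cap in capacities]
```

With a made-up offered load of 2000 and a capacity of 850 per server, one server serves 850, and two servers serve 850 each: the first server’s number does not move, and the total goes up. With an offered load of 800, which fits on one server, adding a second drops the first from 800 to 400. The graphs looked like the first case.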

We served over a million requests during that first hour (850,000 on Sky and 200,000 on Star), and about half a million during the second hour (split close to evenly). And the graphs make it plain that we weren’t serving all the requests that were coming in.
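To put those totals in per-second terms, here is the back-of-the-envelope arithmetic as a short Python sketch. The counts are the rounded figures quoted above, so the rates are only rough averages over each hour; the real peak right at midnight would have been higher. The names are made up for illustration.

```python
# Rounded request counts quoted in the post.
SKY_FIRST_HOUR = 850_000
STAR_FIRST_HOUR = 200_000
SECOND_HOUR_TOTAL = 500_000  # "about half a million"

SECONDS_PER_HOUR = 3600


def average_rate(requests, seconds=SECONDS_PER_HOUR):
    """Average requests per second over the given period."""
    return requests / seconds


first_hour_total = SKY_FIRST_HOUR + STAR_FIRST_HOUR

print(f"first hour:  {first_hour_total:,} requests, "
      f"about {average_rate(first_hour_total):.0f} per second")
print(f"second hour: {SECOND_HOUR_TOTAL:,} requests, "
      f"about {average_rate(SECOND_HOUR_TOTAL):.0f} per second")
```

That works out to roughly 292 requests per second averaged over the first hour and roughly 139 over the second, counting only the requests that actually got served.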

Next month we’ll be ready for it.