Tim Starling wrote:
Some general thoughts about this while it's on my mind: the key to minimising the impact of this kind of problem is isolation, rather than distribution. We already have good isolation for search, and are improving isolation for images -- if one of those two services goes offline then the rest should stay up, unaffected.
As background for those not familiar, here's the situation with search:
The actual search work is performed by a daemon using the Lucene search library. This is running on three servers, which our main PHP application servers can contact over HTTP internally. If the HTTP request is rejected or times out (and the timeout is obscenely short), the PHP side tries a couple more servers, until it either finds one that works or runs out and gives up.
So if the search servers are all overloaded or down, you just get a nice little error message and are offered the chance to use an external search (Google, Yahoo, etc.). No immediate gratification, but the site stays up.
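A rough sketch of that failover loop, in Python rather than our actual PHP, and not the production code. The host names, the /search?q= URL layout, the 0.5 second timeout and the three-attempt limit are all made up for illustration; the only point is that each attempt is bounded and the caller gets a clean exception it can turn into the "try an external search" message.

```python
import urllib.parse
import urllib.request
from typing import Callable, Sequence


class SearchUnavailable(Exception):
    """Raised when every backend that was tried failed or timed out."""


def http_fetch(host: str, query: str, timeout: float) -> str:
    # Hypothetical URL layout; the real daemon's interface may differ.
    url = f"http://{host}/search?" + urllib.parse.urlencode({"q": query})
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def search_with_failover(
    query: str,
    hosts: Sequence[str],
    fetch: Callable[[str, str, float], str] = http_fetch,
    timeout: float = 0.5,   # assumed value; deliberately short
    max_attempts: int = 3,  # assumed value; "a couple more servers"
) -> str:
    last_error = None
    for host in list(hosts)[:max_attempts]:
        try:
            return fetch(host, query, timeout)
        except OSError as exc:
            # Refused connections, timeouts and URLError are all OSError
            # subclasses, so a dead or slow host just moves us along.
            last_error = exc
    # Out of servers: fail fast instead of holding the web thread.
    raise SearchUnavailable("all search backends failed") from last_error
```

The caller catches SearchUnavailable and renders the error page, so a dead search cluster costs each request at most max_attempts times timeout.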
When we first tried this system, the timeout and failover weren't yet in use -- if the daemon encountered certain exceptions or got overloaded, it would leave connections hanging for a long time. All the available threads would fill up on the apaches; a hundred PHP processes just waiting on their search results... *kaboom*
The image fileserver is currently a potential problem, as the application servers use NFS to manipulate files on it. NFS is notoriously temperamental, and if the server goes down, NFS access tends to hang for long periods of time, with similar results.
Improvements to this could include minimizing our contact with the file server (avoid unnecessary reads and checks for file existence; we've got a damn database) and potentially using some more explicit file upload protocol which can fail gracefully.
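To make the "we've got a database" point concrete, here is a minimal sketch of answering "does this file exist?" from a table lookup instead of a stat over NFS. It uses an in-memory SQLite database as a stand-in; the table and column names (image, img_name) and the helper names are assumptions for the example, not a description of our schema or code.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    # Stand-in for the real database; schema is illustrative only.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE image (img_name TEXT PRIMARY KEY)")
    conn.execute("INSERT INTO image (img_name) VALUES (?)", ("Example.png",))
    conn.commit()
    return conn


def image_exists(conn: sqlite3.Connection, name: str) -> bool:
    # One indexed lookup; the file server is never touched.
    row = conn.execute(
        "SELECT 1 FROM image WHERE img_name = ? LIMIT 1", (name,)
    ).fetchone()
    return row is not None
```

With that in place, a page view that only needs to know whether to render an image link never blocks on the file server.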
Maybe it's time we introduced a "basic" query group, containing those queries required for page views. Then we could send all "basic" queries to a dedicated cluster, and all other queries to a second isolated cluster. Then as long as we can keep the apache thread count low enough, any problem with those diverse special page queries would not affect page view performance.
Probably wise.
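Something along these lines, as a sketch only: the class name, the "basic" group label as a string constant, and the random choice within a cluster are assumptions for illustration, not how our load balancing code works today. The one property that matters is that the two host lists never overlap, so a runaway special-page query can't land on a server that page views depend on.

```python
import random
from typing import Optional, Sequence

BASIC_GROUP = "basic"  # queries required for ordinary page views


class QueryRouter:
    """Send "basic" queries to one cluster and everything else to another."""

    def __init__(
        self,
        basic_hosts: Sequence[str],
        other_hosts: Sequence[str],
        rng: Optional[random.Random] = None,
    ) -> None:
        if not basic_hosts or not other_hosts:
            raise ValueError("both clusters need at least one host")
        if set(basic_hosts) & set(other_hosts):
            # Sharing a host would defeat the isolation.
            raise ValueError("clusters must not share hosts")
        self._basic = list(basic_hosts)
        self._other = list(other_hosts)
        self._rng = rng or random.Random()

    def pick_host(self, group: str) -> str:
        pool = self._basic if group == BASIC_GROUP else self._other
        return self._rng.choice(pool)
```

For example, QueryRouter(["db1", "db2"], ["db3"]).pick_host("basic") only ever returns db1 or db2, and any other group name always gets db3 (host names invented here).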
We could go even further and split the apache cluster into an "ordinary page view" cluster and an "everything else" cluster. This would mitigate DoS attacks on apache resources.
Slightly less trivial, but probably doable.
-- brion vibber (brion @ pobox.com)