Tim Starling wrote:
Some general thoughts about this while it's on my mind: the key here to
minimising the impact of this kind of problem is isolation, rather than
distribution. We already have good isolation for search, and improving
isolation for images -- if one of those two services goes offline then the
rest should stay up, unaffected.
As background for those not familiar, here's the situation with search:
The actual search work is performed by a daemon using the Lucene search library.
This is running on three servers, which our main PHP application servers can
contact over HTTP internally. If the HTTP request is rejected or times out (and
the timeout is obscenely short), the PHP side tries a couple more servers, until
it either finds one that works or runs out and gives up.
So if the search servers are all overloaded or down, you just get a nice little
error message and are offered the chance to use an external search
(google/yahoo/etc). No immediate gratification, but the site stays up.
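
To make the failover behaviour concrete, here is a minimal sketch in Python (the real client side is PHP). The host names, port, URL layout, and timeout value are all assumptions for illustration; none of them come from the actual configuration.

```python
import urllib.parse
import urllib.request

# Hypothetical internal hosts; the real server names are not given here.
SEARCH_HOSTS = ["search1.internal:8123", "search2.internal:8123", "search3.internal:8123"]
TIMEOUT_SECONDS = 0.5  # deliberately short; the actual value is an assumption


def http_fetch(host, query, timeout):
    """Ask one search daemon for results, giving up after `timeout` seconds."""
    url = "http://%s/search?q=%s" % (host, urllib.parse.quote(query))
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def search_with_failover(query, hosts=SEARCH_HOSTS, fetch=http_fetch,
                         timeout=TIMEOUT_SECONDS):
    """Try each host in turn; return results, or None if every host failed."""
    for host in hosts:
        try:
            return fetch(host, query, timeout)
        except OSError:
            # Refused connections and timeouts both surface as OSError
            # subclasses, so move on to the next server.
            continue
    # Every server failed: the caller shows the error message and the
    # external search links instead of hanging.
    return None
```

The point of the short timeout is that a dead or overloaded search daemon costs each request a bounded amount of time rather than tying up an apache thread indefinitely.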
When we first tried this system, the timeout and failover weren't yet used -- if
the daemon encountered certain exceptions or got overloaded it would leave
connections hanging for a long time. All the available threads would fill up on
the apaches; a hundred PHP processes just waiting on their search results...
*kaboom*
The image fileserver is currently a potential problem, as the application
servers use NFS to manipulate files on it. NFS is notoriously temperamental, and
if the server goes down it tends to hang for long periods of time, with similar
results.
Improvements to this could include minimizing our contact with the file server
(avoid unnecessary reads and checks for file existence; we've got a damn
database) and potentially using some more explicit file upload protocol which
can fail gracefully.
Maybe it's time we introduced a "basic" query group, containing those
queries required for page views. Then we
could send all "basic" queries to a dedicated cluster, and all other queries
to a second isolated cluster. Then as long as we can keep the apache thread
count low enough, any problem with those diverse special page queries would
not affect page view performance.
Probably wise.
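
A rough sketch of what routing by query group might look like, in Python for brevity. The cluster layout, server names, and the QueryRouter class are all hypothetical; this only illustrates the idea of sending "basic" queries to their own cluster and everything else to a second one.

```python
import itertools

# Hypothetical cluster layout, for illustration only.
CLUSTERS = {
    "basic": ["db-basic-1", "db-basic-2"],
    "other": ["db-other-1"],
}


class QueryRouter:
    """Pick a database server for a query based on its query group."""

    def __init__(self, clusters):
        # One round-robin iterator per cluster.
        self._cycles = {
            name: itertools.cycle(servers) for name, servers in clusters.items()
        }

    def server_for(self, group):
        # Only the "basic" group reaches the dedicated cluster; every other
        # group is kept on the second, isolated cluster.
        cluster = "basic" if group == "basic" else "other"
        return next(self._cycles[cluster])
```

With this split, a slow special page query can only pile up on the "other" servers, so page view queries keep their own capacity.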
We could go even further and split the apache cluster into an "ordinary page
view" cluster and an "everything else" cluster. This would mitigate DoS
attacks on apache resources.
Slightly less trivial, but probably doable.
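
As a very rough sketch of the classification step, again in Python: something in front of the apaches would have to decide which pool a request belongs to. The pool names and the pool_for function are made up, and the rule used here (a plain view of a non-special page counts as an ordinary page view) is an assumption about where the line would be drawn.

```python
from urllib.parse import parse_qs, urlsplit


def pool_for(url):
    """Return "pageview" for ordinary page views, "other" for everything else."""
    parts = urlsplit(url)
    params = parse_qs(parts.query)
    # A request with no explicit action is treated as a plain view.
    action = params.get("action", ["view"])[0]

    if "title" in params:
        title = params["title"][0]
    elif parts.path.startswith("/wiki/"):
        title = parts.path[len("/wiki/"):]
    else:
        # Unknown URL shape: keep it away from the page view pool.
        return "other"

    if action == "view" and not title.startswith("Special:"):
        return "pageview"
    return "other"
```

Anything the classifier doesn't recognise goes to the "everything else" pool, so mistakes cost performance there rather than on page views.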
-- brion vibber (brion @ pobox.com)