Though there were some very interesting hardware fireworks yesterday,
the actual reason for the slowdown was way, way more prosaic.
First of all, the site didn't crash, it just gradually slowed down. It
took us a while to actually notice the slowdown (at least half an hour
after the problem started).
Second, for the most part it wasn't our mediawiki/extensions codebase
hitting the issue. Though of course there were some extensions that
could've triggered the same behavior, the reason was a little bit more
complicated from a development perspective (or easier from a system
administration perspective).
PHP calls external programs using 'sh -c', which doesn't simply
determine the current directory by calling getcwd(), but also looks at
the environment variable $PWD.
This is where the interesting part begins: Apache does change its
current working directory when it is started, but it doesn't change
the environment variable.
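
To make that mismatch concrete, here is a minimal sketch (not what
Apache or PHP actually run, just an illustration in Python). It
assumes a hypothetical helper, child_view_after_chdir(), that changes
directory the way a daemon might and then asks a child process what it
inherited. The temporary directory stands in for Apache's new working
directory, and the exported PWD stands in for "wherever the admin was
sitting when starting the server".

```python
import os
import subprocess
import sys
import tempfile


def child_view_after_chdir(new_dir):
    """Change directory, then report what a freshly spawned child sees.

    Returns (PWD inherited by the child, the child's real working dir).
    Hypothetical helper, for illustration only.
    """
    saved_cwd = os.getcwd()
    os.chdir(new_dir)  # moves the process cwd; os.environ["PWD"] is untouched
    try:
        probe = "import os; print(os.environ.get('PWD', '')); print(os.getcwd())"
        lines = subprocess.run(
            [sys.executable, "-c", probe],
            capture_output=True, text=True, check=True,
        ).stdout.splitlines()
        return lines[0], lines[1]
    finally:
        os.chdir(saved_cwd)  # restore, so the caller is not affected


if __name__ == "__main__":
    # Stand-in for the PWD exported by the shell that launched the daemon.
    os.environ["PWD"] = os.getcwd()
    with tempfile.TemporaryDirectory() as tmp:
        inherited_pwd, real_cwd = child_view_after_chdir(tmp)
        print("child $PWD  :", inherited_pwd)
        print("child getcwd:", real_cwd)
```

The two printed paths differ: the child really runs in the new
directory, but its environment still names the old one. A shell
started in that state has a stale path to look at.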
We regularly start and stop and start and stop our application
servers, and usually that is done while being in ~ (which is on NFS).
What happened then is that sometimes apache children call external
programs, the shell they spawn looks at the stale $PWD, which still
points into NFS, and so some requests end up blocking on NFS. This
consumes more and more worker processes, until there are none left to
serve the site.
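
The gradual nature of the slowdown can be sketched with a toy model.
Everything here is an assumption made for illustration: the function
name, the fixed pool size, and the idea that exactly every
block_every-th request shells out and hangs forever. Real traffic is
not that regular, but the arithmetic shows why the site degrades
slowly instead of falling over at once.

```python
def requests_until_exhaustion(workers, block_every):
    """Count requests handled before every worker is stuck.

    Toy assumption: every `block_every`-th request calls an external
    program and hangs forever; all other requests finish and free
    their worker again.
    """
    free = workers
    handled = 0
    while free > 0:
        handled += 1
        if handled % block_every == 0:
            free -= 1  # this worker never comes back
    return handled


if __name__ == "__main__":
    # Hypothetical numbers: 10 workers, 1 request in 50 shells out.
    print(requests_until_exhaustion(10, 50))
```

With these made-up numbers the pool survives 500 requests, losing one
worker per 50, which matches the "slower and slower, then nothing"
pattern described above.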
So, in the end, it is a mixture of unexpected behavior, incomplete
behavior, NFS suck, etc.
We didn't have a strong push to have HA-NFS simply because our
application does not rely on NFS too much anymore.
We just didn't know that the OS can give us surprises like that ;-)
(and the hardware issues were resolved by flashing the service
processor / BIOS / RAID controller with new firmware, or so it seems;
maybe even a harder reboot helped. It might be that we still have some
issues there, but at least they will be more manageable).
--
Domas Mituzas --
http://dammit.lt/ -- [[user:midom]]