Though there were some very interesting hardware fireworks yesterday, the actual reason for the slowdown was far more prosaic.
First of all, the site didn't crash, it just gradually slowed down. It took us a while to actually notice the slowdown (at least half an hour after the problem started).
Second, it mostly wasn't our mediawiki/extensions codebase hitting the issue. Though of course some extensions could have triggered the same behavior, the reason was a little bit more complicated from a development perspective (or easier from a system administration perspective).
PHP calls external programs using 'sh -c', and the shell doesn't simply determine the current directory by calling getcwd(), it also looks at the environment variable $PWD. This is where the interesting part begins: when Apache is started, it does change its current working directory, but it doesn't change the environment variable. We regularly stop and start our application servers, and usually that is done while being in ~ (which is on NFS), so $PWD keeps pointing at an NFS path.
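To make the $PWD part less abstract, here is a minimal sketch in Python (not our PHP/Apache setup) of a process that changes its working directory without touching its environment, and of what a child then inherits. The function name stale_pwd_demo is made up for illustration. The child is deliberately a Python process rather than 'sh -c', because a shell may rewrite $PWD on startup after checking it, and as far as I understand that check is the step that has to touch the old path.

```python
import os
import subprocess
import sys
import tempfile


def stale_pwd_demo():
    """Return (env_pwd, real_cwd, start_dir, new_dir) as seen by a child process.

    Hypothetical helper: mimics a daemon that was started from one directory,
    chdir()s elsewhere, and never updates $PWD in its environment.
    """
    start_dir = os.path.realpath(os.getcwd())
    # What an interactive shell would have exported before starting the daemon.
    os.environ["PWD"] = start_dir
    with tempfile.TemporaryDirectory() as tmp:
        new_dir = os.path.realpath(tmp)
        os.chdir(new_dir)  # the working directory changes, the environment does not
        try:
            child = subprocess.run(
                [sys.executable, "-c",
                 "import os; print(os.environ.get('PWD')); print(os.getcwd())"],
                capture_output=True, text=True, check=True,
            )
        finally:
            os.chdir(start_dir)  # leave the temp dir before it is removed
    env_pwd, real_cwd = child.stdout.splitlines()
    return env_pwd, os.path.realpath(real_cwd), start_dir, new_dir


if __name__ == "__main__":
    env_pwd, real_cwd, _, _ = stale_pwd_demo()
    print("child $PWD   :", env_pwd)
    print("child getcwd :", real_cwd)
```

The child's real working directory is the new one, while its inherited $PWD still names the directory the parent was started from. In our case that stale value was a home directory on NFS.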
What happened then is that apache children sometimes call external programs, and those requests end up blocking on NFS. This consumes more and more worker processes, until there are none left to serve the site.
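The way the pool drains can be sketched with a toy model. This is only an illustration under made-up assumptions: the function name, the pool size and the "every Nth request shells out and hangs forever" rule are hypothetical, not numbers from the incident.

```python
def simulate_worker_exhaustion(total_workers, requests, blocks_every):
    """Return (served, free_workers) for a toy worker pool.

    Assumption for illustration: every `blocks_every`-th request calls an
    external program and hangs forever, while all other requests finish
    and hand their worker back to the pool.
    """
    free_workers = total_workers
    served = 0
    for request_number in range(1, requests + 1):
        if free_workers == 0:
            break  # no worker left to accept the request
        if request_number % blocks_every == 0:
            free_workers -= 1  # this worker is stuck and never comes back
        else:
            served += 1
    return served, free_workers


if __name__ == "__main__":
    # 4 workers, one request in ten hangs: the pool is gone after 40 requests.
    print(simulate_worker_exhaustion(4, 1000, 10))  # (36, 0)
```

Even when only a small share of requests hang, each one removes a worker for good, so capacity shrinks step by step instead of failing all at once. That matches a site that slows down gradually rather than crashing.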
So, in the end, it is a mixture of unexpected behavior, incomplete behavior, NFS sucking, etc. We didn't have a strong push for HA-NFS simply because our application does not rely on NFS too much anymore. We just didn't know that the OS could give us surprises like that ;-)
(And the hardware issues were resolved by flashing the service processor / BIOS / RAID controller with new firmware, or so it seems; maybe even a harder reboot helped. We might still have some issues there, but at least they will be more manageable.)
wikitech-l@lists.wikimedia.org