The site was offline recently for about 20-30 minutes, with some additional downtime of uploads only, while our upload fileserver amane was broken.
Quick summary of affairs before I run off to dinner:
* amane's mount of zwinger:/home had broken in some way, such that accesses were hanging
** amane's syslog shows a large number of RPC failures for zwinger's NFS for the last few hours
* user ssh logins to amane failed due to the broken /home
* lighttpd ran out of connections, with lots of stuck php processes, likely because thumbnail rendering used files on /home
* amane's nfs server still worked, so the site ran internally
* root ssh login worked, and I was able to kill lighty and remount /home
* however, shortly after I tried restarting lighty, it died more thoroughly: I was unable to continue the ssh session (stuck) and new ssh sessions didn't get past opening port 22
* at this point amane's nfs died too
* can't find anything in syslog relating to that
* from this point the whole site was broken
* there's a donation link on the error page, which points to a wiki page so it's also broken
* tried to change the error page to link to the separate fundraising server, but the update didn't quite take before we finished
* we had the colo reboot the machine
* they had to call us back for more info because the machine was not properly labeled
* amane is not on the serial console server!
* after rebooting, things settled down after a few minutes
* site seems ok at the moment
Recommendations for future:
* make sure all servers are marked
* important machines *must* be on the serial console when installed
* the site should still work if images are offline. check code that works with image files to make it fail more gracefully
* check NFS mount settings, try to set them up in a more failure-friendly way
and of course
* try to get a backup image server online
* have a way to switch to it automatically
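
To illustrate the "fail more gracefully" recommendation, here is a rough sketch in Python (for illustration only, not the site's actual php code). It bounds how long a request waits on an image file, so a hung /home-style mount yields a placeholder instead of a stuck process. The names call_with_timeout, read_image, image_or_placeholder and PLACEHOLDER are all hypothetical, and the 2 second timeout is an arbitrary assumption.

```python
import threading

# Hypothetical stand-in for whatever "image unavailable" content the site would serve.
PLACEHOLDER = b"image temporarily unavailable"


def call_with_timeout(func, timeout_s, *args):
    """Run func(*args) in a daemon thread.

    Returns (True, result) on success, or (False, None) if the call
    raised an exception or did not finish within timeout_s seconds.
    """
    box = {}

    def runner():
        try:
            box["result"] = func(*args)
        except Exception as exc:  # e.g. missing file, I/O error
            box["error"] = exc

    worker = threading.Thread(target=runner, daemon=True)
    worker.start()
    worker.join(timeout_s)
    if worker.is_alive() or "error" in box:
        return False, None
    return True, box["result"]


def read_image(path):
    """Plain blocking read; this is the call that can hang on a broken mount."""
    with open(path, "rb") as handle:
        return handle.read()


def image_or_placeholder(path, timeout_s=2.0, reader=read_image):
    """Return the image bytes, or PLACEHOLDER if the read fails or stalls."""
    ok, data = call_with_timeout(reader, timeout_s, path)
    return data if ok else PLACEHOLDER
```

Note the limit of this approach: a thread blocked on a dead mount is abandoned, not killed, so stuck threads can still pile up under load. It only keeps the request itself from waiting forever, and does not replace fixing the mount settings.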
-- brion vibber (brion @ pobox.com)
wikitech-l@lists.wikimedia.org