-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Quick report on today's downtimes...
http://leuksman.com/log/2008/09/22/wikipedia-downtime-2x-today/
Well, today was exciting! Wikimedia’s sites experienced two separate
downtime events.
The first, which lasted about 30 minutes, was due to a power problem.
While Rob was performing maintenance fixing up power in rack B2, power
was inadvertently shut off to an access switch serving another rack of
servers, which took a chunk of our core text storage offline.
The second, which also lasted about 30 minutes, was caused by a file
server failure. The file server that holds our NFS home directories and
misc files and logs experienced a kernel crash, then turned up some disk
errors on reboot. (Possibly two failed drives, which may hose the array.)
Ideally this wouldn’t disturb production web serving, but various
debugging logs were being saved onto this server, and this caused the
web servers to hang waiting for NFS to come back up.
We’ve disabled the internal debug logging for now, and the site’s back
up and running while we poke at recovering or replacing the file server.
Both of these problems can be ameliorated in the future with some more
failure-proof design:
* Spreading text storage clusters across multiple racks will protect
against localized power or network failures
* Moving debug logs to a UDP-based system will give centralized
logging a more graceful failure mode than hanging NFS shares: if the
log host goes down, log lines get dropped instead of stalling the web
servers
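To illustrate the second point, here is a rough sketch of fire-and-forget
debug logging over UDP. It is only an illustration, not the setup we will
actually deploy: the function name, the collector address, and the port
are all made up for the example. The idea is that the sender hands one
datagram to the kernel and moves on, so a dead log host costs a lost log
line rather than a hung web server.

```python
import socket

# Hypothetical collector address and port, chosen for this example only.
LOG_HOST = "127.0.0.1"
LOG_PORT = 8420


def send_debug_log(message, host=LOG_HOST, port=LOG_PORT):
    """Send one log line as a single UDP datagram.

    Returns True if the datagram was handed to the kernel, False if the
    send failed locally. It never waits for the collector and never
    raises on a network error, so the caller cannot hang on logging.
    """
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            # Non-blocking: if the send buffer is full, fail at once
            # instead of waiting.
            sock.setblocking(False)
            sock.sendto(message.encode("utf-8"), (host, port))
        return True
    except OSError:
        # Covers a full buffer, an oversized message, or an unreachable
        # network. The log line is dropped.
        return False


if __name__ == "__main__":
    send_debug_log("debug: example log line")
```

The trade-off is that delivery is not guaranteed: a True result only means
the datagram left the sender, not that anything received it. For debug
logs that seems an acceptable price for keeping page serving independent
of the log host.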
- -- brion
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.8 (Darwin)
Comment: Using GnuPG with Mozilla -
http://enigmail.mozdev.org
iEYEARECAAYFAkjX6hwACgkQwRnhpk1wk45keQCeIjGLygMHk5/8Uk2JmpYyCS9y
FygAoI2XFgVEmIvEiA0sTw2No8qo57a3
=xmub
-----END PGP SIGNATURE-----