-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
I posted this summary on my blog; it’s going out to
en.planet.wikimedia.org:
http://leuksman.com/log/2008/09/24/why-is-everything-broken-this-week/
We’ve tracked down today’s problems to a combination of a couple of things:
1. There’ve been ongoing database locking issues with the site
statistics updates — these would all block on each other, making page
saves very slow at times
2. … which held open database connections, causing the text storage
servers to start locking out new connections …
3. … which exacerbated problems with the failover behavior of recent
changes to the storage and load balancing code.
The code changes have been rolled back, fixing the slow site load
behavior. (Doing this correctly was unfortunately a bit painful, as we
had to restore the broken code for a while to work out what was going
on well enough to fully revert it.)
Domas believes the main culprit in the database locking is actually an
issue with our mail server: some actions (such as creation of new
accounts) involve both sending mail and updating the site statistics
table. With the mail server overloaded, and a very simple local mail
client called from MediaWiki, the outgoing mail would sometimes hang
while the transaction was still open. The transaction kept holding its
locks, which caused other updates to stall.
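To illustrate the failure mode described above, here is a minimal
sketch in Python. It is not MediaWiki code: SQLite stands in for the
production database, and the table layout, column names, and function
names are all made up for the example. It only shows that a writer
which leaves its transaction open (as it would while waiting on a hung
mail send) makes a second writer fail or stall.

```python
import os
import sqlite3
import tempfile


def make_demo_db():
    """Create a throwaway database with a hypothetical site_stats table."""
    path = os.path.join(tempfile.mkdtemp(), "demo.db")
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE site_stats (users INTEGER, edits INTEGER)")
    conn.execute("INSERT INTO site_stats VALUES (0, 0)")
    conn.commit()
    conn.close()
    return path


def second_writer_blocked(db_path):
    """Return True if a second writer cannot update the stats table
    while the first writer still has its transaction open."""
    # isolation_level=None means we control BEGIN/COMMIT ourselves.
    first = sqlite3.connect(db_path, isolation_level=None)
    # The second writer gives up after 0.1 s instead of waiting forever.
    second = sqlite3.connect(db_path, timeout=0.1, isolation_level=None)

    first.execute("BEGIN IMMEDIATE")  # take the write lock
    first.execute("UPDATE site_stats SET users = users + 1")
    # Imagine a mail send hanging here: the transaction is still open,
    # so the write lock is still held.
    try:
        second.execute("UPDATE site_stats SET edits = edits + 1")
        blocked = False
    except sqlite3.OperationalError:  # "database is locked"
        blocked = True

    first.execute("COMMIT")
    first.close()
    second.close()
    return blocked
```

Calling second_writer_blocked(make_demo_db()) exercises the scenario.
SQLite locks the whole database file rather than individual rows, so
this is only an analogy for the locking behavior on the real servers.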
As a temporary measure I’ve disabled the site stats updates, fixing the
failures on page save. (The stats will need to be brought back up to
date once we’ve fully resolved the underlying problem.)
We’re looking at the way the mail servers are set up to see if we can
ensure that internal connections don’t stall the way they have been; we
should also be able to rearrange the transactions so that things are
committed before the mail goes out!
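The "commit before the mail goes out" ordering could look roughly like
the following. This is a sketch under assumptions, not the actual
MediaWiki change: it is written in Python with SQLite rather than PHP
and the production database, and create_account, send_mail, and the
accounts and site_stats tables are hypothetical names chosen for
illustration.

```python
import sqlite3


def create_account(conn, name, send_mail):
    """Do the database work and commit it, then send mail afterwards."""
    # "with conn" commits on success and rolls back on an exception.
    with conn:
        conn.execute("INSERT INTO accounts (name) VALUES (?)", (name,))
        conn.execute("UPDATE site_stats SET users = users + 1")
    # The transaction is committed at this point, so a slow or hung
    # mail send can no longer hold database locks.
    send_mail(name)
```

The trade-off of this ordering is that the account exists even if the
mail later fails, so the caller has to be prepared to retry or report
the mail failure separately.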
- -- brion
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.8 (Darwin)
Comment: Using GnuPG with Mozilla -
http://enigmail.mozdev.org
iEYEARECAAYFAkjamZsACgkQwRnhpk1wk45STQCfTkw4Goq2N96nj5uSYSMLoJ/G
z6gAnicZzMjlVbaVUxtNGt8Rkgyd/yui
=aEqy
-----END PGP SIGNATURE-----