Brion Vibber wrote:
Posted this summary on blog, going out to
en.planet.wikimedia.org...
http://leuksman.com/log/2008/09/24/why-is-everything-broken-this-week/
We’ve tracked down today’s problems to a combination of a couple of things:
1. There’ve been ongoing database locking issues with the site
statistics updates — these would all block on each other, making page
saves very slow at times
2. … which held open database connections, causing the text storage
servers to start locking out new connections …
3. … which exacerbated problems with the failover behavior of recent
changes to the storage and load balancing code.
I did see something like this before. I didn't revert the ES changes
because they weren't the issue, and because the ES master going down
first allowed the site to continue in read-only mode. You could
have just increased the max connections on the ES masters, for the same
effect. The connection count on the core master would have overflowed instead.
But I did think I had found the root cause of the problem at the time;
obviously I hadn't.
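
To illustrate the point about connection limits, here is a toy model in Python. It is only a sketch: the server names and limits are made-up numbers, not the real configuration, and it assumes every stalled request holds one connection to each server.

```python
class Server:
    """Toy stand-in for a database server with a connection limit."""

    def __init__(self, name, max_connections):
        self.name = name
        self.max_connections = max_connections
        self.open = 0

    def connect(self):
        if self.open >= self.max_connections:
            raise ConnectionError(f"{self.name}: too many connections")
        self.open += 1


def first_to_fail(limits, stalled_requests):
    """Return the name of the first server to refuse a connection, or None.

    Each stalled request holds one connection on every server and never
    releases it, which models requests blocked behind a lock elsewhere.
    """
    servers = [Server(name, limit) for name, limit in limits.items()]
    for _ in range(stalled_requests):
        for server in servers:
            try:
                server.connect()
            except ConnectionError:
                return server.name
    return None


# Hypothetical limits: the ES master has the lower limit, so it fails first.
print(first_to_fail({"core_master": 500, "es_master": 100}, 1000))
# Raising only the ES limit moves the failure to the core master.
print(first_to_fail({"core_master": 500, "es_master": 2000}, 1000))
```

In this model, raising the ES limit does not stop the pile-up; it only changes which server runs out of connections.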
I think the ES load balancing changes were useful, and are a good way to
progress towards higher availability. I think a better way to fix the
site_stats contention would have been to insert an unconditional COMMIT in
SiteStatsUpdate::doUpdate().
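
As a rough sketch of why the commit matters, here is the idea in Python with sqlite3 rather than the actual MediaWiki PHP code. The table layout and the do_update() helper are hypothetical, and SQLite locks the whole database file rather than individual rows, so this only shows the general pattern: a write lock is held until the transaction commits.

```python
import os
import sqlite3
import tempfile


def do_update(conn, edits):
    """Hypothetical stand-in for a site statistics update."""
    conn.execute("UPDATE site_stats SET total_edits = total_edits + ?", (edits,))
    conn.commit()  # unconditional commit: release the write lock right away


def demo():
    path = os.path.join(tempfile.mkdtemp(), "stats.db")
    setup = sqlite3.connect(path)
    setup.execute("CREATE TABLE site_stats (total_edits INTEGER)")
    setup.execute("INSERT INTO site_stats VALUES (0)")
    setup.commit()
    setup.close()

    # timeout=0 makes a blocked writer fail immediately instead of waiting.
    a = sqlite3.connect(path, timeout=0)
    b = sqlite3.connect(path, timeout=0)

    # Without a commit, A keeps its write lock and B's update is refused.
    a.execute("UPDATE site_stats SET total_edits = total_edits + 1")
    try:
        b.execute("UPDATE site_stats SET total_edits = total_edits + 1")
        blocked = False
    except sqlite3.OperationalError:
        blocked = True
        b.rollback()
    a.commit()

    # With the commit inside do_update(), the two updates never overlap.
    do_update(a, 1)
    do_update(b, 1)
    total = b.execute("SELECT total_edits FROM site_stats").fetchone()[0]
    a.close()
    b.close()
    return blocked, total
```

Calling demo() returns whether the second writer was blocked while the first transaction was still open, plus the final counter value.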
If the connection count on the ES master really is a problem (not just a
symptom of a much larger problem), then that can be mitigated by closing
the connections early. But I think the only reason we're seeing this come
out on the ES servers is that they have the lowest maximum connection
limit, so they fail first.
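
For what "closing the connections early" could look like, here is a minimal sketch in Python with sqlite3, not the actual MediaWiki code. The blobs table and the load_revision_text() helper are assumptions made for the example; connect is any function that returns a new connection.

```python
import sqlite3
from contextlib import closing


def load_revision_text(connect, text_id):
    """Fetch one blob, closing the storage connection before returning."""
    with closing(connect()) as conn:
        row = conn.execute(
            "SELECT body FROM blobs WHERE id = ?", (text_id,)
        ).fetchone()
    # The connection is already closed here, so slow work done by the caller
    # (parsing, rendering, waiting on other locks) no longer occupies a slot.
    return row[0] if row else None
```

The connection is held only for the duration of the query, so a request that later stalls on something else does not count against the storage server's connection limit.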
-- Tim Starling