Brion Vibber wrote:
Posted this summary on blog, going out to en.planet.wikimedia.org... http://leuksman.com/log/2008/09/24/why-is-everything-broken-this-week/
We’ve tracked down today’s problems to a combination of a couple of things:
1. There’ve been ongoing database locking issues with the site statistics updates -- these would all block on each other, making page saves very slow at times
2. … which held open database connections, causing the text storage servers to start locking out new connections …
3. … which exacerbated problems with the failover behavior of recent changes to the storage and load balancing code.
I did see something like this before. I didn't revert the ES changes because they weren't the issue, and the fact that the ES master went down first is what allowed the site to continue in read-only mode. You could have just increased the max connections on the ES masters for the same effect as a revert: the connection count on the core master would have overflowed instead.
I did think I had found the root cause of the problem at the time, but obviously I hadn't.
I think the ES load balancing changes were useful, and are a good way to progress towards higher availability. I think a better way to fix the site_stats contention would have been to insert an unconditional COMMIT in SiteStatsUpdate::doUpdate().
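To illustrate the idea, here is a rough sketch in Python with sqlite3 rather than MediaWiki's PHP. The do_update() function, the single-row table and the in-memory database are stand-ins I've assumed for the example, not the real SiteStatsUpdate code. The point is only that the commit happens right after the counter update, so the lock on the stats row is not held for the rest of the request:

```python
import sqlite3


def do_update(conn, edits_delta):
    """Hypothetical stand-in for a site-stats update (not MediaWiki code)."""
    conn.execute(
        "UPDATE site_stats SET ss_total_edits = ss_total_edits + ?",
        (edits_delta,),
    )
    # Unconditional COMMIT: end the transaction immediately so other
    # writers waiting on the stats row are not blocked by whatever
    # this request does next.
    conn.commit()


# Throwaway in-memory database with an illustrative one-row stats table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE site_stats (ss_row_id INTEGER PRIMARY KEY, ss_total_edits INTEGER)"
)
conn.execute("INSERT INTO site_stats VALUES (1, 0)")
conn.commit()

do_update(conn, 1)
```

sqlite3 locks at the database level rather than the row level, so this only shows where the commit goes, not the contention itself.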
If the connection count on the ES master really is a problem (not just a symptom of a much larger problem), then that can be mitigated by closing the connections early. But I think the only reason we're seeing this come out on the ES servers is that they have the lowest maximum connection limit, so they fail first.
-- Tim Starling