Wouldn't it be a good idea to put things such as emails and stats updates into the job queue? (All stats updates could share one job type, with a parameter selecting which statistic to update.)
Then the slowness would be handled by the job runners, letting edits come through quickly. Since we're not doing it in-transaction anyway, there shouldn't be a big problem with it (we could probably do the same for logging, although it's not as important).
Assuming the job runners properly free connections, they shouldn't have any open connections except the one they are currently using to update the stats (and in the case of emails, no db connections at all if we pass the data through in parameters, or if we connect, grab it, then disconnect before even starting the email).
This would probably help lower the cost of stats updates, and stop emails from holding DB connections at all. It probably treats the symptoms rather than the underlying problem, but it would work for now.
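A rough sketch of the idea in Python, just to make the shape concrete. This is not MediaWiki's actual job queue: the in-memory deque, the job type names `statsUpdate` and `sendEmail`, and every function name here are made up for illustration.

```python
from collections import deque

job_queue = deque()                 # stand-in for a persistent job table
site_stats = {"edits": 0, "pages": 0}
outbox = []                         # stand-in for a real mailer

def enqueue(job_type, **params):
    """Record work for later; cheap enough to call during a page save."""
    job_queue.append((job_type, params))

def save_edit(title, is_new_page, watchers):
    # ... the page save itself would happen here ...
    # One job type for all stats, with a parameter saying which one.
    enqueue("statsUpdate", field="edits", delta=1)
    if is_new_page:
        enqueue("statsUpdate", field="pages", delta=1)
    for address in watchers:
        # Everything the mail needs travels in the parameters,
        # so the runner needs no DB connection to send it.
        enqueue("sendEmail", to=address, subject=f"{title} was edited")

def stats_update(field, delta):
    site_stats[field] += delta

def send_email(to, subject):
    outbox.append((to, subject))

HANDLERS = {"statsUpdate": stats_update, "sendEmail": send_email}

def run_jobs():
    """What a job runner would do, outside the request that saved the page."""
    while job_queue:
        job_type, params = job_queue.popleft()
        HANDLERS[job_type](**params)

save_edit("Main Page", is_new_page=True, watchers=["alice@example.org"])
pending = len(job_queue)   # jobs are queued, nothing has been applied yet
run_jobs()
```

The save path only appends to the queue, so any contention on the stats row would be felt by the runner rather than by the person saving the page.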
- mattj
--------------------------------------------------
From: "Tim Starling" tstarling@wikimedia.org
Sent: Thursday, September 25, 2008 3:18 PM
To: wikitech-l@lists.wikimedia.org
Subject: Re: [Wikitech-l] Page saving slowness and some loading breakage today
Brion Vibber wrote:
Posted this summary on blog, going out to en.planet.wikimedia.org... http://leuksman.com/log/2008/09/24/why-is-everything-broken-this-week/
We’ve tracked down today’s problems to a combination of a couple of things:
1. There’ve been ongoing database locking issues with the site statistics updates; these would all block on each other, making page saves very slow at times
2. … which held open database connections, causing the text storage servers to start locking out new connections
3. … which exacerbated problems with the failover behavior of recent changes to the storage and load balancing code.
I did see something like this before. I didn't revert the ES changes because they weren't the issue, and the fact that the ES master went down first allowed the site to continue in read-only mode. You could have just increased the max connections on the ES masters, for the same effect. The connection count on the core master would have overflowed instead.
But I did think I had found the root cause of the problem at the time; obviously I hadn't.
I think the ES load balancing changes were useful, and are a good way to progress towards higher availability. I think a better way to fix the site_stats contention would have been to insert an unconditional COMMIT in SiteStatsUpdate::doUpdate().
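To illustrate what an unconditional commit buys, here is a small Python sketch using sqlite3 as a stand-in. It is not the actual SiteStatsUpdate::doUpdate() code: the `do_update` function and the cut-down `site_stats` table are assumptions for the example, and SQLite locks the whole database where InnoDB would lock a row, so only the principle carries over.

```python
import os
import sqlite3
import tempfile

def do_update(conn, edits=0, pages=0):
    """Apply the stats deltas, then commit at once so the lock is released."""
    conn.execute(
        "UPDATE site_stats "
        "SET ss_total_edits = ss_total_edits + ?, "
        "    ss_total_pages = ss_total_pages + ? "
        "WHERE ss_row_id = 1",
        (edits, pages),
    )
    conn.commit()   # unconditional: never leave the write lock open

path = os.path.join(tempfile.mkdtemp(), "stats.db")
# timeout=0 makes a blocked writer fail immediately instead of waiting.
a = sqlite3.connect(path, timeout=0)
b = sqlite3.connect(path, timeout=0)

a.execute(
    "CREATE TABLE site_stats ("
    " ss_row_id INTEGER PRIMARY KEY,"
    " ss_total_edits INTEGER,"
    " ss_total_pages INTEGER)"
)
a.execute("INSERT INTO site_stats VALUES (1, 0, 0)")
a.commit()

do_update(a, edits=1)
# Succeeds only because connection a already committed; with a's
# transaction still open this would raise "database is locked".
do_update(b, edits=1, pages=1)

totals = b.execute(
    "SELECT ss_total_edits, ss_total_pages FROM site_stats"
).fetchone()
a.close()
b.close()
```

The point is that the lock on the stats row lasts only as long as the UPDATE itself, rather than until whatever else the request does has finished.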
If the connection count on the ES master really is a problem (not just a symptom of a much larger problem), then that can be mitigated by closing the connections early. But I think the only reason we're seeing this come out on the ES servers is because they have the lowest number of maximum connections, so they fail first.
-- Tim Starling