Posted this summary on the blog; it's going out to en.planet.wikimedia.org: http://leuksman.com/log/2008/09/24/why-is-everything-broken-this-week/
We've tracked down today's problems to a combination of things:
1. There've been ongoing database locking issues with the site statistics updates; these would all block on each other, making page saves very slow at times.
2. … which held open database connections, causing the text storage servers to start locking out new connections …
3. … which exacerbated problems with the failover behavior of recent changes to the storage and load balancing code.
The code changes have been rolled back, fixing the slow site load behavior. (Doing this cleanly was unfortunately a bit painful: we had to restore the broken code for a while to work out what was going on before we could fully revert it.)
Domas believes the main culprit in the database locking is actually an issue with our mail server: some actions (such as creating a new account) involve both sending mail and updating the site statistics table. With the mail server overloaded, and MediaWiki calling out through a very simple local mail client, the outgoing mail would sometimes hang while the transaction was still open, holding the locks and stalling other updates.
As a temporary measure I've disabled the site stats updates, fixing the failures on page save. (The stats will need to be recalculated once this is fully resolved.)
We're looking at the way the mail servers are set up to see if we can ensure that internal connections don't stall the way they did; we should also be able to rearrange the transactions so that everything is committed before the mail goes out!
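Roughly, the shape of that second fix (a sketch only; the function and table updates here are illustrative, not the actual MediaWiki call sites, and the UserMailer::send() arguments are abbreviated):

    // Sketch: commit, and thus release the site_stats row lock,
    // before the first SMTP round-trip.
    function createAccountAndNotify( $dbw, $name, $address ) {
        $dbw->begin();
        $dbw->insert( 'user', array( 'user_name' => $name ), __METHOD__ );
        // site_stats has a single row, so this takes the row lock:
        $dbw->update( 'site_stats',
            array( 'ss_users = ss_users + 1' ), '*', __METHOD__ );
        $dbw->commit(); // locks released here, not at the end of the request

        // A hung mailhub now ties up one Apache child, not the master DB.
        // (Arguments abbreviated for the sketch.)
        UserMailer::send( $address, 'Welcome', 'Your account has been created.' );
    }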
-- brion
Brion Vibber wrote:
We're looking at the way the mail servers are set up to see if we can ensure that internal connections don't stall the way they did; we should also be able to rearrange the transactions so that everything is committed before the mail goes out!
Why does MediaWiki need to wait on sendmail? If there's an error sending the email, it will most likely be on the user's end, such as an account that no longer exists. There's no need to have Apache processes waiting on sendmail.
Indeed, some kind of queue that accepts messages immediately would seem most logical.
Sent from my phone, Jeff
Platonides wrote:
Why does MediaWiki need to wait on sendmail? If there's an error sending the email, it will most likely be on the user's end, such as an account that no longer exists. There's no need to have Apache processes waiting on sendmail.
Most of the Apache servers have SSMTP installed as the local MTA. This is a fairly "dumb" MTA which does no local queueing and simply passes things off to another server. This is nice and simple most of the time, but indeed if your mailhub doesn't respond in a timely fashion... not so good. :)
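For reference, the relevant part of ssmtp's config is just the relay host; there's no queueing option to tune because there is no queue (hostnames below are placeholders):

    # /etc/ssmtp/ssmtp.conf (placeholder hostnames)
    # ssmtp hands each message to the mailhub synchronously; if that
    # host is slow to accept, the calling PHP process blocks with it.
    mailhub=smtp.example.internal
    hostname=apache42.example.internal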
-- brion
Brion Vibber wrote:
We've tracked down today's problems to a combination of things:
1. There've been ongoing database locking issues with the site statistics updates; these would all block on each other, making page saves very slow at times.
2. … which held open database connections, causing the text storage servers to start locking out new connections …
3. … which exacerbated problems with the failover behavior of recent changes to the storage and load balancing code.
I did see something like this before. The reason I didn't revert the ES changes is that they weren't the issue, and the fact that the ES master went down first allowed the site to continue in read-only mode. You could have just increased the max connections on the ES masters, for the same effect; the connection count on the core master would have overflowed instead.
But I did think I had found the root cause of the problem at the time; obviously I hadn't.
I think the ES load balancing changes were useful, and are a good way to progress towards higher availability. I think a better way to fix the site_stats contention would have been to insert an unconditional COMMIT in SiteStatsUpdate::doUpdate().
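Something like this, I mean (a sketch of the idea, not the actual method body; the real code builds its SET list from the pending deltas):

    function doUpdate() {
        $dbw = wfGetDB( DB_MASTER );
        $dbw->update( 'site_stats',
            array( 'ss_total_edits = ss_total_edits + 1' ),
            '*', __METHOD__ );
        // Unconditional commit: release the single-row lock right away
        // instead of holding it until the surrounding request finishes.
        $dbw->commit();
    }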
If the connection count on the ES master really is a problem (not just a symptom of a much larger problem), then that can be mitigated by closing the connections early. But I think the only reason we're seeing this come out on the ES servers is because they have the lowest number of maximum connections, so they fail first.
-- Tim Starling
Tim Starling wrote:
I did see something like this before. The reason I didn't revert the ES changes is that they weren't the issue, and the fact that the ES master went down first allowed the site to continue in read-only mode. You could have just increased the max connections on the ES masters, for the same effect; the connection count on the core master would have overflowed instead.
But I did think I had found the root cause of the problem at the time; obviously I hadn't.
Doing the revert totally changed the performance characteristics of the site, moving it from sitting around timing out to *being* readable.
I'm not sure what part was the problem, but something was definitely wrong...
I think the ES load balancing changes were useful, and are a good way to progress towards higher availability. I think a better way to fix the site_stats contention would have been to insert an unconditional COMMIT in SiteStatsUpdate::doUpdate().
Well, my main concern there is that if operations are weirdly ordered, you can end up with a total "transaction" half-committed... on the other hand, these are done in deferred updates. They're in theory meant to be something that won't kill ya if it fails; otherwise they'd have been... not... deferred.
Either we need to rethink the old deferred-updates system entirely and apply those updates immediately, or we should make them run as separate transactions (and potentially make them restartable, in case they get rolled back or deadlocked on their own).
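Something along these lines, say (a sketch; the error class and retry policy are hand-waved):

    // Run each deferred update in its own transaction, retrying once
    // if it gets picked as a deadlock victim.
    foreach ( $wgDeferredUpdateList as $update ) {
        $dbw = wfGetDB( DB_MASTER );
        for ( $attempt = 0; $attempt < 2; $attempt++ ) {
            $dbw->begin();
            try {
                $update->doUpdate();
                $dbw->commit();
                break;
            } catch ( DBQueryError $e ) {
                $dbw->rollback(); // deadlock or lock wait timeout: retry once
            }
        }
    }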
If the connection count on the ES master really is a problem (not just a symptom of a much larger problem), then that can be mitigated by closing the connections early. But I think the only reason we're seeing this come out on the ES servers is because they have the lowest number of maximum connections, so they fail first.
It's probably easier to just bump the connection limits on ES to match or exceed the core DBs. The actual activity should never be very expensive, so a sleeping connection won't hurt much.
-- brion
Brion Vibber wrote:
It's probably easier to just bump the connection limits on ES to match or exceed the core DBs. The actual activity should never be very expensive, so a sleeping connection won't hurt much.
Wouldn't that mean that on the next failure, the core DBs would fail before external storage, making the whole site unavailable instead of just read-only?
Platonides wrote:
Wouldn't that mean that on the next failure, the core DBs would fail before external storage, making the whole site unavailable instead of just read-only?
Not necessarily; it might fail in some entirely different and exciting way. :)
-- brion
Wouldn't it be a good idea to put things such as emails and stats updates into the job queue? (All stats updates could go under one job type, with a parameter to decide which.)
Then the slowness would be handled by the job runners, letting edits come through quickly. Since we're not doing it in-transaction anyway, there shouldn't be a big problem with it (we could probably do the same for logging, although it's not as important).
Assuming the job runners properly free connections, they shouldn't have any open connections except the one currently being used to update the stats; in the case of emails, no DB connections at all if we pass the data through in the job parameters (or connect, grab the data, and disconnect before even starting the email).
This would probably help lower the cost of stats updates, and stop emails from holding DB connections at all. It's probably a bit of treating the symptoms not the problem, but it would work for now.
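Rough shape of what I mean (class and parameter names made up, not a patch):

    // Hypothetical job: one job type for all stats updates, with a
    // parameter naming the counter to bump.
    class SiteStatsUpdateJob extends Job {
        function __construct( $title, $params ) {
            parent::__construct( 'siteStatsUpdate', $title, $params );
        }

        function run() {
            $field = $this->params['field']; // e.g. 'ss_total_edits'
            $dbw = wfGetDB( DB_MASTER );
            $dbw->update( 'site_stats',
                array( "$field = $field + 1" ), '*', __METHOD__ );
            $dbw->commit(); // free the lock before the next job runs
            return true;
        }
    }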
- mattj
Matt Johnston wrote:
Wouldn't it be a good idea to put things such as emails and stats updates into the job queue? (All stats updates could go under one job type, with a parameter to decide which.)
I think that would be overkill. Site stats is not slow; it takes a couple of milliseconds at most. Email does not need to be slow; it's just unreliable due to a sysadmin issue, namely the lack of a local queue. Not everyone has a working job queue, Wikimedia included. The effect of delaying such jobs via the job queue would be user-visible, and there would be a complexity penalty on the software as well.
-- Tim Starling
On Wed, Sep 24, 2008 at 11:40 PM, Tim Starling tstarling@wikimedia.org wrote:
Matt Johnston wrote:
Wouldn't it be a good idea to put things such as emails and stats updates into the job queue? (All stats updates could go under one job type, with a parameter to decide which.)
I think that would be overkill. Site stats is not slow; it takes a couple of milliseconds at most.
What's the issue with it, then?
Aryeh Gregor wrote:
What's the issue with it, then?
The issue is that under certain circumstances, MediaWiki locks the table for seconds at a time, blocking all other updates during that period. The circumstances are not precisely known, but there are plenty of candidates.
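To spell out the mechanism: site_stats has exactly one row, so every writer queues on the same row lock, and whatever the transaction does between its UPDATE and its COMMIT is time the lock stays held. A sketch of the failure pattern:

    $dbw = wfGetDB( DB_MASTER );
    $dbw->begin();
    $dbw->update( 'site_stats',
        array( 'ss_total_edits = ss_total_edits + 1' ), '*', __METHOD__ );
    // ...anything slow here (outgoing mail, more queries, a stalled
    // external call) keeps the row lock held; every concurrent editor
    // waits behind it until this transaction commits or InnoDB's lock
    // wait timeout fires.
    $dbw->commit();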
-- Tim Starling