Tim Starling wrote:
I did see something like this before. I didn't revert the ES changes because they weren't the issue, and the fact that the ES master went down first allowed the site to continue in read-only mode. You could have just increased the max connections on the ES masters for the same effect; the connection count on the core master would have overflowed instead.
But I did think I had found the root cause of the problem at the time. Obviously I hadn't.
Doing the revert totally changed the performance characteristics of the site, moving it from sitting around timing out to *being* readable.
I'm not sure what part was the problem, but something was definitely wrong...
I think the ES load balancing changes were useful, and are a good way to progress towards higher availability. I think a better way to fix the site_stats contention would have been to insert an unconditional COMMIT in SiteStatsUpdate::doUpdate().
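As a rough illustration of the "unconditional COMMIT" idea above: the sketch below is Python with sqlite3 standing in for the real database layer, not MediaWiki's actual PHP code. The helper name do_site_stats_update is hypothetical, and the single-column site_stats table is a simplification. The only point is that the commit happens right after the counter update, so the lock on the hot row is not held for the rest of the request.

```python
import sqlite3


def do_site_stats_update(conn, edits_delta):
    """Apply a counter delta and commit immediately (hypothetical helper)."""
    conn.execute(
        "UPDATE site_stats SET ss_total_edits = ss_total_edits + ?",
        (edits_delta,),
    )
    # Unconditional commit: end the transaction right away so the lock on
    # the site_stats row is released before any other work happens.
    conn.commit()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE site_stats (ss_total_edits INTEGER)")
    conn.execute("INSERT INTO site_stats VALUES (0)")
    conn.commit()

    do_site_stats_update(conn, 1)
    # After the helper returns, no transaction should still be open.
    still_open = conn.in_transaction
    total = conn.execute("SELECT ss_total_edits FROM site_stats").fetchone()[0]
    conn.close()
    return total, still_open
```

Calling demo() exercises the helper against an in-memory database and reports the counter value together with whether a transaction was left open.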
Well, my main concern there is that if operations are weirdly ordered you can end up with a total "transaction" half-committed... on the other hand, these are done in deferred updates. They're in theory meant to be something that won't kill ya if it fails, otherwise they'd have been... not... deferred.
Either we need to rethink the old deferred updates system entirely and turn them into immediate applications, or we should make them operate as separate transactions (and potentially restartable in case they separately get rolled back or deadlocked).
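To make the second option a bit more concrete, here is a minimal sketch of running one deferred update as its own restartable transaction. It is Python with sqlite3 as a stand-in, not the existing deferred update code; run_deferred_update, the Deadlock exception, and the retry limit are all assumptions made for illustration. It also assumes the main request transaction has already been committed before the deferred update starts.

```python
import sqlite3


class Deadlock(Exception):
    """Stand-in for the driver's deadlock or lock-wait-timeout error."""


def run_deferred_update(conn, update, max_attempts=3):
    """Run one deferred update in its own transaction, retrying on deadlock.

    Returns the attempt number that succeeded, or None if every attempt
    was rolled back.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            update(conn)
            conn.commit()      # the update stands or falls on its own
            return attempt
        except Deadlock:
            conn.rollback()    # discard partial work, then try again
    return None


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE site_stats (ss_total_edits INTEGER)")
    conn.execute("INSERT INTO site_stats VALUES (0)")
    conn.commit()

    calls = {"n": 0}

    def flaky_update(c):
        # Writes first, then simulates a deadlock on the first call only.
        calls["n"] += 1
        c.execute("UPDATE site_stats SET ss_total_edits = ss_total_edits + 1")
        if calls["n"] == 1:
            raise Deadlock("simulated")

    attempts = run_deferred_update(conn, flaky_update)
    total = conn.execute("SELECT ss_total_edits FROM site_stats").fetchone()[0]
    conn.close()
    return attempts, total
```

In the demo the first attempt is rolled back, including its partial write, and the second attempt commits, so the counter is incremented exactly once.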
If the connection count on the ES master really is a problem (not just a symptom of a much larger problem), then that can be mitigated by closing the connections early. But I think the only reason we're seeing this come out on the ES servers is that they have the lowest maximum connection limit, so they fail first.
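A small sketch of what "closing the connections early" could look like, again in Python with sqlite3 as a stand-in for the ES database connection. fetch_blob and open_es_connection are hypothetical names, and the two-column blobs table is a simplification. The connection is closed as soon as the text has been read, rather than sitting idle until the end of the request.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def fetch_blob(open_es_connection, blob_id):
    """Read one blob and release the connection immediately afterwards."""
    # closing() calls conn.close() as soon as the block exits, so the
    # connection slot is freed before the rest of the request runs.
    with closing(open_es_connection()) as conn:
        row = conn.execute(
            "SELECT blob_text FROM blobs WHERE blob_id = ?", (blob_id,)
        ).fetchone()
    return row[0] if row else None


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "es.db")
        with closing(sqlite3.connect(path)) as setup:
            setup.execute(
                "CREATE TABLE blobs (blob_id INTEGER PRIMARY KEY, blob_text TEXT)"
            )
            setup.execute("INSERT INTO blobs VALUES (1, 'revision text')")
            setup.commit()

        opened = []

        def open_es_connection():
            conn = sqlite3.connect(path)
            opened.append(conn)
            return conn

        text = fetch_blob(open_es_connection, 1)

        # A closed sqlite3 connection raises ProgrammingError when used.
        try:
            opened[0].execute("SELECT 1")
            closed = False
        except sqlite3.ProgrammingError:
            closed = True
    return text, closed
```

The demo checks both that the blob text comes back and that the connection is already closed by the time fetch_blob returns.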
It's probably easier to just bump the connection limits on ES to match or exceed the core DBs. The actual activity should never be very expensive, so a sleeping connection won't hurt much.
-- brion