Re: [Wikitech-l] Page saving slowness and some loading breakage today

25 Sep 2008


Wouldn't it be a good idea to put things such as emails and stats updates 
into the job queue? (All stats updates could be under one job type, with 
just a parameter to decide which stat to update.)
Then the slowness would be handled by the job runners, letting edits come 
through quickly. Since we're not doing it in-transaction anyway, there 
shouldn't be a big problem with it (we could probably do the same for 
logging, although it's not as important).
Assuming the job runners properly free connections, they shouldn't have any 
open connections except the one they are currently using to update the stats 
(and in the case of emails, no db connections at all if we pass the data 
through in parameters, or if we connect, grab it, then disconnect before 
even starting the email).
This would probably help lower the cost of stats updates, and stop emails 
from holding DB connections at all. It's probably a bit of treating the 
symptoms not the problem, but it would work for now.
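
A minimal sketch of the idea above, in Python rather than MediaWiki's PHP. Everything here is hypothetical: the `Job` and `JobQueue` classes, the job kinds "statsUpdate" and "sendEmail", and the parameter names are stand-ins for illustration, not the real job queue API. It only shows the shape: the save path enqueues, one job type covers all stats updates via a parameter, and the email job carries its data in parameters so the runner needs no DB connection to send it.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Job:
    kind: str  # hypothetical job kinds: "statsUpdate" or "sendEmail"
    params: dict = field(default_factory=dict)


class JobQueue:
    """In-memory stand-in for a real job queue (assumption, not MediaWiki's)."""

    def __init__(self):
        self._jobs = deque()

    def push(self, job):
        self._jobs.append(job)

    def pop(self):
        return self._jobs.popleft() if self._jobs else None

    def __len__(self):
        return len(self._jobs)


def save_edit(queue, recipient):
    """Request path: enqueue the slow work instead of doing it inline."""
    # One job type for every stats update; a parameter says which stat.
    queue.push(Job("statsUpdate", {"field": "edits", "delta": 1}))
    # Everything the mail needs travels in the parameters, so the runner
    # does not have to hold a database connection while sending it.
    queue.push(Job("sendEmail", {"to": recipient, "subject": "Page changed"}))


def run_jobs(queue, stats, outbox):
    """Job runner: drain the queue, dispatching on the job kind."""
    done = 0
    while (job := queue.pop()) is not None:
        if job.kind == "statsUpdate":
            name = job.params["field"]
            stats[name] = stats.get(name, 0) + job.params["delta"]
        elif job.kind == "sendEmail":
            outbox.append((job.params["to"], job.params["subject"]))
        else:
            raise ValueError(f"unknown job kind: {job.kind}")
        done += 1
    return done
```

In this sketch `save_edit` returns as soon as the two jobs are queued; any contention on the stats moves into `run_jobs`, which is the trade-off described above.
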
- mattj
--------------------------------------------------
From: "Tim Starling" tstarling@wikimedia.org
Sent: Thursday, September 25, 2008 3:18 PM
To: wikitech-l@lists.wikimedia.org
Subject: Re: [Wikitech-l] Page saving slowness and some loading 
breakage today
...
Brion Vibber wrote:
...
Posted this summary on blog, going out to en.planet.wikimedia.org...
http://leuksman.com/log/2008/09/24/why-is-everything-broken-this-week/
We’ve tracked down today’s problems to a combination of a couple of 
things:
   1. There’ve been ongoing database locking issues with the site
statistics updates; these would all block on each other, making page
saves very slow at times
   2. … which held open database connections, causing the text storage
servers to start locking out new connections …
   3. … which exacerbated problems with the failover behavior of recent
changes to the storage and load balancing code.
I did see something like this before, and the reason I didn't revert the
ES changes is because they weren't the issue, and the fact that ES master
went down first allowed the site to continue in read-only mode. You could
have just increased the max connections on the ES masters, for the same
effect. The connection count on the core master would have overflowed 
instead.
But I did think I had found the root cause of the problem at the time;
obviously I hadn't.
I think the ES load balancing changes were useful, and are a good way to
progress towards higher availability. I think a better way to fix the
site_stats contention would have been to insert an unconditional COMMIT in
SiteStatsUpdate::doUpdate().
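
A rough sketch of what an unconditional commit looks like, using Python's sqlite3 as a stand-in for the real database layer. This is not the actual SiteStatsUpdate code: `do_update` and `make_db` are hypothetical helpers, and the one-column `site_stats` table is a simplification assumed for illustration. The point is only that the transaction ends right after the counter update, so whatever locks it took are not held for the rest of the request.

```python
import sqlite3


def make_db():
    """Create a throwaway in-memory table (simplified, assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE site_stats (ss_total_edits INTEGER NOT NULL)")
    conn.execute("INSERT INTO site_stats VALUES (0)")
    conn.commit()
    return conn


def do_update(conn, delta_edits):
    """Apply a stats delta, then commit straight away."""
    conn.execute(
        "UPDATE site_stats SET ss_total_edits = ss_total_edits + ?",
        (delta_edits,),
    )
    # Unconditional commit: end the transaction now, whatever the caller
    # does next, so locks taken by the UPDATE are released immediately.
    conn.commit()
```

After `do_update` returns, `conn.in_transaction` is false, so a slow step later in the same request cannot keep other writers waiting on this table.
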
If the connection count on the ES master really is a problem (not just a
symptom of a much larger problem), then that can be mitigated by closing
the connections early. But I think the only reason we're seeing this come
out on the ES servers is because they have the lowest number of maximum
connections, so they fail first.
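
To make the "lowest limit fails first" reasoning concrete, here is a toy model in Python. All of it is assumed for illustration: the `Server` class, the server names, and the connection limits are made up and are not the real configuration. Each stalled request holds one connection to every server, unless that server is listed in `close_early`, in which case the connection is released before the request stalls.

```python
class Server:
    """Toy server that refuses connections beyond max_connections."""

    def __init__(self, name, max_connections):
        self.name = name
        self.max_connections = max_connections
        self.open = 0

    def connect(self):
        if self.open >= self.max_connections:
            raise ConnectionError(f"{self.name}: too many connections")
        self.open += 1

    def close(self):
        self.open -= 1


def first_to_fail(servers, stalled_requests, close_early=()):
    """Return the name of the first server to refuse a connection, or None.

    Every stalled request connects to each server in turn and keeps the
    connection open, except for servers named in close_early.
    """
    for _ in range(stalled_requests):
        for server in servers:
            try:
                server.connect()
            except ConnectionError:
                return server.name
            if server.name in close_early:
                server.close()  # released before the request stalls
    return None
```

With made-up limits of 500 for the core master and 100 for the ES master, the ES master is the first to refuse. Raising its limit above the core master's, or closing its connections early, moves the overflow to the core master instead, which is why the overflow looks like a symptom rather than the cause.
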
-- Tim Starling

Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
