Re: [Wikitech-l] Page saving slowness and some loading breakage today

25 Sep 2008


      Brion Vibber wrote:
...
Posted this summary on blog, going out to en.planet.wikimedia.org...
http://leuksman.com/log/2008/09/24/why-is-everything-broken-this-week/
We’ve tracked down today’s problems to a combination of a couple of things:

   1. There’ve been ongoing database locking issues with the site
statistics updates; these would all block on each other, making page
saves very slow at times
   2. … which held open database connections, causing the text storage
servers to start locking out new connections …
   3. … which exacerbated problems with the failover behavior of recent
changes to the storage and load balancing code.
I did see something like this before, and the reason I didn't revert the
ES changes is that they weren't the issue, and the fact that the ES master
went down first allowed the site to continue in read-only mode. You could
have just increased the max connections on the ES masters, for the same
effect. The connection count on the core master would have overflowed instead.
But I did think I had found the root cause of the problem at the time;
obviously I hadn't.
I think the ES load balancing changes were useful, and are a good way to
progress towards higher availability. I think a better way to fix the
site_stats contention would have been to insert an unconditional COMMIT in
SiteStatsUpdate::doUpdate().
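
To illustrate the idea of committing right after the counter update, here
is a rough sketch in Python with SQLite standing in for the real database.
It is not the MediaWiki code: the do_update() and can_write() helpers, the
single-column table layout, and the use of SQLite's write lock in place of
the production row locks are all assumptions made for the demonstration.

```python
import os
import sqlite3
import tempfile


def do_update(conn, edits_delta, commit_immediately=True):
    """Bump the edit counter; optionally commit straight away (hypothetical helper)."""
    conn.execute(
        "UPDATE site_stats SET ss_total_edits = ss_total_edits + ?",
        (edits_delta,),
    )
    if commit_immediately:
        # The unconditional COMMIT: the write lock is released here instead
        # of being held until the rest of the request finishes.
        conn.commit()


def can_write(conn):
    """Return True if this connection can update the counter without blocking."""
    try:
        do_update(conn, 1)
        return True
    except sqlite3.OperationalError:
        # SQLite reports "database is locked" when another writer holds the lock.
        conn.rollback()
        return False


def demo():
    path = os.path.join(tempfile.mkdtemp(), "stats.db")
    # timeout=0 makes a blocked writer fail at once instead of waiting.
    first = sqlite3.connect(path, timeout=0)
    second = sqlite3.connect(path, timeout=0)
    first.execute("CREATE TABLE site_stats (ss_total_edits INTEGER)")
    first.execute("INSERT INTO site_stats VALUES (0)")
    first.commit()

    # Without the early commit, the second writer is locked out.
    do_update(first, 1, commit_immediately=False)
    blocked = not can_write(second)
    first.commit()

    # With the early commit, the second writer gets through.
    do_update(first, 1, commit_immediately=True)
    free = can_write(second)

    total = first.execute("SELECT ss_total_edits FROM site_stats").fetchone()[0]
    first.close()
    second.close()
    return blocked, free, total
```

Calling demo() shows whether the second writer was locked out in each case.
The point of the sketch is only that the lock lives as long as the open
transaction does, so a commit placed directly after the update shortens
the window in which other page saves have to queue.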
If the connection count on the ES master really is a problem (not just a
symptom of a much larger problem), then that can be mitigated by closing
the connections early. But I think the only reason we're seeing this come
out on the ES servers is because they have the lowest number of maximum
connections, so they fail first.
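
The "lowest limit fails first" argument can be sketched with a toy model.
Everything here is hypothetical: the Server class, the first_to_fail()
helper and the limits of 500 and 100 are made-up illustrations, not the
real configuration. Each stalled request is assumed to hold one connection
on every server until it finishes, unless that server's connection is
closed early.

```python
class TooManyConnections(Exception):
    pass


class Server:
    """Toy server that only tracks its open connection count."""

    def __init__(self, name, max_connections):
        self.name = name
        self.max_connections = max_connections
        self.open = 0

    def connect(self):
        if self.open >= self.max_connections:
            raise TooManyConnections(self.name)
        self.open += 1

    def close(self):
        self.open -= 1


def first_to_fail(stalled_requests, close_early=()):
    """Return the name of the first server to refuse a connection, or None.

    Servers named in close_early have their connection released as soon as
    the request is done with them, rather than when the request ends.
    """
    # Hypothetical limits: the storage master allows far fewer connections.
    servers = [Server("core", 500), Server("es", 100)]
    for _ in range(stalled_requests):
        for server in servers:
            try:
                server.connect()
            except TooManyConnections:
                return server.name
            if server.name in close_early:
                server.close()
    return None


if __name__ == "__main__":
    print(first_to_fail(600))                      # es hits its limit first
    print(first_to_fail(600, close_early={"es"}))  # the overflow moves to core
    print(first_to_fail(50))                       # nothing overflows
```

In this model, closing the ES connections early only moves the overflow to
the core master, because the stalled requests are still piling up behind
the same contention.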
-- Tim Starling
