Hi!
Would you mind giving us a few details about the whys, hows, and solutions? I've had a couple of out-of-memory problems recently (especially while trying to run DumpHTML), and our first attempt at a fix was to check the efficiency of our extensions. That is certainly taking a while.
Apparently we hit a combination of bad things. This is how I imagine it could have happened:
- There was a huge ongoing batch job (recompression of blobs).
- There was some huge transaction (I haven't identified it yet; haven't looked into it) that locked up lots of stuff; maybe it was recompression-related, maybe it was not.
- Crash recovery identified half a million uncommitted row changes.
- MySQL's internal undo segment limit was reached: there were 1024 active transactions that had modified data. (The other option, running out of the 1G of transaction log space, was unlikely.) There's a sketch below this list of how to spot this.
- More and more hanging transactions led to more and more clients connecting.
- Slaves reported lag (as there were no new transactions incoming).
- The LB sent all the read queries to the master.
- The master had much more work to do; it probably started allocating more memory (I don't see a trace on Ganglia's daily graph anymore, though..).
- The kernel OOM killer jumped in.
- The LB decided that all slaves were lagged and still wanted to use the master (this is new code; we should have failed gracefully and switched the site to read-only instead, as the master was down; see the second sketch below).
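For what it's worth, this kind of pileup shows up in InnoDB's own status output. A minimal sketch of checking it, assuming the pymysql library; the hostname and credentials are made up for illustration:

    import re
    import pymysql

    # SHOW ENGINE INNODB STATUS returns one row: (Type, Name, Status text).
    conn = pymysql.connect(host="db2.example.org", user="monitor", password="***")
    with conn.cursor() as cur:
        cur.execute("SHOW ENGINE INNODB STATUS")
        status = cur.fetchone()[2]
    conn.close()

    # Every transaction that modified rows holds an undo slot, and InnoDB's
    # single rollback segment only has 1024 of them.
    active = re.findall(r"---TRANSACTION.*\bACTIVE\b", status)
    undo = sum(int(n) for n in re.findall(r"undo log entries (\d+)", status))
    print(len(active), "active write transactions,", undo, "uncommitted row changes")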
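And roughly the LB logic at fault, as I understand it. This is a toy sketch, not our actual LB code; the helper names, fields, and threshold are all made up:

    from dataclasses import dataclass
    from typing import Optional

    MAX_LAG = 30.0  # seconds; the real threshold differs

    @dataclass
    class Server:
        name: str
        alive: bool
        lag: Optional[float]  # None = lag could not be measured

    def pick_reader(master, slaves):
        """Pick a server for read queries; None means 'go read-only'."""
        ok = [s for s in slaves
              if s.alive and s.lag is not None and s.lag < MAX_LAG]
        if ok:
            return min(ok, key=lambda s: s.lag)
        if master.alive:
            # What the new code did: dump all reads on the master, even though
            # universal slave lag usually means the master itself is wedged.
            return master
        return None  # what we should do: fail gracefully to read-only

With every slave lagged and the master dead, pick_reader ends up returning None, and the caller should flip the site to read-only instead of hammering the master.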
So, for a few minutes we showed the down notice, then I switched the site to read-only, and within a few more minutes, instead of waiting for crash recovery to finish, I promoted another slave to master (rough sketch below).
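The promotion itself is conceptually just a few statements; a rough sketch, again assuming pymysql, with made-up hosts and credentials (the real procedure also involves pushing the config change out to the application):

    import pymysql

    def run(host, *statements):
        conn = pymysql.connect(host=host, user="repl_admin", password="***")
        with conn.cursor() as cur:
            for stmt in statements:
                cur.execute(stmt)
        conn.close()

    # Make the chosen slave a master: stop replication, start a fresh binlog,
    # and let it accept writes.
    run("db4.example.org", "STOP SLAVE", "RESET MASTER",
        "SET GLOBAL read_only = 0")

    # Repoint the remaining slaves; with no explicit position given, they
    # start from the beginning of the new master's (fresh) binlog.
    for slave in ("db5.example.org", "db6.example.org"):
        run(slave,
            "STOP SLAVE",
            "CHANGE MASTER TO MASTER_HOST='db4.example.org', "
            "MASTER_USER='repl', MASTER_PASSWORD='***'",
            "START SLAVE")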
Strange, though; it's a pity I didn't see what was happening at the beginning.
Our enwiki master had been up, running, and kicking for nearly two years (we even had a stable master-slave relationship between db2 and db3 that lasted over a year ;-)
Cheers,