Hi!
Would you mind giving us a few details about the whys, hows, and
solutions? I've had a couple of out-of-memory problems recently
(especially while trying to run DumpHTML), and our first attempt at a
fix was to check the efficiency of our extensions. That is certainly
taking a while.
Apparently we hit a combination of bad things. This is how I imagine
it could have happened.
There was a huge ongoing batch job (recompression of blobs).
There was some huge transaction (I haven't identified it yet, nor
looked into it) that locked up lots of stuff; maybe it was
recompression-related, maybe it was not. Crash recovery identified
half a million uncommitted row changes.
MySQL's internal undo-segment limit was reached (there were 1024
active transactions that had modified data; the other possibility,
running out of the 1G of transaction log space, was unlikely).
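To make that limit concrete: InnoDB of that era had 1024 undo slots in
its rollback segment, so only 1024 transactions that modify data can be
open at once. Here is a toy sketch of that failure mode (the class and
names are illustrative, not actual MySQL internals):

```python
# Toy model of InnoDB's rollback segment with 1024 undo slots
# (class and names are illustrative, not actual MySQL internals).

class UndoSlotsExhausted(Exception):
    pass

class UndoSegmentPool:
    def __init__(self, slots=1024):
        self.slots = slots
        self.active = set()          # ids of open write transactions

    def begin_write_txn(self, txn_id):
        # Every transaction that modifies data needs an undo slot;
        # once all slots are taken, new writers are refused.
        if len(self.active) >= self.slots:
            raise UndoSlotsExhausted("no free undo slots")
        self.active.add(txn_id)

    def commit(self, txn_id):
        self.active.discard(txn_id)  # slot freed on commit/rollback

pool = UndoSegmentPool()
for i in range(1024):                # a pile of hanging transactions...
    pool.begin_write_txn(i)

try:
    pool.begin_write_txn(1024)       # ...and the next writer is refused
except UndoSlotsExhausted as e:
    print("write refused:", e)
```

With one huge transaction blocking commits, hanging writers pile up
until every slot is held and the server can accept no new writes.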
More and more hanging transactions led to more and more clients
connecting
Slaves reported lag (as there were no new transactions incoming)
The LB sent all the read queries to the master.
The master had much more work to do.
It probably started allocating more memory (though I no longer see a
trace of it on Ganglia's daily graph..).
The kernel OOM killer jumped in.
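The connection pile-up and the OOM are linked: mysqld's footprint is
roughly its global buffers plus a per-thread allocation for every
connected client. A back-of-the-envelope sketch (all numbers below are
made up for illustration, not our actual configuration):

```python
# Back-of-the-envelope sketch of why a flood of connections can push
# mysqld past the machine's memory (all numbers are made up, not our
# actual configuration).

GLOBAL_BUFFERS_MB = 8000   # e.g. buffer pool, key buffer
PER_THREAD_MB = 12         # sort/join/read buffers, thread stack, ...
RAM_MB = 16000

def mysqld_footprint_mb(connections):
    return GLOBAL_BUFFERS_MB + connections * PER_THREAD_MB

# With slaves lagging, the LB piles all readers onto the master:
for clients in (100, 400, 700):
    mb = mysqld_footprint_mb(clients)
    print(clients, "clients ->", mb, "MB",
          "(OOM territory)" if mb > RAM_MB else "")
```

The per-thread term is small, but multiplied by hundreds of stuck
clients it is enough to attract the OOM killer.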
The LB decided that all the slaves were lagged and still wanted to use
the master (this is new code; we should have failed gracefully and
switched the site to read-only instead, as the master was down).
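The routing decision we should have made can be sketched like this
(function, threshold, and names are hypothetical, not our actual LB
code):

```python
# Sketch of the load-balancer read-routing decision, including the
# graceful fallback we should have had (names and the threshold are
# hypothetical, not our actual LB code).

MAX_LAG_SECONDS = 30

def route_read(master_alive, slave_lags):
    """Pick a backend for a read query, or degrade to read-only mode."""
    candidates = [host for host, lag in slave_lags.items()
                  if lag <= MAX_LAG_SECONDS]
    if candidates:
        # Normal case: prefer the least-lagged healthy slave.
        return min(candidates, key=lambda h: slave_lags[h])
    if master_alive:
        # Old behaviour: every read lands on the master.
        return "master"
    # Fallback: all slaves lagged AND master down -> fail gracefully
    # by switching the site to read-only instead of hammering a corpse.
    return "SITE_READ_ONLY"

# What happened that night: slaves lagged, master dead.
print(route_read(False, {"db2": 500, "db3": 600}))
```

The key point is the final branch: "all slaves lagged" must not imply
"use the master" when the master itself is what has gone away.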
So, for a few minutes we showed the down notice; then I switched the
site to read-only, and within a few minutes, instead of waiting for
crash recovery, I promoted another slave to master.
Strange, though; it's a pity I didn't see what was happening at the
beginning.
Our enwiki master was up and running and kicking for nearly two years
(we even had a stable master-slave relationship between db2 and db3
that lasted over a year ;-)
Cheers,
--
Domas Mituzas --
http://dammit.lt/ -- [[user:midom]]
P.S. I was going from shower to bed, it was past 2am, and a strange
intuition told me to check "what's up". I rarely turn on screens at
that time of night :)