Hi!
Would you mind giving us a few details about the whys, hows, and solutions? I've had a couple of out-of-memory problems recently (especially while trying to run DumpHTML), and our first attempt at a fix was to check the efficiency of our extensions. That is certainly taking a while.
Apparently we hit a combination of bad things. This is how I imagine it could have happened:
- There was a huge ongoing batch job (recompression of blobs).
- There was some huge transaction (I haven't identified it yet; haven't looked into it) that locked up lots of stuff; maybe it was recompression-related, maybe it was not.
- Crash recovery identified half a million uncommitted row changes.
- MySQL's internal undo segment limit was reached: there were 1024 active transactions that had modified data. (The other option, running out of the 1G of transaction log space, was unlikely.) There's a sketch below this list of how to spot this.
- More and more hanging transactions led to more and more clients connecting.
- Slaves reported lag (as there were no new transactions incoming).
- The LB sent all the read queries to the master.
- The master had much more work to do; it probably started allocating more memory (I don't see a trace on Ganglia's daily graph anymore, though..).
- The kernel OOM killer jumped in.
- The LB decided that all slaves were lagged and still wanted to use the master (this is new code; we should have failed gracefully and switched the site to read-only instead, as the master was down; see the second sketch below).
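For what it's worth, this kind of pileup shows up in InnoDB's own status output. A minimal sketch of checking it, assuming the pymysql library; the hostname and credentials are made up for illustration:

    import re
    import pymysql

    # SHOW ENGINE INNODB STATUS returns one row: (Type, Name, Status text).
    conn = pymysql.connect(host="db2.example.org", user="monitor", password="***")
    with conn.cursor() as cur:
        cur.execute("SHOW ENGINE INNODB STATUS")
        status = cur.fetchone()[2]
    conn.close()

    # Every transaction that modified rows holds an undo slot, and InnoDB's
    # single rollback segment only has 1024 of them.
    active = re.findall(r"---TRANSACTION.*\bACTIVE\b", status)
    undo = sum(int(n) for n in re.findall(r"undo log entries (\d+)", status))
    print(len(active), "active write transactions,", undo, "uncommitted row changes")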
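And roughly the LB logic at fault, as I understand it. This is a toy sketch, not our actual LB code; the helper names, fields, and threshold are all made up:

    from dataclasses import dataclass
    from typing import Optional

    MAX_LAG = 30.0  # seconds; the real threshold differs

    @dataclass
    class Server:
        name: str
        alive: bool
        lag: Optional[float]  # None = lag could not be measured

    def pick_reader(master, slaves):
        """Pick a server for read queries; None means 'go read-only'."""
        ok = [s for s in slaves
              if s.alive and s.lag is not None and s.lag < MAX_LAG]
        if ok:
            return min(ok, key=lambda s: s.lag)
        if master.alive:
            # What the new code did: dump all reads on the master, even though
            # universal slave lag usually means the master itself is wedged.
            return master
        return None  # what we should do: fail gracefully to read-only

With every slave lagged and the master dead, pick_reader ends up returning None, and the caller should flip the site to read-only instead of hammering the master.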
So, for a few minutes we showed the down notice, then I switched the site to read-only, and within a few more minutes, instead of waiting for crash recovery to finish, I promoted another slave to master (rough sketch below).
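The promotion itself is conceptually just a few statements; a rough sketch, again assuming pymysql, with made-up hosts and credentials (the real procedure also involves pushing the config change out to the application):

    import pymysql

    def run(host, *statements):
        conn = pymysql.connect(host=host, user="repl_admin", password="***")
        with conn.cursor() as cur:
            for stmt in statements:
                cur.execute(stmt)
        conn.close()

    # Make the chosen slave a master: stop replication, start a fresh binlog,
    # and let it accept writes.
    run("db4.example.org", "STOP SLAVE", "RESET MASTER",
        "SET GLOBAL read_only = 0")

    # Repoint the remaining slaves; with no explicit position given, they
    # start from the beginning of the new master's (fresh) binlog.
    for slave in ("db5.example.org", "db6.example.org"):
        run(slave,
            "STOP SLAVE",
            "CHANGE MASTER TO MASTER_HOST='db4.example.org', "
            "MASTER_USER='repl', MASTER_PASSWORD='***'",
            "START SLAVE")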
Strange, though; it's a pity I didn't see what was happening at the beginning.
Our enwiki master had been up, running, and kicking for nearly two years (we even had a stable master-slave relationship between db2 and db3 that lasted over a year ;-)
Cheers,