Christmas server failure report - Wikitech-l

25 Dec 2010

Earlier today, /a filled with binlogs in db27, which was s3 & s7 master.
nagios had warned too early / nobody noticed. Slaves lagged, lots of
locks, the wikis got to a halt.
Revisions between 6:50 and 8:20 pm UTC were lost (although they can be
manually reimported from db27).
The new s3 and s7 master is db17, with only one slave: db25.
After the master switch, we started having problems due to cached
revision text in memcached, due to the duplication of old_id values,
so we made them read-only until UTC midnight.

We decided not to disable $wgRevisionCacheExpiry but to remove the
faulty entries, thus I quickly prepared the script
maintenance/purgeStaleMemcachedText.php to clean them.

There were problems in hewiki, since data there didn't clean. On one
instance doing $wgMemc->get persisted even after a $wgMemc->delete on
that same key (???).
Other than the hewiki issues, it seemed to run fine. There will be lots
of wrong entries in diff and parser cache needing a manual action=purge
but a purge will clean them.
Flagged revs caches were not touched. Wikis using it may show the wrong
content (with the additional fun of some users viewing the right one).

There are also PPFrame_DOM->expand errors that started around the same
time, even on wikis on a different cluster. They usually only happen
once, and it succeeds just reloading.
https://bugzilla.wikimedia.org/show_bug.cgi?id=26429