Ryan Lane wrote a script to purge some of the FlaggedRevs memcached entries; that ran last night as well.
The DOM-related errors all seem to have come from srv227; apache on that host was restarted about half an hour ago and the results look good.
Ariel
On Sun, 26-12-2010, at 01:49 +0100, Platonides wrote:
Earlier today, /a on db27, which was the s3 and s7 master, filled up with binlogs. Nagios had warned about it, but too early, and nobody noticed. Slaves lagged, there were lots of locks, and the wikis ground to a halt. Revisions made between 6:50 and 8:20 pm UTC were lost (although they can be manually reimported from db27). The new s3 and s7 master is db17, with only one slave: db25. After the master switch, we started having problems with cached revision text in memcached, because old_id values were duplicated, so we made the affected wikis read-only until midnight UTC.
We decided not to disable $wgRevisionCacheExpiry but to remove the faulty entries instead, so I quickly prepared the script maintenance/purgeStaleMemcachedText.php to clean them up.
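To illustrate the idea behind such a purge, here is a minimal sketch in Python rather than the actual script. It assumes the cached text is keyed by wiki name and old_id, and that the range of reused old_id values is known. The key format, the function names and the dict standing in for memcached are all hypothetical.

```python
def text_key(wiki, old_id):
    # Hypothetical key layout; the real key format may differ.
    return f"{wiki}:revisiontext:textid:{old_id}"


def purge_stale_text(cache, wiki, first_id, last_id):
    """Drop cached text for every old_id in [first_id, last_id].

    `cache` is a plain dict standing in for memcached.
    Returns the number of entries actually removed.
    """
    removed = 0
    for old_id in range(first_id, last_id + 1):
        # pop() with a default does nothing if the key is absent.
        if cache.pop(text_key(wiki, old_id), None) is not None:
            removed += 1
    return removed


# Example: ids 100-105 were reused after the master switch.
cache = {
    text_key("examplewiki", 99): "text cached before the lost range",
    text_key("examplewiki", 100): "text of a lost revision",
    text_key("examplewiki", 101): "text of another lost revision",
}
removed = purge_stale_text(cache, "examplewiki", 100, 105)
```

Entries outside the reused range are left alone, so only text that could belong to a lost revision is dropped.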
There were problems on hewiki, since the data there didn't get cleaned. In one instance, a $wgMemc->get kept returning the entry even after a $wgMemc->delete on that same key (???). Other than the hewiki issues, the script seemed to run fine. There will be lots of wrong entries in the diff and parser caches; each needs a manual action=purge, but a purge will clean them. FlaggedRevs caches were not touched. Wikis using FlaggedRevs may show the wrong content (with the additional fun of some users viewing the right one).
There are also PPFrame_DOM->expand errors that started around the same time, even on wikis on a different cluster. They usually happen only once, and simply reloading the page succeeds. https://bugzilla.wikimedia.org/show_bug.cgi?id=26429
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l