This matter does NOT directly concern the Polish Wikipedia, only smaller projects. Since users of sister projects also look in here, I am passing along information about an outage that took place and its consequences.
-- Leinad
---------- Forwarded message ----------
From: Platonides <Platonides@gmail.com>
Date: 2010/12/26
Subject: [Wikitech-l] Christmas server failure report
To: wikitech-l@lists.wikimedia.org
Earlier today, /a filled up with binlogs on db27, which was the s3 and s7 master. Nagios had warned about it, but too early, and nobody noticed. The slaves lagged, there were lots of locks, and the wikis ground to a halt. Revisions made between 6:50 and 8:20 pm UTC were lost (although they can be manually reimported from db27). The new s3 and s7 master is db17, with only one slave: db25. After the master switch, we started having problems with cached revision text in memcached, caused by the duplication of old_id values, so we made the wikis read-only until UTC midnight.
We decided not to disable $wgRevisionCacheExpiry but to remove the faulty entries, thus I quickly prepared the script maintenance/purgeStaleMemcachedText.php to clean them.
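For readers unfamiliar with this kind of cleanup, here is a minimal sketch of the idea in Python. It is not the actual purgeStaleMemcachedText.php script. The in-memory FakeCache class, the key layout in text_key, the wiki name and the id range are all assumptions made for illustration. The point is only that every cache entry whose text id falls in the range that may have been reused after the master switch gets deleted, so the next read goes back to the database.

```python
class FakeCache:
    """In-memory stand-in for a memcached client (hypothetical, illustration only)."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        # Returns None on a cache miss.
        return self._data.get(key)

    def delete(self, key):
        # Returns True only if the key was present and has been removed.
        return self._data.pop(key, None) is not None


def text_key(wiki, text_id):
    # Assumed key layout; the real key format may differ.
    return f"{wiki}:revisiontext:textid:{text_id}"


def purge_stale_text(cache, wiki, first_suspect_id, last_suspect_id):
    """Delete cached revision text for every id in the inclusive suspect range.

    Returns the number of entries that were actually removed.
    """
    purged = 0
    for text_id in range(first_suspect_id, last_suspect_id + 1):
        if cache.delete(text_key(wiki, text_id)):
            purged += 1
    return purged


# Usage: id 100 predates the failure window, id 101 may have been reused.
cache = FakeCache()
cache.set(text_key("examplewiki", 100), "text saved before the failure window")
cache.set(text_key("examplewiki", 101), "text of a revision that was lost")
purged = purge_stale_text(cache, "examplewiki", 101, 105)
```

Deleting by id range rather than flushing the whole cache keeps valid entries warm, which matters when the database servers are already under load.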
There were problems on hewiki, since the data there did not get cleaned. In one instance, a value was still returned by $wgMemc->get even after a $wgMemc->delete on that same key (???). Other than the hewiki issues, it seemed to run fine. There will be lots of wrong entries in the diff and parser caches; each of them needs a manual action=purge, but a purge will clean them. The Flagged revs caches were not touched, so wikis using it may show the wrong content (with the additional fun of some users viewing the right one).
There are also PPFrame_DOM->expand errors that started around the same time, even on wikis on a different cluster. They usually happen only once, and simply reloading the page succeeds. https://bugzilla.wikimedia.org/show_bug.cgi?id=26429