We had a serious MySQL crash on suda with associated data corruption today (6/6). There's a summary of the events leading up to the crash at http://openfacts.berlios.de/index-en.phtml?title=Wikipedia_plans
Whether it was the kill -9 that led to the corruption, or whether the database was already corrupted and therefore did not respond, we do not know; in any case, there seems to have been no alternative to killing it (people on #mysql knew nothing either).
Shaihulud made a copy of the CUR tables from all wikis earlier today and imported it on Ariel. We've switched the live wikis to readonly from Ariel; readonly because Ariel doesn't have OLD and lots of other stuff, because it's not sufficiently tested, and because we'd like to prevent any data loss if possible.
Tim created a special "readonly" user on ariel for this purpose.
The following have been changed as long as we are in readonly mode:
- Counters disabled on all wikis
- Linkscc disabled
- readonly file set to /home/wikipedia/common/readonly
- user_newtalk disabled
- $wgDatabaseServer and $wgDBuser changed to ariel
There will still be lots of error messages, and because the OLD tables are not on Ariel, revision histories are missing, etc. This is *only* to make sure that people can read our articles.
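For readers unfamiliar with the readonly-file approach, here is a minimal sketch of the idea in Python. It is not the wiki software's actual code: only the file path comes from the message above, and the function and exception names are hypothetical. The assumption is that the presence of the file means "refuse writes" and that its contents explain why.

```python
# Path taken from the message above; every other name here is hypothetical.
READONLY_FILE = "/home/wikipedia/common/readonly"


class ReadOnlyError(Exception):
    """Raised when a write is attempted while the lock file exists."""


def readonly_reason(path=READONLY_FILE):
    """Return the lock file's text if the site is read-only, else None."""
    try:
        with open(path, encoding="utf-8") as fh:
            # An empty lock file still means read-only, so supply a default.
            return fh.read().strip() or "read-only"
    except FileNotFoundError:
        return None


def guard_write(path=READONLY_FILE):
    """Call before any write; raises while the lock file is present."""
    reason = readonly_reason(path)
    if reason is not None:
        raise ReadOnlyError(reason)
```

The appeal of a file-based switch is that it can be flipped by anyone with shell access, without touching the database that is being repaired.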
The next step is to fix the data corruption on suda.
Regards,
Erik
Current status:
We've managed to recover a dump of enwiki.cur minus a couple of damaged pages (one on [[Surrey, Virginia]], and one or two next to it). Currently running a backup of everything else so we can dump-and-restore to a clean database.
We should be back online sometime tonight where "tonight" is defined as "before I get any sleep".
Also Ariel seems to have developed some sort of difficulty.
-- brion vibber (brion @ pobox.com)
On Jun 6, 2004, at 7:00 PM, Brion Vibber wrote:
We should be back online sometime tonight where "tonight" is defined as "before I get any sleep".
I just want to take a moment to say thanks. That kind of dedication is much appreciated. Wikipedia would never have become what it is today without your help or this kind of dedication. It's easy for people to say "X is broken, *somebody* fix it", and it seems you are more often than not that *somebody*.
Lightning
I second that. Am in awe of you guys' ability and dedication. Much thanks.
Wikitech-l mailing list Wikitech-l@Wikipedia.org http://mail.wikipedia.org/mailman/listinfo/wikitech-l
Lightning wrote:
On Jun 6, 2004, at 7:00 PM, Brion Vibber wrote:
We should be back online sometime tonight where "tonight" is defined as "before I get any sleep".
I just want to take a moment to say thanks. That kind of dedication is much appreciated. Wikipedia would never have become what it is today without your help or this kind of dedication. It's easy for people to say "X is broken, *somebody* fix it", and it seems you are more often than not that *somebody*.
Thanks for the compliment, but this long downtime and heroics are totally unnecessary. If we'd managed to keep the database slave server running and in sync we should have been back online within an hour.
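As an illustration of what "running and in sync" could mean as an automated check, here is a rough Python sketch. It assumes the caller has already fetched one row of `SHOW SLAVE STATUS` as a dictionary; the column names `Slave_IO_Running`, `Slave_SQL_Running` and `Seconds_Behind_Master` are MySQL's, but whether the server version in use here exposed all of them is not stated in this thread. The function name and the 60-second threshold are made up for the example.

```python
def slave_problems(status, max_lag_seconds=60):
    """Return a list of reasons the slave should not be trusted for failover.

    `status` is one row of SHOW SLAVE STATUS as a dict (assumed input shape).
    An empty list means both replication threads run and lag is acceptable.
    """
    problems = []
    # Both the I/O thread and the SQL thread must report "Yes".
    for thread in ("Slave_IO_Running", "Slave_SQL_Running"):
        if status.get(thread) != "Yes":
            problems.append(f"{thread} is {status.get(thread)!r}")
    # A NULL or missing lag value means the lag cannot be determined.
    lag = status.get("Seconds_Behind_Master")
    if lag is None:
        problems.append("replication lag unknown")
    elif int(lag) > max_lag_seconds:
        problems.append(f"{int(lag)}s behind master")
    return problems
```

Run from cron with an alert on any non-empty result, a check along these lines would flag a stalled slave long before it is needed.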
Spare-time off-site volunteer labor is fine as far as it goes, but managing a server farm on a high-traffic dynamic web site is hard to do remotely.
-- brion vibber (brion @ pobox.com)
That took a lot longer than I'd hoped, but we're back online.
Damaged pages are five Rambot entries:
http://en.wikipedia.org/wiki/Tazewell_County,_Virginia
http://en.wikipedia.org/wiki/Sussex_County,_Virginia
http://en.wikipedia.org/wiki/Surry_County,_Virginia
http://en.wikipedia.org/wiki/Suffolk,_Virginia
http://en.wikipedia.org/wiki/Staunton,_Virginia
They can be restored from backup if anyone wants to.
imagelinks tables have been cleared as we forgot to add them to the backup job. There were problems in them to begin with, so we should regenerate them anyway when the opportunity arises.
archive tables were not backed up. If there is great desire for them it may be possible to retrieve them from the old database files.
Backup dump is available, though enwiki needs some tweaking so don't download it quite yet.
We should be able to set up a slave server and *keep the fricking' thing in sync* this time. I hope. sigh.
Ariel has crashed a couple times in the last day so it's not reliable at present. Might be hardware, might be software. Needs testing. Will probably use another machine as a slave.
-- brion vibber (brion @ pobox.com)
Why don't we take this occasion to make a clean switch to Ariel? Yesterday it was so terrible to edit articles (before the crash) that I would prefer 24h of downtime to N days of extreme slowness.
Aoineko
----- Original Message ----- From: "Erik Moeller" erik_moeller@gmx.de To: wikitech-l@wikimedia.org Sent: Monday, June 07, 2004 5:17 AM Subject: [Wikitech-l] Big crash 6/6
Guillaume Blanchard wrote:
Why don't we take this occasion to make a clean switch to Ariel? Yesterday it was so terrible to edit articles (before the crash) that I would prefer 24h of downtime to N days of extreme slowness.
Right now no one can log in on ariel.
-- brion vibber (brion @ pobox.com)
On 6 Jun 2004, Erik Moeller wrote:
Whether it was the kill -9 that led to the corruption, or whether the database was already corrupted and therefore did not respond, we do not know; in any case, there seems to have been no alternative to killing it (people on #mysql knew nothing either).
...
Note: I am not pointing fingers.
Looking at my IRC log (which contains 4000 lines from 6/6 alone), it appears Tim started a second mysqld on the running database machine. In the future, do not attempt this. It takes a lot of setup to prevent them from stepping on each other, and I don't know how much isolation there was in this case. (I know, because I've run v3, v4.0, and v4.1 MySQLs on the same machine, along with half a dozen other databases :-) It's easier if you've compiled everything yourself.)
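To make "stepping on each other" concrete, here is a small Python sketch of the kind of pre-flight check that could be done before starting a second server. The option names (`datadir`, `port`, `socket`, `pid-file`) are real mysqld options; the helper function and the example paths are hypothetical, and nothing here reflects how the machines in this thread were actually configured.

```python
# Resources that two mysqld processes on one host must not share.
EXCLUSIVE_OPTIONS = ("datadir", "port", "socket", "pid-file")


def shared_resources(instance_a, instance_b):
    """Return the exclusive options on which two instance configs collide."""
    clashes = []
    for opt in EXCLUSIVE_OPTIONS:
        # An option left unset on both sides falls back to the same built-in
        # default, so two missing values are counted as a clash as well.
        if instance_a.get(opt) == instance_b.get(opt):
            clashes.append(opt)
    return clashes


# Hypothetical example: the second instance got its own port, socket and
# pid file, but still points at the live data directory.
live = {"datadir": "/var/lib/mysql", "port": 3306,
        "socket": "/tmp/mysql.sock", "pid-file": "/var/lib/mysql/live.pid"}
second = {"datadir": "/var/lib/mysql", "port": 3307,
          "socket": "/tmp/mysql2.sock", "pid-file": "/var/lib/mysql/second.pid"}
print(shared_resources(live, second))  # prints ['datadir']
```

A shared data directory is the dangerous case, since both servers would then be writing to the same table files. Note that this sketch only compares explicit values; it would miss an instance that sets an option explicitly to the other's default.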
--Ricky
wikitech-l@lists.wikimedia.org