Brion Vibber wrote:
James R. Johnson wrote:
> Is there something wrong with the wikis? I was trying to do some writing on ang.wikibooks.org and ang.wiktionary.org, and they don't work. Are they down right now, or did something else happen?
There was some sort of power failure at the colocation facility. We're in the process of rebooting and recovering machines.
The power failure was due to circuit breakers being tripped within the colocation facility; some of our servers have redundant power supplies but *both* circuits failed, causing all our machines and the network switch to unceremoniously shut down.
Whether a problem in MySQL, with our server configurations, or with the hardware (or some combination thereof), most of our database servers managed to glitch the data on disk when they went down. (Yes, we use InnoDB tables. This ain't good enough, apparently.)
The good news: one server maintained a good copy, which we've been copying to the others to get things back on track. We're now serving all wikis read-only.
The bad news: that copy was a bit over a day behind synchronization (it was stopped to run maintenance jobs), so in addition to slogging around 170 GB of data to each DB server we have to apply the last day's update logs before we can restore read/write service.
I don't know when exactly we'll have everything editable again, but it should be within 12 hours.
-- brion vibber (brion @ pobox.com)
Brion Vibber wrote:
Date: Tue, 22 Feb 2005 13:48:17 +0100 (CET)
> The bad news: that copy was a bit over a day behind synchronization (it was stopped to run maintenance jobs), so in addition to slogging around 170 GB of data to each DB server we have to apply the last day's update logs before we can restore read/write service.
> I don't know when exactly we'll have everything editable again, but it should be within 12 hours.
Will any changes be lost?
regards, Gerrit Holl.
Gerrit Holl wrote:
> Brion Vibber wrote:
> Date: Tue, 22 Feb 2005 13:48:17 +0100 (CET)
>> The bad news: that copy was a bit over a day behind synchronization (it was stopped to run maintenance jobs), so in addition to slogging around 170 GB of data to each DB server we have to apply the last day's update logs before we can restore read/write service.
>> I don't know when exactly we'll have everything editable again, but it should be within 12 hours.
> Will any changes be lost?
As far as we know, no changes should be lost (except potentially a handful at the very end).
Update logs are still replaying, but we're up to 42 minutes prior to the crash on one machine and still going. I don't expect problems.
-- brion vibber (brion @ pobox.com)
On Tue, 22 Feb 2005, Brion Vibber wrote:
> Update logs are still replaying, but we're up to 42 minutes prior to the crash on one machine and still going. I don't expect problems.
> -- brion vibber (brion @ pobox.com)
One can watch the logs replaying by checking the "Recent changes" page, just like ordinary activity :-) Right now it shows yesterday's 23:13 UTC, about 23 hours ago. I suppose the last change displayed depends on which server one hits.
Alfio
Brion Vibber wrote:
> Update logs are still replaying, but we're up to 42 minutes prior to the crash on one machine and still going. I don't expect problems.
With two servers fully recovered we've got the wikis up for read-write access; editing is open. Total time from crash to restoring edit service was about 24 hours, 10 minutes. Sigh.
Some special pages (including contribs and watchlist) are off for the moment to reduce server load until we have more machines up. Some things remain a little wonky.
-- brion vibber (brion @ pobox.com)
Surely you're aware of this error:
<error message> Warning: file(/home/wikipedia/common/all.dblist): failed to open stream: Stale NFS file handle in /usr/local/apache/common-local/php-1.4/InitialiseSettings.php on line 9
Warning: array_map(): Argument #2 should be an array in /usr/local/apache/common-local/php-1.4/InitialiseSettings.php on line 9
Warning: Invalid argument supplied for foreach() in /usr/local/apache/common-local/php-1.4/includes/SiteConfiguration.php on line 54
Wiki does not exist
From Meta, a wiki about Wikimedia
This domain (en.wikipedia.org) has been reserved for the Wikipedia in the English language. Would you like this wiki to be created? </error message>
It even gives a nice "create wiki" button!
I get this error for a reasonably high percentage of page views.
On Tue, 22 Feb 2005 15:20:54 -0800, Brion Vibber <brion@pobox.com> wrote:
> Some things remain a little wonky.
David Benbennick wrote:
> Surely you're aware of this error:
> <error message> Warning: file(/home/wikipedia/common/all.dblist): failed to open stream: Stale NFS file handle in /usr/local/apache/common-local/php-1.4/InitialiseSettings.php on line 9
I'm not seeing this, and all apaches that are up seem able to stat that file. Can you confirm, and include the exact URLs you're trying?
-- brion vibber (brion @ pobox.com)
On Wednesday 23 February 2005 00:20, Brion Vibber wrote:
> Brion Vibber wrote:
>> Update logs are still replaying, but we're up to 42 minutes prior to the crash on one machine and still going. I don't expect problems.
> With two servers fully recovered we've got the wikis up for read-write access; editing is open. Total time from crash to restoring edit service was about 24 hours, 10 minutes. Sigh.
> Some special pages (including contribs and watchlist) are off for the moment to reduce server load until we have more machines up. Some things remain a little wonky.
> -- brion vibber (brion @ pobox.com)
Thanks a lot for your work!
Yann
On Tue, 22 Feb 2005, Brion Vibber wrote:
> As far as we know, no changes should be lost (except potentially a handful at the very end).
> Update logs are still replaying, but we're up to 42 minutes prior to the crash on one machine and still going. I don't expect problems.
Well done guys and girls. I think some people got little sleep last night.
Rob
On Tue, 22 Feb 2005 04:47:56 -0800, Brion Vibber <brion@pobox.com> wrote:
> The power failure was due to circuit breakers being tripped within the colocation facility; some of our servers have redundant power supplies but *both* circuits failed, causing all our machines and the network switch to unceremoniously shut down.
That's pretty much nightmare #1 for anyone operating in a colocation facility. I know if this kind of thing happened to me, I'd have management and customers on my back immediately, asking when we were moving datacenters to someone else ...
Nothing like a real failure to show you how truly redundant (or not) the systems actually are. Was it equipment failure or human failure? Either way, it sounds like your redundant power circuits were routed through the same circuit breaker cabinet or both got shorted out by the same issue. Not good.
Of all MySQL's faults, corruption on ungraceful shutdown is one of the worst. I've had similar incidents on Oracle dozens of times and never had to restore from backups.
-Matt
wikitech-l@lists.wikimedia.org