Re: [Wikipedia-l] Wiki Problems?

Brion Vibber

22 Feb 2005

1:47 p.m.

Brion Vibber wrote:

James R. Johnson wrote:

Is there something wrong with the wikis? I was trying to do some writing on ang.wikibooks.org and ang.wiktionary.org, and they don't work. Are they down right now, or did something else happen?

There was some sort of power failure at the colocation facility. We're in the process of rebooting and recovering machines.

The power failure was due to circuit breakers being tripped within the colocation facility; some of our servers have redundant power supplies but *both* circuits failed, causing all our machines and the network switch to unceremoniously shut down.

Whether a problem in MySQL, with our server configurations, or with the hardware (or some combination thereof), most of our database servers managed to glitch the data on disk when they went down. (Yes, we use InnoDB tables. This ain't good enough, apparently.)

The good news: one server maintained a good copy, which we've been copying to the others to get things back on track. We're now serving all wikis read-only.

The bad news: that copy was a bit over a day behind synchronization (it was stopped to run maintenance jobs), so in addition to slogging around 170 GB of data to each DB server we have to apply the last day's update logs before we can restore read/write service.

I don't know when exactly we'll have everything editable again, but it should be within 12 hours.
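As a rough back-of-the-envelope sketch of the estimate above: the time to restore one server is roughly the time to copy the 170 GB snapshot plus the time to replay the missing update logs. The throughput, backlog length, and replay speed below are illustrative assumptions, not measurements from this incident, and the function names are hypothetical.

```python
def copy_hours(size_gb, throughput_mb_per_s):
    """Hours to copy size_gb gigabytes at a sustained rate in MB/s (1 GB = 1000 MB)."""
    return size_gb * 1000 / throughput_mb_per_s / 3600


def replay_hours(backlog_hours, speedup):
    """Hours to replay a backlog of update logs at `speedup` times real time."""
    return backlog_hours / speedup


# Assumed figures: 30 MB/s sustained copy, 25 hours of logs ("a bit over a day"),
# and replay running at 5x real time. None of these come from the thread.
total = copy_hours(170, 30) + replay_hours(25, 5)
print(f"estimated recovery per server: {total:.1f} hours")
```

With different assumed rates the total shifts accordingly; the point is only that copy time and replay time add up per server.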

-- brion vibber (brion @ pobox.com)

Gerrit Holl

22 Feb

6:49 p.m.

New subject: [Wikipedia-l] Wiki Problems?

Brion Vibber wrote:

Date: Tue, 22 Feb 2005 13:48:17 +0100 (CET)

The bad news: that copy was a bit over a day behind synchronization (it was stopped to run maintenance jobs), so in addition to slogging around 170 GB of data to each DB server we have to apply the last day's update logs before we can restore read/write service.

I don't know when exactly we'll have everything editable again, but it should be within 12 hours.

Will any changes be lost?

regards, Gerrit Holl.

-- Weather in Twenthe, Netherlands 22/02 18:25: -1.0°C light snow showers; Towering cumulus clouds observed partly cloudy wind 3.1 m/s NE (57 m above NAP) -- In the councils of government, we must guard against the acquisition of unwarranted influence, whether sought or unsought, by the military-industrial complex. The potential for the disastrous rise of misplaced power exists and will persist. -Dwight David Eisenhower, January 17, 1961

Brion Vibber

7:19 p.m.

New subject: [Wikipedia-l] Wiki Problems?

Gerrit Holl wrote:

Brion Vibber wrote:

Date: Tue, 22 Feb 2005 13:48:17 +0100 (CET)

The bad news: that copy was a bit over a day behind synchronization (it was stopped to run maintenance jobs), so in addition to slogging around 170 GB of data to each DB server we have to apply the last day's update logs before we can restore read/write service.

I don't know when exactly we'll have everything editable again, but it should be within 12 hours.

Will any changes be lost?

As far as we know, no changes should be lost (except potentially a handful at the very end).

Update logs are still replaying, but we're up to 42 minutes prior to the crash on one machine and still going. I don't expect problems.

-- brion vibber (brion @ pobox.com)

Alfio Puglisi

10:30 p.m.

New subject: [Wikipedia-l] Wiki Problems?

On Tue, 22 Feb 2005, Brion Vibber wrote:

Update logs are still replaying, but we're up to 42 minutes prior to the crash on one machine and still going. I don't expect problems.

-- brion vibber (brion @ pobox.com)

One can watch the logs replaying by checking the "Recent changes" page, just like ordinary activity :-) On it: we are at yesterday's 23:13 UTC, about 23 hours ago. I suppose the last change displayed depends on which server one hits.

Alfio

Brion Vibber

23 Feb

12:20 a.m.

New subject: [Wikipedia-l] Wiki Problems?

Brion Vibber wrote:

Update logs are still replaying, but we're up to 42 minutes prior to the crash on one machine and still going. I don't expect problems.

With two servers fully recovered we've got the wikis up for read-write access; editing is open. Total time from crash to restoring edit service was about 24 hours, 10 minutes. Sigh.

Some special pages (including contribs and watchlist) are off for the moment to reduce server load until we have more machines up. Some things remain a little wonky.

-- brion vibber (brion @ pobox.com)

David Benbennick

1:04 a.m.

New subject: [Wikipedia-l] Wiki Problems?

Surely you're aware of this error:

<error message>
Warning: file(/home/wikipedia/common/all.dblist): failed to open stream: Stale NFS file handle in /usr/local/apache/common-local/php-1.4/InitialiseSettings.php on line 9

Warning: array_map(): Argument #2 should be an array in /usr/local/apache/common-local/php-1.4/InitialiseSettings.php on line 9

Warning: Invalid argument supplied for foreach() in /usr/local/apache/common-local/php-1.4/includes/SiteConfiguration.php on line 54

Wiki does not exist

...

From Meta, a wiki about Wikimedia

This domain (en.wikipedia.org) has been reserved for the Wikipedia in the English language. Would you like this wiki to be created?
</error message>

It even gives a nice "create wiki" button!

I get this error for a reasonably high percentage of page views.
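The warnings above read like a chain: the list file could not be opened, the failed result was passed on as if it were a list, and the lookup then concluded that the wiki does not exist. A minimal sketch of the defensive alternative follows, in Python rather than the site's PHP. It assumes the list file holds one database name per line; `load_dblist`, `wiki_exists`, and any database names passed to them are hypothetical and are not taken from the MediaWiki code.

```python
import errno


def load_dblist(path):
    """Return the database names listed in `path`, or None if it cannot be read."""
    try:
        with open(path, encoding="utf-8") as handle:
            # Skip blank lines; keep one name per non-empty line.
            return [line.strip() for line in handle if line.strip()]
    except OSError as exc:
        # A stale NFS handle surfaces as an OSError whose errno is ESTALE.
        kind = "stale NFS handle" if exc.errno == errno.ESTALE else "I/O error"
        print(f"cannot read {path}: {kind} ({exc.strerror})")
        return None


def wiki_exists(db_name, dblist):
    """Keep 'not in the list' distinct from 'the list could not be read'."""
    if dblist is None:
        raise RuntimeError("wiki list unavailable; refusing to report a missing wiki")
    return db_name in dblist
```

The design point is that an unreadable list yields an explicit error instead of an empty result, so a transient file-system fault cannot turn into a "create this wiki" page.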

On Tue, 22 Feb 2005 15:20:54 -0800, Brion Vibber brion@pobox.com wrote:

Brion Vibber wrote:

Some things remain a little wonky.

Brion Vibber

1:24 a.m.

New subject: [Wikipedia-l] Wiki Problems?

David Benbennick wrote:

Surely you're aware of this error:

<error message> Warning: file(/home/wikipedia/common/all.dblist): failed to open stream: Stale NFS file handle in /usr/local/apache/common-local/php-1.4/InitialiseSettings.php on line 9

I'm not seeing this, and all apaches that are up seem able to stat that file. Can you confirm, and include the exact URLs you're trying?

-- brion vibber (brion @ pobox.com)

Yann Forget

10:30 a.m.

New subject: [Wikipedia-l] Wiki Problems?

On Wednesday 23 February 2005 00:20, Brion Vibber wrote:

Brion Vibber wrote:

Update logs are still replaying, but we're up to 42 minutes prior to the crash on one machine and still going. I don't expect problems.

With two servers fully recovered we've got the wikis up for read-write access; editing is open. Total time from crash to restoring edit service was about 24 hours, 10 minutes. Sigh.

Some special pages (including contribs and watchlist) are off for the moment to reduce server load until we have more machines up. Some things remain a little wonky.

-- brion vibber (brion @ pobox.com)

Thanks a lot for your work!

Yann

--
http://www.non-violence.org/ | Collaborative site on non-violence
http://www.forget-me.net/ | Alternatives on the Net
http://fr.wikipedia.org/ | Free encyclopedia
http://www.forget-me.net/pro/ | Linux training and services

Robert Brockway

6:12 a.m.

New subject: [Wikipedia-l] Wiki Problems?

On Tue, 22 Feb 2005, Brion Vibber wrote:

As far as we know, no changes should be lost (except potentially a handful at the very end).

Update logs are still replaying, but we're up to 42 minutes prior to the crash on one machine and still going. I don't expect problems.

Well done guys and girls. I think some people got little sleep last night.

Rob

-- Robert Brockway B.Sc. Senior Technical Consultant, OpenTrend Solutions Ltd. Phone: 416-669-3073 Email: rbrockway@opentrend.net http://www.opentrend.net OpenTrend Solutions: Reliable, secure solutions to real world problems. Contributing Member of Software in the Public Interest (http://www.spi-inc.org)

Matt Brown

22 Feb

11:29 p.m.

New subject: [Wikipedia-l] Wiki Problems?

On Tue, 22 Feb 2005 04:47:56 -0800, Brion Vibber brion@pobox.com wrote:

The power failure was due to circuit breakers being tripped within the colocation facility; some of our servers have redundant power supplies but *both* circuits failed, causing all our machines and the network switch to unceremoniously shut down.

That's pretty much nightmare #1 for anyone operating in a colocation facility. I know if this kind of thing happened to me, I'd have management and customers on my back immediately, asking when we were moving datacenters to someone else ...

Nothing like a real failure to show you how truly redundant (or not) the systems actually are. Was it equipment failure or human failure? Either way, it sounds like your redundant power circuits were routed through the same circuit breaker cabinet or both got shorted out by the same issue. Not good.

Of all MySQL's faults, corruption on ungraceful shutdown is one of the worst. I've had similar incidents on Oracle dozens of times and never had to restore from backups.

-Matt

wikitech-l@lists.wikimedia.org
