Hey,
Is there something wrong with the wikis? I was trying to do some writing on ang.wikibooks.org and ang.wiktionary.org, and they don't work. Are they down right now, or did something else happen?
Thanks,
James
James R. Johnson wrote:
> Is there something wrong with the wikis? I was trying to do some writing
> on ang.wikibooks.org and ang.wiktionary.org, and they don't work. Are they
> down right now, or did something else happen?
There was some sort of power failure at the colocation facility. We're in the process of rebooting and recovering machines.
-- brion vibber (brion @ pobox.com)
Brion Vibber wrote:
> James R. Johnson wrote:
>> Is there something wrong with the wikis? I was trying to do some writing
>> on ang.wikibooks.org and ang.wiktionary.org, and they don't work. Are
>> they down right now, or did something else happen?
> There was some sort of power failure at the colocation facility. We're in
> the process of rebooting and recovering machines.
The power failure was due to circuit breakers being tripped within the colocation facility; some of our servers have redundant power supplies but *both* circuits failed, causing all our machines and the network switch to unceremoniously shut down.
Whether due to a problem in MySQL, our server configurations, or the hardware (or some combination thereof), most of our database servers managed to glitch the data on disk when they went down. (Yes, we use InnoDB tables. This ain't good enough, apparently.)
The good news: one server maintained a good copy, which we've been copying to the others to get things back on track. We're now serving all wikis read-only.
The bad news: that copy was a bit over a day behind synchronization (it was stopped to run maintenance jobs), so in addition to slogging around 170 GB of data to each DB server we have to apply the last day's update logs before we can restore read/write service.
I don't know when exactly we'll have everything editable again, but it should be within 12 hours.
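For the curious, the replay step boils down to something like the sketch below. This is not our actual script; the log file and database names are invented, and it assumes a mysqlbinlog new enough to support --stop-datetime:

import subprocess

# Hypothetical update log files and database name, for illustration only.
BINLOGS = ["binlog.000042", "binlog.000043"]
CRASH_TIME = "2005-02-21 22:15:00"  # stop replaying just before the crash

for log in BINLOGS:
    # mysqlbinlog prints the logged SQL statements; pipe them into the
    # mysql client to re-apply them to the recovered copy.
    dump = subprocess.Popen(
        ["mysqlbinlog", "--stop-datetime=" + CRASH_TIME, log],
        stdout=subprocess.PIPE)
    subprocess.check_call(["mysql", "-u", "root", "wikidb"],
                          stdin=dump.stdout)
    dump.stdout.close()
    dump.wait()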
-- brion vibber (brion @ pobox.com)
On Tue, 22 Feb 2005 13:48:17 +0100 (CET), Brion Vibber wrote:
> The bad news: that copy was a bit over a day behind synchronization (it
> was stopped to run maintenance jobs), so in addition to slogging around
> 170 GB of data to each DB server we have to apply the last day's update
> logs before we can restore read/write service.
> I don't know when exactly we'll have everything editable again, but it
> should be within 12 hours.
Will any changes be lost?
regards, Gerrit Holl.
Gerrit Holl wrote:
> Will any changes be lost?
As far as we know, no: no changes should be lost (except potentially a handful at the very end).
Update logs are still replaying, but we're up to 42 minutes prior to the crash on one machine and still going. I don't expect problems.
-- brion vibber (brion @ pobox.com)
Brion Vibber wrote:
> Update logs are still replaying, but we're up to 42 minutes prior to the
> crash on one machine and still going. I don't expect problems.
With two servers fully recovered we've got the wikis up for read-write access; editing is open. Total time from crash to restoring edit service was about 24 hours, 10 minutes. Sigh.
Some special pages (including contribs and watchlist) are off for the moment to reduce server load until we have more machines up. Some things remain a little wonky.
-- brion vibber (brion @ pobox.com)
Brion Vibber said:
> With two servers fully recovered we've got the wikis up for read-write
> access; editing is open. Total time from crash to restoring edit service
> was about 24 hours, 10 minutes. Sigh.
> Some special pages (including contribs and watchlist) are off for the
> moment to reduce server load until we have more machines up. Some things
> remain a little wonky.
Interesting discussion on Slashdot about the relative recoverability of PostgreSQL. If we stay with open-source DBMSes, perhaps at least some of the database servers should be running alternative software.
Tony Sidaway wrote:
> Interesting discussion on Slashdot about the relative recoverability of
> PostgreSQL. If we stay with open-source DBMSes, perhaps at least some of
> the database servers should be running alternative software.
At this moment the current colo is a single point of failure. The French squids will not work if the database is not available. If you want 100% uptime, you need the complete software stack elsewhere to allow for a failover, and that is something we do not do yet. Even when we have full redundancy, we will still have problems; they will just be different problems.
So maybe PostgreSQL is better at recovery, but it is not the whole solution: at best it would solve one problem and create others.
Thanks, GerardM
Gerard Meijssen said:
> So maybe PostgreSQL is better at recovery, but it is not the whole
> solution: at best it would solve one problem and create others.
Yes, it may do. But clearly recovery could be much better. Modern database systems, even very large ones, should not become completely corrupted simply because the power fails. It seems to have been a matter of luck that one uncorrupted system existed.
Tony Sidaway wrote in gmane.science.linguistics.wikipedia.misc:
> Interesting discussion on Slashdot about the relative recoverability of
> PostgreSQL. If we stay with open-source DBMSes, perhaps at least some of
> the database servers should be running alternative software.
it's probably a little early to blame mysql. most of livejournal's problems, when the same thing happened to them, were caused by problems with their RAID controllers rather than mysql itself.
kate.
Kate Turner said:
> it's probably a little early to blame mysql. most of livejournal's
> problems, when the same thing happened to them, were caused by problems
> with their RAID controllers rather than mysql itself.
Thanks for the background. And thanks for your hard work in getting our Wiki back up and running.
Dear All,
kate wrote:
> it's probably a little early to blame mysql. most of livejournal's
> problems, when the same thing happened to them, were caused by problems
> with their RAID controllers rather than mysql itself.
I know that "blame MySQL" is a stereotype and somewhat unhelpful advice.
I also know that a complete second failover location would make the point somewhat irrelevant, and may be the "sort of final" solution.
But I'm wondering whether Firebird SQL would be a replacement candidate. Firebird is supposed to be free of this problem; its developers are discussing the Wikipedia blackout, as well as Firebird's low visibility and ways to improve it. Perhaps to the point where the Firebird developers would be willing to contribute to integration efforts and add features and fine-tuning to Firebird SQL?
Regards, Peter Jacobi
Tony Sidaway wrote:
> Interesting discussion on Slashdot about the relative recoverability of
> PostgreSQL. If we stay with open-source DBMSes, perhaps at least some of
> the database servers should be running alternative software.
Kudos to the developers for their heroic efforts in bringing everything back from what threatened to be a serious data loss.
Regarding using different databases: I agree, diversity is good. However, I should point out that I have destroyed a PostgreSQL database on one occasion by power-cycling a machine (it was running VACUUM at the time). This makes me sceptical about relying on software diversity alone, particularly in the face of crude threats such as power loss, fire, tornadoes and flood.
The value of the Wikipedia data is now great enough that it is worth putting a formal disaster recovery plan in place.
A good idea in the short term might be to keep a slave database or two offsite, so that they are unlikely to crash at the same time as the central site. Note that online slaves are not a 100% solution to data corruption, as they will faithfully mirror any corruption which accumulates from causes other than database failure.
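To make that concrete, here is a sketch of how an offsite slave could be polled for replication lag. The host name and credentials are made up, and it assumes the MySQLdb Python module and a slave with replication already configured:

import MySQLdb
from MySQLdb.cursors import DictCursor

# Hypothetical offsite replica; host and credentials are placeholders.
conn = MySQLdb.connect(host="offsite-slave.example.org",
                       user="monitor", passwd="secret")
cur = conn.cursor(DictCursor)
cur.execute("SHOW SLAVE STATUS")
row = cur.fetchone()
if row is None or row.get("Seconds_Behind_Master") is None:
    # No row, or a NULL lag value, means replication is not running.
    print("WARNING: replication is not running on the offsite slave")
else:
    print("offsite slave is %s seconds behind the master"
          % row["Seconds_Behind_Master"])
conn.close()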
This emphasizes the importance of taking and saving snapshot dumps. At the moment, keeping off-site dumps is done on an ad hoc basis by volunteers. This should certainly be formalized to include the automatic creation and archiving of dumps off-site, in addition to running offsite slave databases. At a data rate of only 10 Mbit/s, a 170 GB backup would take only about 38 hours to move offsite. At these sorts of rates, monthly backups could be lodged with any of a number of mirror services; perhaps organizations such as universities, the UK Mirror Service, and the Internet Archive might be interested in doing this?
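The arithmetic behind that 38-hour figure, as a quick sanity check:

# Back-of-the-envelope transfer time for the figures quoted above.
size_gb = 170                    # snapshot size in gigabytes
rate_mbit = 10                   # link speed in megabits per second
total_mbit = size_gb * 1000 * 8  # gigabytes -> megabits
hours = total_mbit / rate_mbit / 3600.0
print("%.1f hours" % hours)      # prints "37.8 hours", i.e. about 38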
The current worst case would be physical destruction of the servers at the Florida colo; the data is both priceless and uninsurable, but is the server farm insured against this sort of event?
-- Neil
On Tue, 22 Feb 2005, Brion Vibber wrote:
> As far as we know, no: no changes should be lost (except potentially a
> handful at the very end).
> Update logs are still replaying, but we're up to 42 minutes prior to the
> crash on one machine and still going. I don't expect problems.
Well done guys and girls. I think some people got little sleep last night.
Rob
On Tue, 22 Feb 2005 04:47:56 -0800, Brion Vibber <brion@pobox.com> wrote:
> The power failure was due to circuit breakers being tripped within the
> colocation facility; some of our servers have redundant power supplies but
> *both* circuits failed, causing all our machines and the network switch to
> unceremoniously shut down.
That's pretty much nightmare #1 for anyone operating in a colocation facility. I know if this kind of thing happened to me, I'd have management and customers on my back immediately, asking when we were moving datacenters to someone else ...
Nothing like a real failure to show you how truly redundant (or not) the systems actually are. Was it equipment failure or human failure? Either way, it sounds like your redundant power circuits were routed through the same circuit breaker cabinet or both got shorted out by the same issue. Not good.
Of all MySQL's faults, corruption on ungraceful shutdown is one of the worst. I've had similar incidents on Oracle dozens of times and never had to restore from backups.
-Matt