Tony Sidaway wrote:
Brion Vibber said:
Brion Vibber wrote:
Update logs are still replaying, but we're up to 42 minutes prior to the crash on one machine and still going. I don't expect problems.
With two servers fully recovered we've got the wikis up for read-write access; editing is open. Total time from crash to restoring edit service was about 24 hours, 10 minutes. Sigh.
Some special pages (including contribs and watchlist) are off for the moment to reduce server load until we have more machines up. Some things remain a little wonky.
Interesting discussion on Slashdot about the relative recoverability of PostgreSQL. If we stay with an open source DBMS, perhaps at least some of the database servers should be running alternative software.
Kudos to the developers for their heroic efforts in bringing everything back from what threatened to be a serious data loss.
Regarding using different databases: I agree, diversity is good. However, I should point out that I have destroyed a PostgreSQL database on one occasion by power-cycling a machine (it was running VACUUM at the time). This makes me sceptical about relying on software diversity alone, particularly in the face of crude threats such as power loss, fire, tornadoes and flood.
The value of the Wikipedia data is now big enough that it is worth putting a formal disaster recovery plan in place.
A good idea in the short term might be to keep a slave database or two offsite, so that they are unlikely to crash at the same time as the central site. Note that online slaves are not a 100% solution to data corruption, as they will faithfully mirror any corruption which accumulates from causes other than database failure.
This emphasizes the importance of taking and saving snapshot dumps. At the moment, keeping off-site dumps is done on an ad-hoc basis by volunteers. This should certainly be formalized to include the automatic creation and archiving of dumps off-site, in addition to running offsite slave databases. Even at a data rate of only 10 Mbit/s, a 170 GB backup would take only about 38 hours to move offsite. At these sorts of rates, monthly backups could be lodged with any of a number of mirror services. Perhaps organizations such as universities, the UK Mirror Service and the Internet Archive might be interested in doing this?
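For anyone who wants to check the transfer-time arithmetic, here is a rough sketch in Python. It assumes decimal units (1 GB = 10^9 bytes, 1 Mbit = 10^6 bits) and a link that sustains the full rate, with no protocol overhead or compression. The function name is only mine, for illustration.

```python
def transfer_hours(size_gb: float, rate_mbit_s: float) -> float:
    """Hours needed to move size_gb gigabytes over a link of rate_mbit_s.

    Assumes decimal units (1 GB = 1e9 bytes, 1 Mbit = 1e6 bits) and that
    the link sustains the full rate for the whole transfer.
    """
    bits = size_gb * 1e9 * 8                  # payload size in bits
    seconds = bits / (rate_mbit_s * 1e6)      # time at the sustained rate
    return seconds / 3600


if __name__ == "__main__":
    # The 170 GB dump over a 10 Mbit/s link discussed above.
    print(round(transfer_hours(170, 10), 1))  # 37.8
```

So roughly 38 hours per full dump, which leaves plenty of room for a monthly schedule even on a modest link.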
The current worst case would be physical destruction of the servers at the Florida colo; the data is both priceless and uninsurable, but is the server farm insured against this sort of event?
-- Neil
wikitech-l@lists.wikimedia.org