Re: [Wikipedia-l] Wiki Problems? - Wikitech-l

23 Feb 2005


      Tony Sidaway wrote:
...
Brion Vibber said:
...
Brion Vibber wrote:
...
Update logs are still replaying, but we're up to 42 minutes prior to
the crash on one machine and still going. I don't expect problems.
With two servers fully recovered we've got the wikis up for read-write
access; editing is open. Total time from crash to restoring edit
service was about 24 hours, 10 minutes. Sigh.
Some special pages (including contribs and watchlist) are off for the
moment to reduce server load until we have more machines up. Some
things remain a little wonky.
Interesting discussion on Slashdot about the relative recoverability of
PostgreSQL. If we stay with an open source DBMS, perhaps at least some of
the database servers should be running alternative software.
Kudos to the developers for their heroic efforts in bringing everything
back from what threatened to be a serious data loss.
Regarding using different databases: I agree, diversity is good.
However, I should point out that I have destroyed a PostgreSQL database
on one occasion by power-cycling a machine (it was running VACUUM at the
time). This makes me sceptical about relying on software diversity
alone, particularly in the face of crude threats such as power loss,
fire, tornadoes and flood.
The value of the Wikipedia data is now big enough that it is worth putting
a formal disaster recovery plan in place.
A good idea in the short term might be to keep a slave database or two
offsite, so that they are unlikely to crash at the same time as the
central site. Note that online slaves are not a 100% solution to data
corruption, as they will faithfully mirror any corruption which
accumulates from causes other than database failure.
This emphasizes the importance of taking and saving snapshot dumps. At
the moment, keeping off-site dumps is done on an ad-hoc basis by
volunteers.  This should certainly be formalized to include the
automatic creation and archiving of dumps off-site, in addition to
running offsite slave databases.  At a data rate of only 10 Mbit/s, a
170 GB backup would take only about 38 hours to move offsite. At these
sorts of rates, monthly backups could be lodged with any of a number of
mirror services. Perhaps organizations such as universities, the UK
Mirror Service and the Internet Archive might be interested in doing this?
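
As a rough check on the transfer-time figure, here is a minimal Python
sketch. It assumes decimal units (1 GB = 10^9 bytes, 1 Mbit = 10^6 bits)
and a link that sustains its full rate with no protocol overhead; the
function name transfer_hours is purely illustrative.

```python
def transfer_hours(size_gb: float, rate_mbit_s: float) -> float:
    """Hours needed to move size_gb gigabytes over a rate_mbit_s link.

    Assumptions (not from any real measurement): decimal units,
    a fully sustained link rate, and no protocol overhead.
    """
    bits = size_gb * 1e9 * 8                # gigabytes -> bits
    seconds = bits / (rate_mbit_s * 1e6)    # bits / (bits per second)
    return seconds / 3600                   # seconds -> hours


if __name__ == "__main__":
    hours = transfer_hours(170, 10)
    print(f"{hours:.1f} hours")  # prints "37.8 hours"
```

In practice the real time would be somewhat longer, since a real link
rarely sustains its nominal rate, but the dump still fits comfortably
inside a monthly cycle.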
The current worst case would be physical destruction of the servers at
the Florida colo; the data is both priceless and uninsurable, but is the
server farm insured against this sort of event?
-- Neil