[Foundation-l] Request: WMF commitment as a long term cultural archive?
Neil Harris
neil at tonal.clara.co.uk
Thu Jun 2 23:11:35 UTC 2011
On 02/06/11 19:52, George Herbert wrote:
> On Thu, Jun 2, 2011 at 10:55 AM, David Gerard <dgerard at gmail.com> wrote:
>> On 2 June 2011 18:48, Fae <faenwp at gmail.com> wrote:
>>
>>> In 2016 San Francisco has a major earthquake and the servers and
>>> operational facilities for the WMF are damaged beyond repair. The
>>> emergency hot switchover to Hong Kong is delayed due to an ongoing DoS
>>> attack from Eastern European countries. The switchover eventually
>>> appears successful and data is synchronized with Hong Kong for the
>>> next 3 weeks. At the end of 3 weeks, with a massive raft of escalating
>>> complaints about images disappearing, it is realized that this is a
>>> result of local data caches expiring. The DoS attack covered the
>>> tracks of a passive data worm that only activates during back-up
>>> cycles, and the loss is irrecoverable due to backups aged over 2
>>> weeks being automatically deleted. With no archive strategy in place,
>>> it is
>>> estimated that the majority of digital assets have been permanently
>>> lost and estimates for 60% partial reconstruction from remaining cache
>>> snapshots and independent global archive sites run to over 2 years of
>>> work.
>>
>> This sort of scenario is why some of us have a thing about the backups :-)
>>
>> (Is there a good image backup of Commons and of the larger wikis, and
>> - and this one may be trickier - has anyone ever downloaded said
>> backups?)
>>
>>
>> - d.
> I've floated this to Erik a couple of times, but if the Foundation
> would like an IT disaster response / business continuity audit, I can
> do those.
>
>
Tape is -- still -- your friend here. Flip the write-protect tab after
writing, keep two sets of tapes, with one copy of each in each of two
secure, widely separated off-site locations run by two different
organizations, and you're sorted.
Tape is the dumb backstop that will keep the data even when your
supposedly infallible replicated and redundant systems fail. For
example, it got Google out of a hole quite recently when they had to
restore a significant number of Gmail accounts from tape. (see
http://www.talkincloud.com/the-solution-to-the-gmail-glitch-tape-backup/ )
And, unlike other long-term storage media, tape has a long track
record: its practical lifespan and risks are well understood, and there
are established procedures for making and verifying duplicate
sub-master copies, and for migrating archives to newer tape
technologies over time to extend their life.
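To make the "verify" part concrete, here is a minimal Python sketch of
the sort of checksum manifest one could write alongside each tape set,
so the two off-site copies can be checked independently (the file
layout and manifest format are my own illustration, not an existing
WMF procedure):

    # Sketch: build a SHA-256 manifest for a directory tree before it is
    # written to tape, so each off-site copy can be verified on its own.
    # Paths and manifest format are illustrative only.
    import hashlib
    import os
    import sys

    def sha256_of(path, bufsize=1 << 20):
        """Hash one file in 1 MB chunks, so large media files fit in memory."""
        h = hashlib.sha256()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(bufsize), b''):
                h.update(chunk)
        return h.hexdigest()

    def write_manifest(root, manifest_path):
        """Write 'checksum  path' lines, one per file, filenames sorted."""
        with open(manifest_path, 'w') as out:
            for dirpath, _, filenames in os.walk(root):
                for name in sorted(filenames):
                    full = os.path.join(dirpath, name)
                    out.write('%s  %s\n' % (sha256_of(full), full))

    if __name__ == '__main__':
        # e.g. python manifest.py /staging/commons-batch-01 batch-01.sha256
        write_manifest(sys.argv[1], sys.argv[2])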
If we say that Wikimedia Commons currently has ~10M images, and if we
allow 1 MB per image, that's only 10 TB: that will fit nicely on seven
LTO-5 tapes. If you use LTFS, you can also make data access and
long-term data robustness easier. If you like, you can slip a complete
dump of the MediaWiki source and the Commons database onto each tape as
well.
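Making the arithmetic explicit:

    # Back-of-envelope check of the tape count. All inputs are the rough
    # assumptions from the text above, not measured values.
    import math

    n_images = 10e6          # ~10M images on Commons (assumption)
    bytes_per_image = 1e6    # ~1 MB per image (assumption)
    lto5_native = 1.5e12     # LTO-5 native (uncompressed) capacity per tape

    total = n_images * bytes_per_image                          # 1e13 bytes = 10 TB
    print('archive size: %.0f TB' % (total / 1e12))             # -> 10 TB
    print('tapes needed: %d' % math.ceil(total / lto5_native))  # -> 7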
Even if I'm wrong by an order of magnitude, and 140 tapes are needed
instead of 14 (seven per set, with two sets as above), that's still
less than $10k of media -- and I wouldn't be surprised if tape storage
companies were eager to vie to be the one that can claim it donates the
media and drives behind Wikipedia's long-term backup system.
With two tape drives running in parallel at an optimal 140 MB/s each,
the whole backup would take less than a day. Even if I were wrong about
both the write speed and the archive size by an order of magnitude
each, it would still take less than three months.
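Spelling that out under the same assumptions:

    # Rough write-time estimate, same assumptions as above.
    drives = 2
    rate = 140e6             # LTO-5 native streaming rate, bytes/s per drive
    total = 10e12            # 10 TB archive, from the estimate above

    hours = total / (drives * rate) / 3600.0
    print('best case: %.1f hours' % hours)         # ~9.9 h, under a day

    # Pessimistic case: 10x the data and 1/10 the effective write speed.
    days = (10 * total) / (drives * rate / 10) / 86400.0
    print('worst case: %.0f days' % days)          # ~41 days, under 3 months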
The same tape systems could also, trivially, be used to back up all the
other WMF sites along similar lines.
-- Neil