On Fri, Aug 12, 2011 at 6:55 AM, David Gerard <dgerard(a)gmail.com> wrote:
[posted to foundation-l and wikitech-l, thread fork of a discussion
elsewhere]
THESIS: Our inadvertent monopoly is *bad*. We need to make it easy to
fork the projects, so as to preserve them.
This is the single point of failure problem. The reasons for it having
happened are obvious, but it's still a problem. Blog posts (please
excuse me linking these yet again):
* http://davidgerard.co.uk/notes/2007/04/10/disaster-recovery-planning/
* http://davidgerard.co.uk/notes/2011/01/19/single-point-of-failure/
I dream of the encyclopedia being meaningfully backed up. This will
require technical attention specifically to making the projects -
particularly that huge encyclopedia in English - meaningfully
forkable.
Yes, we should be making ourselves forkable. That way people don't
*have* to trust us.
We're digital natives - we know the most effective way to keep
something safe is to make sure there's lots of copies around.
How easy is it to set up a copy of English Wikipedia - all text, all
pictures, all software, all extensions and customisations to the
software? What bits are hard? If a sizable chunk of the community
wanted to fork, how can we make it *easy* for them to do so?
Software and customizations are pretty easy -- that's all in SVN, and most
of the config files are also made visible on
noc.wikimedia.org.
If you're running a large site there'll be more 'tips and tricks' in the
actual setup that you may need to learn; most documentation on the setups
should be on
wikitech.wikimedia.org, and do feel free to ask for details on
anything that might seem missing -- it should be reasonably complete. But to
just keep a data set, it's mostly a matter of disk space, bandwidth, and
getting timely updates.
For data there are three parts:
* page data -- everything that's not deleted/oversighted is in the public
dumps at download.wikimedia.org, though those may be a bit slow to build and
process due to the dump system's legacy design; it doesn't scale as well as
we'd really like at the current data size.
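To make the dump part concrete, here's a minimal Python sketch -- the URL pattern and the stripped-down XML fragment are illustrative of the shape, not the exact dump layout (real export XML carries a namespace and much more detail) -- of how a fork might locate a pages dump and pull titles out of it:

```python
import xml.etree.ElementTree as ET

def dump_url(wiki, snapshot="latest"):
    """Build the conventional URL for a pages-articles dump on
    download.wikimedia.org (pattern is illustrative)."""
    return ("http://download.wikimedia.org/%s/%s/%s-%s-pages-articles.xml.bz2"
            % (wiki, snapshot, wiki, snapshot))

# A toy fragment in the spirit of the MediaWiki XML export format
# (namespace declarations omitted for brevity).
SAMPLE = """<mediawiki>
  <page>
    <title>Single point of failure</title>
    <revision><id>1</id><text>A SPOF is ...</text></revision>
  </page>
  <page>
    <title>Fork (software development)</title>
    <revision><id>2</id><text>A fork is ...</text></revision>
  </page>
</mediawiki>"""

def page_titles(xml_text):
    """Extract page titles from an export-style XML document."""
    root = ET.fromstring(xml_text)
    return [page.findtext("title") for page in root.findall("page")]

print(dump_url("enwiki"))
print(page_titles(SAMPLE))
```

A fork would stream the real (much larger, bzip2-compressed) file rather than hold it in memory, but the access pattern is the same.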
More to the point, getting data isn't enough for a "working" fork - a wiki
without a community is an empty thing, so being able to move data around
between different sites (merging changes, distributing new articles) would
be a big plus.
This is a bit awkward with today's MediaWiki (though I think I've seen some
extensions aiming to help); DVCSs like git show good ways to do this sort of
thing -- forking a project on/from a git hoster like github or gitorious is
usually the first step to contributing upstream! This is healthy and should
be encouraged for wikis, too.
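As a toy illustration of what cross-site merging might look like -- this is not an existing MediaWiki feature, just a three-way merge in the DVCS spirit, with invented data structures mapping page title to a (timestamp, text) pair:

```python
def merge_revisions(base, ours, theirs):
    """Three-way merge of page snapshots from two wikis that share a
    common base: take whichever side changed a page, and flag pages
    changed on both sides for manual review (like a DVCS conflict)."""
    merged, conflicts = {}, []
    for title in set(base) | set(ours) | set(theirs):
        b = base.get(title)
        o = ours.get(title, b)
        t = theirs.get(title, b)
        if o == t:
            merged[title] = o
        elif o == b:
            merged[title] = t        # only the other site changed it
        elif t == b:
            merged[title] = o        # only we changed it
        else:
            conflicts.append(title)  # both changed: needs a human
            merged[title] = max(o, t)  # provisional: newest edit wins
    return merged, conflicts

base = {"Foo": (1, "v1")}
ours = {"Foo": (2, "v2 from our site"), "Bar": (2, "new article here")}
theirs = {"Foo": (3, "v3 from their site")}
merged, conflicts = merge_revisions(base, ours, theirs)
print(conflicts)        # "Foo" was edited on both sides
print(merged["Bar"])    # new articles flow across with no conflict
```

Real wikitext merging would of course need diffing at the text level, not whole-page snapshots, but the shape of the problem is the familiar DVCS one.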
* media files -- these are freely copiable, but I'm not sure of the state of
easily obtaining them in bulk. As the data set grew into the terabytes it
became impractical to just build .tar dumps. There are batch downloader
tools available, and the metadata is all in the dumps and the API.
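For example, a batch tool can lean on the API's imageinfo query to find file URLs before fetching. Here's a sketch; the trimmed JSON is illustrative of the response shape rather than a verbatim server reply:

```python
import json
import urllib.parse

API = "http://commons.wikimedia.org/w/api.php"

def imageinfo_query(titles):
    """Build an API request URL asking for original file URLs."""
    params = {
        "action": "query",
        "titles": "|".join(titles),
        "prop": "imageinfo",
        "iiprop": "url",
        "format": "json",
    }
    return API + "?" + urllib.parse.urlencode(params)

# A trimmed-down sample of the JSON shape such a query returns.
SAMPLE_RESPONSE = json.dumps({
    "query": {"pages": {
        "123": {"title": "File:Example.jpg",
                "imageinfo": [{"url": "http://upload.wikimedia.org/x/Example.jpg"}]}
    }}
})

def file_urls(response_text):
    """Map file page titles to their download URLs."""
    pages = json.loads(response_text)["query"]["pages"]
    return {p["title"]: p["imageinfo"][0]["url"] for p in pages.values()}

print(imageinfo_query(["File:Example.jpg"]))
print(file_urls(SAMPLE_RESPONSE))
```

A bulk downloader would walk the dump's image table for titles, then batch them through queries like this and fetch the resulting URLs.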
* user data -- watchlists, emails, passwords, and prefs are not exported in
bulk, but you can always obtain your own info, so an account migration tool
would not be hard to devise.
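As a sketch of such a tool -- the record format here is pure invention, and in practice the user would fetch their own data while logged in (e.g. their watchlist via the API); the point is just that a user's exported data round-trips cleanly to a new site:

```python
import json

def export_account(username, watchlist, prefs):
    """Serialize a user's own data into a portable migration record
    (hypothetical format, not a MediaWiki feature)."""
    return json.dumps({
        "username": username,
        "watchlist": sorted(watchlist),
        "preferences": prefs,
    })

def import_account(record_text):
    """Read the record back on the destination wiki."""
    record = json.loads(record_text)
    return record["username"], set(record["watchlist"]), record["preferences"]

record = export_account("Example", {"Main Page", "Wikipedia:Sandbox"},
                        {"skin": "vector"})
name, watchlist, prefs = import_account(record)
print(name, sorted(watchlist))
```

Passwords would never travel this way, of course -- the user just sets a new one -- but watchlists and preferences are plain data.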
And I ask all this knowing that we don't have the
paid tech resources
to look into it - tech is a huge chunk of the WMF budget and we're
still flat-out just keeping the lights on. But I do think it needs
serious consideration for long-term preservation of all this work.
This is part of WMF's purpose, actually, so I'll disagree on that point.
That's why for instance we insist on using so much open source -- we *want*
everything we do to be able to be reused or rebuilt independently of us.
-- brion
- d.
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l