On Fri, Aug 12, 2011 at 6:55 AM, David Gerard dgerard@gmail.com wrote:
[posted to foundation-l and wikitech-l, thread fork of a discussion
elsewhere]
THESIS: Our inadvertent monopoly is *bad*. We need to make it easy to fork the projects, so as to preserve them.
This is the single point of failure problem. The reasons for it having happened are obvious, but it's still a problem. Blog posts (please excuse me linking these yet again):
- http://davidgerard.co.uk/notes/2007/04/10/disaster-recovery-planning/
- http://davidgerard.co.uk/notes/2011/01/19/single-point-of-failure/
I dream of the encyclopedia being meaningfully backed up. This will require technical attention specifically to making the projects - particularly that huge encyclopedia in English - meaningfully forkable.
Yes, we should be making ourselves forkable. That way people don't *have* to trust us.
We're digital natives - we know the most effective way to keep something safe is to make sure there are lots of copies around.
How easy is it to set up a copy of English Wikipedia - all text, all pictures, all software, all extensions and customisations to the software? What bits are hard? If a sizable chunk of the community wanted to fork, how can we make it *easy* for them to do so?
Software and customizations are pretty easy -- that's all in SVN, and most of the config files are also made visible on noc.wikimedia.org.
If you're running a large site, there'll be more 'tips and tricks' in the actual setup that you may need to learn; most documentation on the setups is on wikitech.wikimedia.org and should be reasonably complete, but do feel free to ask for details on anything that seems to be missing. To just keep a data set, though, it's mostly a matter of disk space, bandwidth, and getting timely updates.
For data there are three parts:
* page data -- everything that's not deleted/oversighted is in the public dumps at download.wikimedia.org. These can be a bit slow to build and process because of the dump system's history; it doesn't scale as well as we'd like at the current data size. (There's a small dump-reading sketch after this list.)
More to the point, getting the data isn't enough for a "working" fork - a wiki without a community is an empty thing, so being able to move data around between different sites (merging changes, distributing new articles) would be a big plus.
This is a bit awkward with today's MediaWiki (though I think I've seen some extensions aiming to help); DVCSs like git show good ways to do this sort of thing -- forking a project on/from a git host like github or gitorious is usually the first step to contributing upstream! This is healthy and should be encouraged for wikis, too. (See the export/import sketch after this list.)
* media files -- these are freely copyable, but I'm not sure of the current state of obtaining them easily in bulk. As the data set grew into the terabytes it became impractical to just build .tar dumps. There are batch downloader tools available, and the metadata is all in the dumps and the API. (See the media-listing sketch after this list.)
* user data -- watchlists, email addresses, passwords, and preferences are not exported in bulk, but you can always obtain your own info, so an account migration tool would not be hard to devise. (There's a watchlist sketch after this list too.)
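To make the page-data bullet concrete, here's a minimal sketch (Python, standard library only) of streaming page titles out of a pages-articles dump once you've downloaded one -- the filename is just an example, and the namespace stripping is there so it works across export schema versions:

import bz2
import xml.etree.ElementTree as ET

# Example filename only; pages-articles dumps from download.wikimedia.org
# follow this naming pattern.
DUMP = 'enwiki-latest-pages-articles.xml.bz2'

def local(tag):
    # Drop the export schema's XML namespace so the check below is
    # independent of the schema version.
    return tag.rsplit('}', 1)[-1]

def page_titles(path):
    # Stream <page> elements out of the compressed dump without ever
    # holding the whole file in memory.
    with bz2.BZ2File(path) as f:
        for _event, elem in ET.iterparse(f):
            if local(elem.tag) == 'page':
                for child in elem:
                    if local(child.tag) == 'title':
                        yield child.text
                elem.clear()  # free the bulk of the parsed page

if __name__ == '__main__':
    for i, title in enumerate(page_titles(DUMP)):
        print(title)
        if i >= 9:
            break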
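On moving page data between sites: MediaWiki's existing export/import machinery is the nearest thing we have. Here's a rough sketch of pulling export XML for a page over api.php, which a fork could feed to Special:Import or the importDump.php maintenance script -- note this is one-way copying of current revisions, not real merging, so treat it as a starting point only:

import urllib.parse
import urllib.request

# Any MediaWiki install exposes the same api.php entry point; the URL and
# page title below are just examples.
API = 'https://en.wikipedia.org/w/api.php'

def export_pages(titles):
    # Fetch the standard MediaWiki export XML (current revisions) for the
    # given page titles via action=query&export.
    params = urllib.parse.urlencode({
        'action': 'query',
        'titles': '|'.join(titles),
        'export': 1,
        'exportnowrap': 1,
    })
    req = urllib.request.Request(API + '?' + params,
                                 headers={'User-Agent': 'fork-sketch/0.1'})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode('utf-8')

if __name__ == '__main__':
    print(export_pages(['Main Page'])[:400])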
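For media, the metadata side really is all in the API already; here's a sketch of listing file names and direct URLs for one batch (a real bulk mirror would follow the API's continuation parameters and throttle itself politely):

import json
import urllib.parse
import urllib.request

# Commons holds most of the media; the same query works against any
# MediaWiki api.php.
API = 'https://commons.wikimedia.org/w/api.php'

def list_image_urls(limit=50):
    # One batch of file metadata: name, direct URL, size, sha1.
    params = urllib.parse.urlencode({
        'action': 'query',
        'list': 'allimages',
        'aiprop': 'url|size|sha1',
        'ailimit': limit,
        'format': 'json',
    })
    req = urllib.request.Request(API + '?' + params,
                                 headers={'User-Agent': 'fork-sketch/0.1'})
    with urllib.request.urlopen(req) as resp:
        data = json.loads(resp.read().decode('utf-8'))
    return [(img['name'], img['url']) for img in data['query']['allimages']]

if __name__ == '__main__':
    for name, url in list_image_urls(10):
        print(name, url)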
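And on "you can always obtain your own info": this is roughly what an account migration helper would do -- log in with the API's two-step action=login flow and pull your own watchlist. The credentials are placeholders, and the script just prints the raw JSON rather than assuming a particular response layout:

import json
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar

API = 'https://en.wikipedia.org/w/api.php'

# The login is cookie-based, so keep session cookies between requests.
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(CookieJar()))

def call(**params):
    # POST one API request and decode the JSON reply.
    params['format'] = 'json'
    data = urllib.parse.urlencode(params).encode('utf-8')
    req = urllib.request.Request(API, data,
                                 headers={'User-Agent': 'fork-sketch/0.1'})
    with opener.open(req) as resp:
        return json.loads(resp.read().decode('utf-8'))

def my_watchlist(user, password):
    # Two-step login: the first call returns a token, the second uses it.
    first = call(action='login', lgname=user, lgpassword=password)
    call(action='login', lgname=user, lgpassword=password,
         lgtoken=first['login']['token'])
    # Watched titles for the logged-in user.
    return call(action='query', list='watchlistraw')

if __name__ == '__main__':
    print(json.dumps(my_watchlist('ExampleUser', 'example-password'),
                     indent=2))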
And I ask all this knowing that we don't have the paid tech resources to look into it - tech is a huge chunk of the WMF budget and we're still flat-out just keeping the lights on. But I do think it needs serious consideration for long-term preservation of all this work.
This is part of WMF's purpose, actually, so I'll disagree on that point. That's why, for instance, we insist on using so much open source -- we *want* everything we do to be reusable and rebuildable independently of us.
-- brion