On Fri, Aug 12, 2011 at 6:55 AM, David Gerard dgerard@gmail.com wrote:
[posted to foundation-l and wikitech-l, thread fork of a discussion elsewhere]
THESIS: Our inadvertent monopoly is *bad*. We need to make it easy to fork the projects, so as to preserve them.
This is the single point of failure problem. The reasons for it having happened are obvious, but it's still a problem. Blog posts (please excuse me linking these yet again):
- http://davidgerard.co.uk/notes/2007/04/10/disaster-recovery-planning/
- http://davidgerard.co.uk/notes/2011/01/19/single-point-of-failure/
I dream of the encyclopedia being meaningfully backed up. This will require technical attention specifically to making the projects - particularly that huge encyclopedia in English - meaningfully forkable.
Yes, we should be making ourselves forkable. That way people don't *have* to trust us.
We're digital natives - we know the most effective way to keep something safe is to make sure there's lots of copies around.
How easy is it to set up a copy of English Wikipedia - all text, all pictures, all software, all extensions and customisations to the software? What bits are hard? If a sizable chunk of the community wanted to fork, how can we make it *easy* for them to do so?
Software and customizations are pretty easy -- that's all in SVN, and most of the config files are also made visible on noc.wikimedia.org.
If you're running a large site there'll be more 'tips and tricks' in the actual setup that you may need to learn; most documentation on the setups should be on wikitech.wikimedia.org, and do feel free to ask for details on anything that might seem missing -- it should be reasonably complete. But to just keep a data set, it's mostly a matter of disk space, bandwidth, and getting timely updates.
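For illustration, something like this (untested, in Python) would mirror the publicly posted config -- the /conf/ path and the exact file names are assumptions on my part, so check noc.wikimedia.org for what's actually published:

    # Hypothetical sketch: mirror the publicly visible Wikimedia config files.
    # The /conf/ path and file names below are assumptions, not confirmed paths.
    import os
    import urllib.request

    BASE = "https://noc.wikimedia.org/conf/"
    FILES = ["CommonSettings.php.txt",        # assumed file names
             "InitialiseSettings.php.txt"]

    def mirror_config(dest="wmf-config"):
        os.makedirs(dest, exist_ok=True)
        for name in FILES:
            with urllib.request.urlopen(BASE + name) as resp:
                data = resp.read()
            with open(os.path.join(dest, name), "wb") as out:
                out.write(data)
            print("saved %s (%d bytes)" % (name, len(data)))

    if __name__ == "__main__":
        mirror_config()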
For data there are three parts:
* page data -- everything that's not deleted/oversighted is in the public dumps at download.wikimedia.org, but these can be slow to build and process because of the dump system's legacy design; it doesn't scale as well as we really want at the current data size. (A processing sketch follows below.)
More to the point, getting data isn't enough for a "working" fork - a wiki without a community is an empty thing, so being able to move data around between different sites (merging changes, distributing new articles) would be a big plus.
This is a bit awkward with today's MediaWiki (though I think I've seen some extensions aiming to help); DVCSs like git show good ways to do this sort of thing -- forking a project on/from a git host like github or gitorious is usually the first step to contributing upstream! This is healthy and should be encouraged for wikis, too.
* media files -- these are freely copiable, but I'm not sure of the state of easily obtaining them in bulk. As the data set grew into the terabytes it became impractical to just build .tar dumps. There are batch downloader tools available, and the metadata is all in the dumps and the API (see the sketch below).
* user data -- watchlists, emails, passwords, prefs are not exported in bulk, but you can always obtain your own info so an account migration tool would not be hard to devise.
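To make the page-data point concrete, here's a rough, untested Python sketch of streaming through a pages-articles dump without loading it all into memory -- the file name is just an example, and the export XML namespace varies between dump versions, so it's stripped rather than hard-coded:

    # Rough sketch: stream a pages-articles dump page by page.
    # The file name is an example; fetch a real dump from download.wikimedia.org.
    import bz2
    import xml.etree.ElementTree as ET

    DUMP = "enwiki-latest-pages-articles.xml.bz2"  # example name

    def local(tag):
        # Dump XML carries a versioned export namespace; compare local names only.
        return tag.rsplit("}", 1)[-1]

    def iter_pages(path):
        """Yield (title, text of the latest revision) for each <page>."""
        title, text = None, None
        with bz2.open(path, "rb") as f:
            for _event, elem in ET.iterparse(f, events=("end",)):
                name = local(elem.tag)
                if name == "title":
                    title = elem.text
                elif name == "text":
                    text = elem.text or ""
                elif name == "page":
                    yield title, text
                    elem.clear()  # keep memory use roughly constant

    if __name__ == "__main__":
        count = 0
        for title, _body in iter_pages(DUMP):
            count += 1
            if count % 100000 == 0:
                print(count, "pages so far; latest:", title)
        print("total pages:", count)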
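On the media side, here's the metadata half of a batch downloader, paging through the API's list=allimages module -- the module and its aiprop/ailimit parameters are real, but actually fetching terabytes of files would need throttling, retries and a descriptive User-Agent, and the continuation handling here assumes a reasonably current API:

    # Rough sketch: page through file metadata (names, URLs, hashes) via the API.
    # Fetching the actual media in bulk needs far more care than shown here.
    import json
    import urllib.parse
    import urllib.request

    API = "https://commons.wikimedia.org/w/api.php"

    def api_get(**params):
        params.update(action="query", format="json")
        url = API + "?" + urllib.parse.urlencode(params)
        req = urllib.request.Request(url, headers={"User-Agent": "fork-sketch/0.1"})
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    def iter_image_urls(batch=500):
        """Yield (name, url, sha1) for every file, following API continuation."""
        cont = {}
        while True:
            data = api_get(list="allimages", ailimit=batch,
                           aiprop="url|sha1", **cont)
            for img in data["query"]["allimages"]:
                yield img["name"], img["url"], img.get("sha1")
            if "continue" not in data:
                break
            cont = data["continue"]

    if __name__ == "__main__":
        for i, (name, url, sha1) in enumerate(iter_image_urls()):
            print(name, url)
            if i >= 20:   # just a sample
                break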
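And for the user-data point, a sketch of pulling your own watchlist out through list=watchlistraw so it could be re-imported on a fork -- this assumes you've copied the watchlist token from your preferences (the wrowner/wrtoken parameters exist for this), and the username/token below are obviously placeholders:

    # Rough sketch: export your own watchlist via the API for re-import elsewhere.
    # Emails, passwords, etc. rightly never leave the source wiki.
    import json
    import urllib.parse
    import urllib.request

    API = "https://en.wikipedia.org/w/api.php"

    def fetch_watchlist(username, watchlist_token):
        titles, cont = [], {}
        while True:
            params = dict(action="query", format="json", list="watchlistraw",
                          wrowner=username, wrtoken=watchlist_token,
                          wrlimit=500, **cont)
            url = API + "?" + urllib.parse.urlencode(params)
            req = urllib.request.Request(url,
                                         headers={"User-Agent": "fork-sketch/0.1"})
            with urllib.request.urlopen(req) as resp:
                data = json.load(resp)
            # Result placement has varied between API versions; check both spots.
            items = data.get("watchlistraw") or data.get("query", {}).get("watchlistraw", [])
            titles.extend(item["title"] for item in items)
            if "continue" not in data:
                break
            cont = data["continue"]
        return titles

    if __name__ == "__main__":
        # Placeholder credentials -- substitute your own account's values.
        for title in fetch_watchlist("ExampleUser", "0123456789abcdef"):
            print(title)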
And I ask all this knowing that we don't have the paid tech resources to look into it - tech is a huge chunk of the WMF budget and we're still flat-out just keeping the lights on. But I do think it needs serious consideration for long-term preservation of all this work.
This is part of WMF's purpose, actually, so I'll disagree on that point. That's why for instance we insist on using so much open source -- we *want* everything we do to be able to be reused or rebuilt independently of us.
-- brion
- d.
On 12 August 2011 12:44, Brion Vibber brion@pobox.com wrote:
On Fri, Aug 12, 2011 at 6:55 AM, David Gerard dgerard@gmail.com wrote:
And I ask all this knowing that we don't have the paid tech resources to look into it - tech is a huge chunk of the WMF budget and we're still flat-out just keeping the lights on. But I do think it needs serious consideration for long-term preservation of all this work.
This is part of WMF's purpose, actually, so I'll disagree on that point. That's why for instance we insist on using so much open source -- we *want* everything we do to be able to be reused or rebuilt independently of us.
I'm speaking of making it happen, not whether it's an acknowledged need, which I know it is :-) It's an obvious Right Thing. But we have X dollars to do everything with, so more for this means less for something else. And this is a variety of technical debt, so it tends to sit on the eternal to-do list with the rest of the technical debt.
So it would need someone actively pushing it. I'm not even absolutely sure myself it's a priority item that someone should take up as a cause. I do think the communities need reminding of it from time to time, however.
- d.
On 12 August 2011 12:44, Brion Vibber brion@pobox.com wrote:
- user data -- watchlists, emails, passwords, prefs are not exported in bulk, but you can always obtain your own info so an account migration tool would not be hard to devise.
This one's tricky, because that's not free content, for good reason. It would need to be present for correct attribution at the least. I don't see anything intrinsically hard about that - have I missed anything about it that makes it hard?
- d.
On 12/08/2011 10:31 PM, David Gerard wrote:
This one's tricky, because that's not free content, for good reason. It would need to be present for correct attribution at the least. I don't see anything intrinsically hard about that - have I missed anything about it that makes it hard?
Well, you'd need to have namespaces for usernames, and that's about it. Or you could pursue something like OpenID as you mentioned.
Of course if you used the user database "as is" and pursued my proposed model for content mirroring, you could have an 'Attribution' tab for mirrored content up near the 'Page' and 'Discussion' tabs, and in that page show a list of everyone who had contributed to the content. You could update this list from time-to-time, at the same time as you did your mirroring. You could go as far as mentioning the number of edits particular users had made. It wouldn't be the same type of "blow by blow" attribution that you get where you can see a log of specifically what contributions particular users had made, but it would be a suitable attribution nonetheless, similar to the attribution at:
On 12/08/2011 10:44 PM, John Elliot wrote:
It wouldn't be the same type of "blow by blow" attribution that you get where you can see a log of specifically what contributions particular users had made
Although I guess it would be possible to go all out and support that too. You could leave the local user database as-is, and introduce a remote user database that includes a namespace, such as en.wikipedia.org, for usernames. For mirrored content you'd reference the remote user database, and for local content the local one.
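To make that concrete, here's a rough sketch of building such a namespaced attribution list for a mirrored page with the API's prop=contributors query module -- the module is real, but the 'wiki>username' record format is just an invented convention for illustration:

    # Rough sketch: list everyone credited on a page, namespaced by source wiki.
    # The "wiki>username" format is an invented convention, not a standard.
    # Anonymous edits only show up as an aggregate count, so they aren't listed.
    import json
    import urllib.parse
    import urllib.request

    def contributors(wiki, title):
        """Yield namespaced usernames credited on `title` at `wiki`."""
        api = "https://%s/w/api.php" % wiki
        cont = {}
        while True:
            params = dict(action="query", format="json", prop="contributors",
                          titles=title, pclimit=500, **cont)
            url = api + "?" + urllib.parse.urlencode(params)
            req = urllib.request.Request(url,
                                         headers={"User-Agent": "fork-sketch/0.1"})
            with urllib.request.urlopen(req) as resp:
                data = json.load(resp)
            page = next(iter(data["query"]["pages"].values()))
            for user in page.get("contributors", []):
                yield "%s>%s" % (wiki, user["name"])
            if "continue" not in data:
                break
            cont = data["continue"]

    if __name__ == "__main__":
        for credit in contributors("en.wikipedia.org", "Fork (software development)"):
            print(credit)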