Scraping:
Just a few months ago, Jeff Merkey downloaded all the images used by enwiki in a day and a half using 16 workstations and his wikix tool, so this is definitely possible.
Not something to be done by too many people too often, of course, but few have enough bandwidth anyway.
He was actually redistributing this image dump through a torrent, but it was taking a week to download. Since it was faster to fetch the images directly from WP, he killed the tracker.
There is some info in this mailing list's archives (look around March/April) and elsewhere on the net. The Linux executable is here: <ftp://www.wikigadugi.org/wiki/MediaWiki/wikix.tar.gz.bz2>, and it requires only an XML dump to work.
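For the curious, the core idea is simply to pull the image names out of the XML dump and then fetch each file. Here is a minimal illustrative sketch of that idea in Python; wikix itself is a separate program and differs in the details, and the hashed upload URL layout below is an assumption:

#!/usr/bin/env python
# Illustrative sketch only: extract [[Image:...]] names from a pages-articles
# XML dump and download each file. wikix works differently in detail; the
# hashed upload.wikimedia.org directory layout used here is an assumption.
import hashlib
import re
import time
import urllib.parse
import urllib.request

DUMP = "enwiki-pages-articles.xml"                   # any dump with page text
BASE = "https://upload.wikimedia.org/wikipedia/en"   # assumed upload host

image_re = re.compile(r"\[\[(?:Image|File):([^|\]]+)", re.IGNORECASE)

names = set()
with open(DUMP, encoding="utf-8") as dump:
    for line in dump:
        for name in image_re.findall(line):
            names.add(name.strip().replace(" ", "_"))

for name in sorted(names):
    md5 = hashlib.md5(name.encode("utf-8")).hexdigest()
    url = "%s/%s/%s/%s" % (BASE, md5[0], md5[:2], urllib.parse.quote(name))
    try:
        urllib.request.urlretrieve(url, name)
    except IOError:
        pass                      # missing file, or one hosted on Commons
    time.sleep(1)                 # one request per second, stay friendly

The one-second pause is what makes a single client slow; since the work list comes straight from the dump, it is easy to split across machines, which is presumably how 16 workstations got it down to a day and a half.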
If you want a torrent of the image dump, he can probably provide one again if you ask him politely.
Jerome
2008/1/13, Robert Rohde rarohde@gmail.com:
On Jan 13, 2008 5:56 AM, Anthony wikimail@inbox.org wrote:
On Jan 13, 2008 6:51 AM, Robert Rohde <rarohde@gmail.com> wrote:
On 1/13/08, David Gerard dgerard@gmail.com wrote:
<snip> One of the best protections we have against the Foundation being taken over by insane space aliens is good database dumps.
And how long has it been since we had good database dumps?
We haven't had an image dump in ages, and most of the major projects (enwiki, dewiki, frwiki, commons) routinely fail to generate full history dumps.
I assume it's not intentional, but at the moment it would be very difficult to fork the major projects in anything approaching a comprehensive way.
You don't really need the full database dump to fork. All you need is the current database dump and the stub dump with the list of authors. You'd lose some textual information this way, but not really that much. And with the money and time you'd have to put into creating a viable fork, it wouldn't be hard to get the rest through scraping and/or a live feed purchase anyway.
<snip>
For several months, enwiki's stub-meta-history dump has also failed (albeit silently; you don't notice unless you try downloading it). There is no dump at all that contains all of enwiki's contribution history.
As for scraping, don't kid yourself into thinking that is easy. I've run large-scale scraping efforts in the past. For enwiki you are talking about >2 million images in 2.1 million articles with 35 million edits. A friendly scraper (e.g. one that paused a second or so between requests) could easily be running for a few hundred days if it wanted to grab all of the images and edit history. An unfriendly, multi-threaded scraper could of course do better, but it would still likely take a couple of weeks.
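For a sense of scale, here is a rough back-of-envelope sketch of that arithmetic, together with the kind of single-threaded, rate-limited loop a friendly scraper would run (illustrative Python only, not the code actually used; the figures are the ones quoted above):

import time
import urllib.request

IMAGES = 2_000_000      # >2 million images
EDITS = 35_000_000      # 35 million edits
DELAY = 1.0             # pause about a second between requests

# One request per image and one per revision, at one request per second:
seconds = (IMAGES + EDITS) * DELAY
print(seconds / 86400)  # roughly 428 days, i.e. "a few hundred days"

def polite_fetch(urls, delay=DELAY):
    """Fetch URLs one at a time, pausing between requests."""
    for url in urls:
        try:
            yield url, urllib.request.urlopen(url).read()
        except IOError:
            pass              # skip failures and keep going
        time.sleep(delay)     # stay friendly either way

Running many requests in parallel divides the wall-clock time accordingly, which is roughly where the couple-of-weeks figure for an unfriendly, multi-threaded scraper comes from.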
-Robert Rohde