Scraping:
Just a few months ago, Jeff Merkey downloaded all the images used by enwiki in a day and a half using 16 workstations and his wikix tool, so this is definitely possible.
Not something to be done by too many people too often, of course, but few have enough bandwidth anyway.
He was actually redistributing this image dump through a torrent, but it was taking a week to download. Since it was faster to fetch the images directly from WP, he killed the tracker.
There is some info in this mailing list's archives (look around March/April) and elsewhere on the net. The Linux executable is here: <ftp://www.wikigadugi.org/wiki/MediaWiki/wikix.tar.gz.bz2>, and it requires only an XML dump to work.
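For the curious, the core idea is simply to pull the image names out of the XML dump and then fetch each file. Here is a minimal illustrative sketch of that idea in Python; wikix itself is a separate program and differs in the details, and the hashed upload URL layout below is an assumption:

#!/usr/bin/env python
# Illustrative sketch only: extract [[Image:...]] names from a pages-articles
# XML dump and download each file. wikix works differently in detail; the
# hashed upload.wikimedia.org directory layout used here is an assumption.
import hashlib
import re
import time
import urllib.parse
import urllib.request

DUMP = "enwiki-pages-articles.xml"                   # any dump with page text
BASE = "https://upload.wikimedia.org/wikipedia/en"   # assumed upload host

image_re = re.compile(r"\[\[(?:Image|File):([^|\]]+)", re.IGNORECASE)

names = set()
with open(DUMP, encoding="utf-8") as dump:
    for line in dump:
        for name in image_re.findall(line):
            names.add(name.strip().replace(" ", "_"))

for name in sorted(names):
    md5 = hashlib.md5(name.encode("utf-8")).hexdigest()
    url = "%s/%s/%s/%s" % (BASE, md5[0], md5[:2], urllib.parse.quote(name))
    try:
        urllib.request.urlretrieve(url, name)
    except IOError:
        pass                      # missing file, or one hosted on Commons
    time.sleep(1)                 # one request per second, stay friendly

The one-second pause is what makes a single client slow; since the work list comes straight from the dump, it is easy to split across machines, which is presumably how 16 workstations got it down to a day and a half.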
If you want a torrent of the image dump, he can probably provide one again if you ask him politely.
Jerome
2008/1/13, Robert Rohde rarohde@gmail.com:
On Jan 13, 2008 5:56 AM, Anthony wikimail@inbox.org wrote:
On Jan 13, 2008 6:51 AM, Robert Rohde <rarohde@gmail.com> wrote:
On 1/13/08, David Gerard dgerard@gmail.com wrote:
<snip> One of the best protections we have against the Foundation being taken over by insane space aliens is good database dumps.
And how long has it been since we had good database dumps?
We haven't had an image dump in ages, and most of the major projects (enwiki, dewiki, frwiki, commons) routinely fail to generate full history dumps.
I assume it's not intentional, but at the moment it would be very difficult to fork the major projects in anything approaching a comprehensive way.
You don't really need the full database dump to fork. All you need is the current database dump and the stub dump with the list of authors. You'd lose some textual information this way, but not really that much. And with the money and time you'd have to put into creating a viable fork, it wouldn't be hard to get the rest through scraping and/or a live feed purchase anyway.
<snip>
For several months, enwiki's stub-meta-history dump has also failed (albeit silently; you don't notice unless you try downloading it). There is no dump at all that contains all of enwiki's contribution history.
As for scraping, don't kid yourself into thinking that is easy. I've run large-scale scraping efforts in the past. For enwiki you are talking about >2 million images in 2.1 million articles with 35 million edits. A friendly scraper (e.g. one that paused a second or so between requests) could easily be running for a few hundred days if it wanted to grab all of the images and edit history. An unfriendly, multi-threaded scraper could of course do better, but it would still likely take a couple of weeks.
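For a sense of scale, here is a rough back-of-envelope sketch of that arithmetic, together with the kind of single-threaded, rate-limited loop a friendly scraper would run (illustrative Python only, not the code actually used; the figures are the ones quoted above):

import time
import urllib.request

IMAGES = 2_000_000      # >2 million images
EDITS = 35_000_000      # 35 million edits
DELAY = 1.0             # pause about a second between requests

# One request per image and one per revision, at one request per second:
seconds = (IMAGES + EDITS) * DELAY
print(seconds / 86400)  # roughly 428 days, i.e. "a few hundred days"

def polite_fetch(urls, delay=DELAY):
    """Fetch URLs one at a time, pausing between requests."""
    for url in urls:
        try:
            yield url, urllib.request.urlopen(url).read()
        except IOError:
            pass              # skip failures and keep going
        time.sleep(delay)     # stay friendly either way

Running many requests in parallel divides the wall-clock time accordingly, which is roughly where the couple-of-weeks figure for an unfriendly, multi-threaded scraper comes from.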
-Robert Rohde