[Foundation-l] thoughts on leakages
Robert Rohde
rarohde at gmail.com
Sun Jan 13 22:23:00 UTC 2008
On Jan 13, 2008 5:56 AM, Anthony <wikimail at inbox.org> wrote:
> On Jan 13, 2008 6:51 AM, Robert Rohde <rarohde at gmail.com> wrote:
> > On 1/13/08, David Gerard <dgerard at gmail.com> wrote:
> > >
> > > <snip>
> > > One of the best protections we have against the Foundation being taken
> > > over by insane space aliens is good database dumps.
> >
> > And how long has it been since we had good database dumps?
> >
> > We haven't had an image dump in ages, and most of the major projects
> > (enwiki, dewiki, frwiki, commons) routinely fail to generate full
> > history dumps.
> >
> > I assume it's not intentional, but at the moment it would be very
> > difficult to fork the major projects in anything approaching a
> > comprehensive way.
> >
> You don't really need the full database dump to fork. All you need is
> the current database dump and the stub dump with the list of authors.
> You'd lose some textual information this way, but not really that
> much. And with the money and time you'd have to put into creating a
> viable fork it wouldn't be hard to get the rest through scraping
> and/or a live feed purchase anyway.
>
> <snip>
For several months enwiki's stub-meta-history dump has also been failing
(silently, so you don't notice unless you try to download it). At the moment
there is no dump at all that contains all of enwiki's contribution history.
As for scraping, don't kid yourself that it is easy. I've run large-scale
scraping efforts in the past. For enwiki you are talking about >2 million
images in 2.1 million articles with 35 million edits. A friendly scraper
(e.g. one that pauses a second or so between requests) could easily run for a
few hundred days to grab all of the images and edit history. An unfriendly,
multi-threaded scraper could of course do better, but it would still likely
take a couple of weeks.
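
For what it's worth, here is the back-of-the-envelope arithmetic behind those
figures as a quick Python sketch. The one-request-per-image/revision and the
thread-count numbers are my own assumptions, not measurements:

# Rough estimate of scrape time for enwiki, assuming one HTTP request
# per image and per revision (my assumption), using the figures above.
seconds_per_day = 86400
images = 2000000        # >2 million images
edits = 35000000        # 35 million edits
total_requests = images + edits

# Friendly scraper: roughly one request per second.
friendly_days = total_requests * 1.0 / seconds_per_day
print("friendly scraper: about %d days" % friendly_days)      # ~428 days

# Unfriendly scraper: say 20 parallel threads with no pause, ~20 req/s.
unfriendly_days = total_requests / 20.0 / seconds_per_day
print("unfriendly scraper: about %d days" % unfriendly_days)  # ~21 days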
-Robert Rohde