Benj. Mako Hill, 29/05/2014 18:27:
> Without question, the current dumps put together by WikiTeam are an
> awesome resource for folks wanting to do Wikia research.
Thanks. I hope someone will use them. :-)
> That said, they are a strange sample and it's not clear how
> representative they are of other Wikia wikis. This makes it hard to
> use the sample to confidently answer a question like Piotr's.
Earlier dumps are basically random, but the one we made last winter
should include (save some errors) all the biggest wikis.
> Basically, logged-in users have to "request" every dump individually
> and by hand. Once a dump is requested, it will be created, put in S3,
> and then seems to be kept around for at least several months. I've
> found some shockingly big and important wikis without dumps, and 14k
> is a tiny proportion of all wikis! :-(
Wikia has some 400k wikis, but at least 350k of them have only one ns0
(main-namespace) page. Some of the "shockingly big" wikis may be
excluded from dumps for copyright reasons (the biggest example is
lyricswiki).
> If I can help or provide resources to help get a new comprehensive
> set of Wikia dumps, let me know.
Other than bugfixes for wikiteam [1], what we'd like to have is an
up-to-date list of all relevant (or non-empty) Wikia wikis, say the
20-30k biggest. The list I used was given to me by an unnamed person a
few years ago and I've always been too lazy to update it. It doesn't
take much if you're not afraid of hitting Wikia's APIs a bit. ;-)
https://bugzilla.wikimedia.org/show_bug.cgi?id=59943
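For a rough idea of what "hitting Wikia APIs a bit" could look like, here
is a minimal Python sketch. It assumes Wikia's public v1 `Wikis/List`
endpoint and the `items`/`stats`/`articles` field names seen in its
responses; the URL, parameters, and thresholds are assumptions for
illustration, not a verified WikiTeam tool.

```python
import json
import urllib.request

# Assumed endpoint: Wikia's v1 API listing of wikis, paged via `batch`.
API = "http://www.wikia.com/api/v1/Wikis/List"

def fetch_batch(batch, limit=250):
    """Fetch one page of wiki records (network call; fields assumed)."""
    url = "{}?expanded=1&limit={}&batch={}".format(API, limit, batch)
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["items"]

def non_empty(wikis, min_articles=2):
    """Keep only wikis with at least `min_articles` main-namespace pages,
    dropping the ~350k one-page wikis mentioned above."""
    return [w for w in wikis
            if w.get("stats", {}).get("articles", 0) >= min_articles]
```

One could loop `fetch_batch` over increasing `batch` values, filter each
page with `non_empty`, and sort the survivors by article count to get an
up-to-date "20-30k biggest" list.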
Nemo
[1]
https://code.google.com/p/wikiteam/issues/list