On 01/02/12 04:56, MZMcBride wrote:
Platonides wrote:
On 29/01/12 22:48, MZMcBride wrote:
Hi.
Is there a list of (current) impediments to forking a Wikimedia wiki?
replicating parser,
The parser is publicly available. We have no hidden tricks.
available dumps,
Dumps are running quite well currently. That shouldn't be a problem.
Image access,
Downloading images is slightly harder, although the big problem is
their huge number (and thus total size, which translates into the disk
space and bandwidth needed to download them), not getting the files themselves.
If you weren't interested in really forking the images,
$wgUseInstantCommons could be used.
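For reference, enabling InstantCommons is a one-line change in LocalSettings.php (a sketch of the standard MediaWiki setting; the fork would then fetch files from Commons on demand rather than hosting local copies):

```php
<?php
// LocalSettings.php — use Wikimedia Commons as a foreign file repository.
// File thumbnails are fetched (and cached) on demand, so the fork doesn't
// need to mirror the image collection at all.
$wgUseInstantCommons = true;
```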
replicating user table, etc.
User data deemed private is not available: basically password
hashes, watchlists and some preferences.
I appreciate the inline replies, but these were just ideas off the top of my
head. ;-) I was asking about a more thorough review.
It was easier to analyse the suggested problems than to think of new ones ;)
Also, replicating the parser is fairly difficult. Link
existence checks,
interwikis, image rendering with foreign repos, extension tags (math
support, hiero support, syntax highlighting), Tidy interaction, etc. make
actual replication very difficult. You can approximate, though.
I disagree. You don't need to "replicate" the parser. The parser is
already there (thankfully!). You need to install all the (relevant)
extensions enabled on WMF wikis, but that's simple (maybe I'm so used to
MediaWiki that I can't appreciate the difficulty?).
If you want to keep the mirror synced (i.e. not just a snapshot), that's
a bit harder, but by simulating edits (e.g. importText) it should go quite
well.
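A sketch of what one sync pass could look like, polling the standard `list=recentchanges` and `prop=revisions` API queries and re-applying the wikitext on the mirror. The `fetch` and `push` helpers are hypothetical stand-ins for the HTTP GET against the source wiki and the `action=edit` POST (with an edit token) against the mirror; this is illustrative, not an actual mirroring setup:

```python
def sync_step(fetch, push, since):
    """One sync pass: find pages changed since `since`, fetch their
    current wikitext, and re-apply it to the mirror as a simulated edit.
    `fetch(params)` performs an API GET on the source wiki;
    `push(title, text)` performs the edit on the mirror."""
    rc = fetch({"action": "query", "list": "recentchanges",
                "rcprop": "title", "rcend": since,
                "rclimit": 500, "format": "json"})
    for change in rc["query"]["recentchanges"]:
        rev = fetch({"action": "query", "prop": "revisions",
                     "rvprop": "content", "titles": change["title"],
                     "format": "json"})
        page = next(iter(rev["query"]["pages"].values()))
        text = page["revisions"][0]["*"]
        # push() would POST action=edit (with an edit token) on the mirror.
        push(change["title"], text)
```

A real loop would also follow the API's continuation parameters when more than 500 pages changed, and remember the newest timestamp seen for the next pass.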
Apart from the parser, depending on the level to which
you want to
replicate, certain parts of public data still aren't dumped, I think. The
user table isn't dumped publicly, as I recall, not even in sanitized form.
So you'd need a lot of API requests or the Toolserver there. These are the
types of impediments it'd be nice to document....
Good point.
The data is available at
http://en.wikipedia.org/w/api.php?action=query&list=allusers&aulimi…
but with 16,201,989 users, that's 32,404 requests!
Although if you only care about mirroring the data, you could skip the
user list, or lazy-load users as they begin editing.
(you could retrieve most users from logging.sql, but old account
creations weren't logged...)