On 01/02/12 04:56, MZMcBride wrote:
Platonides wrote:
On 29/01/12 22:48, MZMcBride wrote:
Hi.
Is there a list of (current) impediments to forking a Wikimedia wiki?
replicating parser,
The parser is publicly available. We have no hidden tricks.
available dumps,
Dumps are running quite well currently. That shouldn't be a problem.
Image access,
Downloading images is slightly harder, although the big problem is
their huge number (and thus total size, which translates into the disk
space and bandwidth needed to download them), not getting the files themselves.
If you weren't interested in really forking the images,
$wgUseInstantCommons could be used.
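For reference, enabling InstantCommons is a one-line change in LocalSettings.php (a sketch of the standard MediaWiki setting; the fork would then fetch files from Commons on demand rather than hosting local copies):

```php
<?php
// LocalSettings.php — use Wikimedia Commons as a foreign file repository.
// File thumbnails are fetched (and cached) on demand, so the fork doesn't
// need to mirror the image collection at all.
$wgUseInstantCommons = true;
```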
replicating user table, etc.
User data deemed private is not available: basically password
hashes, watchlists and some preferences.
I appreciate the inline replies, but these were just ideas off the top of my
head. ;-) I was asking about a more thorough review.
It was easier to analyse the suggested problems than to think of new ones ;)
Also, replicating the parser is fairly difficult. Link
existence checks,
interwikis, image rendering with foreign repos, extension tags (math
support, hiero support, syntax highlighting), Tidy interaction, etc. make
actual replication very difficult. You can approximate, though.
I disagree. You don't need to "replicate" the parser. The parser is
already there (thankfully!). You need to install all the (relevant)
extensions enabled on WMF wikis, but that's simple (maybe I'm so used to
MediaWiki that I can't appreciate the difficulty?).
If you want to keep the mirror synced (i.e. not just a snapshot), that's
a bit harder, but by simulating edits (e.g. importText) it should go quite
well.
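A sketch of what one sync pass could look like, polling the standard `list=recentchanges` and `prop=revisions` API queries and re-applying the wikitext on the mirror. The `fetch` and `push` helpers are hypothetical stand-ins for the HTTP GET against the source wiki and the `action=edit` POST (with an edit token) against the mirror; this is illustrative, not an actual mirroring setup:

```python
def sync_step(fetch, push, since):
    """One sync pass: find pages changed since `since`, fetch their
    current wikitext, and re-apply it to the mirror as a simulated edit.
    `fetch(params)` performs an API GET on the source wiki;
    `push(title, text)` performs the edit on the mirror."""
    rc = fetch({"action": "query", "list": "recentchanges",
                "rcprop": "title", "rcend": since,
                "rclimit": 500, "format": "json"})
    for change in rc["query"]["recentchanges"]:
        rev = fetch({"action": "query", "prop": "revisions",
                     "rvprop": "content", "titles": change["title"],
                     "format": "json"})
        page = next(iter(rev["query"]["pages"].values()))
        text = page["revisions"][0]["*"]
        # push() would POST action=edit (with an edit token) on the mirror.
        push(change["title"], text)
```

A real loop would also follow the API's continuation parameters when more than 500 pages changed, and remember the newest timestamp seen for the next pass.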
Apart from the parser, depending on the level to which
you want to
replicate, certain parts of public data still aren't dumped, I think. The
user table isn't dumped publicly, as I recall, not even in sanitized form.
So you'd need a lot of API requests or the Toolserver there. These are the
types of impediments it'd be nice to document....
Good point.
The data is available at
http://en.wikipedia.org/w/api.php?action=query&list=allusers&aulimi…
but with 16,201,989 users, that's 32,404 requests!
Although if you only care about mirroring the data, you could skip the
user list, or lazy-load users as they begin editing.
(you could retrieve most users from logging.sql, but old account
creations weren't logged...)