So, thoughts
on this? Is 'Move Dumping Process to another language' a
good idea at all?
I'd worry a lot less about what languages are used than whether the process
itself is scalable.
I'm not a mediawiki / wikipedia developer, but as a developer / sys
admin, I'd think that adding another environment stack requirement (in
the case of C# or Java) to the overall architecture would be a bad
idea in general.
The current dump process (which I created in 2004-2005
when we had a LOT
less data, and a LOT fewer computers) is very linear, which makes it awkward
to scale up:
* pull a list of all page revisions, in page/rev order
* as they go through, pump page/rev data to a linear XML stream
* pull that linear XML stream back in again, along with the last run's
completed linear XML stream
* while going through those, merge in the original page text (from the last
XML dump where available, or from the current database) and spit out a
linear XML stream containing both page/rev data and rev text
* and also stick compression on the end
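The linear flow above can be sketched roughly as follows. This is a toy model for discussion, not the actual dump scripts; every name and structure here is illustrative.

```python
# Toy sketch of the linear dump pipeline; names are illustrative only.
import re

def pull_page_revisions(db):
    """Yield (page, rev) records in page/rev order."""
    for page in sorted(db, key=lambda p: p["page_id"]):
        for rev in sorted(page["revisions"], key=lambda r: r["rev_id"]):
            yield page, rev

def write_stub_stream(db):
    """Stage 1: pump page/rev metadata into a linear XML-ish stream."""
    for page, rev in pull_page_revisions(db):
        yield f'<rev page="{page["page_id"]}" id="{rev["rev_id"]}"/>'

def merge_text(stub_stream, last_dump_text, fetch_from_db):
    """Stage 2: re-read the stub stream, taking each revision's text from
    the previous dump when present, otherwise from the database."""
    for line in stub_stream:
        rev_id = int(re.search(r'id="(\d+)"', line).group(1))
        text = last_dump_text.get(rev_id) or fetch_from_db(rev_id)
        yield line.replace("/>", f">{text}</rev>")

# Stand-ins for the wiki database and the previous completed dump:
db = [{"page_id": 1, "revisions": [{"rev_id": 10}, {"rev_id": 11}]}]
last = {10: "old text"}
full = list(merge_text(write_stub_stream(db), last, lambda r: f"db text {r}"))
```

Note how each stage consumes the previous stage's whole stream in order, which is exactly why any one slow step stalls the entire pipeline.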
About the only way we can scale it beyond a couple of CPUs
(compression/decompression as separate processes from the main PHP stream
handler) is to break it into smaller linear pieces and either reassemble
them, or require users to reassemble the pieces for linear processing.
Within each of those linear processes, any bottleneck will slow everything
down whether that's bzip2 or 7zip compression/decompression, fetching
revisions from the wiki's complex storage systems, the XML parsing, or
something in the middle.
What I'd recommend looking at is ways to actually rearrange the data so a)
there's less work that needs to be done to create a new dump and b) most of
that work can be done independently of other work that's going on, so it's
highly scalable.
Ideally, anything that hasn't changed since the last dump shouldn't need
*any* new data processing (right now it'll go through several stages of
slurping from a DB, decompression and recompression, XML parsing and
re-structuring, etc.). A new dump should consist basically of running
through, appending new data and removing deleted data, without touching the
things that haven't changed.
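The "don't reprocess the unchanged" idea might look something like this sketch: compare each page's latest revision id against the previous dump and only re-fetch pages that are new or changed. The data structures are hypothetical, chosen just to make the comparison concrete.

```python
# Sketch of an incremental update: only pages whose latest revision id
# changed since the previous dump get re-fetched; everything else is
# carried over untouched. Hypothetical structures, not existing code.
def incremental_dump(prev_dump, current_latest, fetch_page):
    """prev_dump: {page_id: (latest_rev_id, dumped_blob)}
    current_latest: {page_id: latest_rev_id} from the live database
    fetch_page: the expensive call, made only for changed/new pages."""
    new_dump = {}
    for page_id, latest in current_latest.items():
        prev = prev_dump.get(page_id)
        if prev and prev[0] == latest:
            new_dump[page_id] = prev  # unchanged: zero reprocessing
        else:
            new_dump[page_id] = (latest, fetch_page(page_id))
    return new_dump  # deleted pages simply drop out

prev = {1: (10, "blob-1"), 2: (20, "blob-2")}
latest = {1: 10, 2: 21, 3: 30}  # page 2 was edited, page 3 is new
result = incremental_dump(prev, latest, lambda p: f"fresh-{p}")
```

The per-page comparisons are also independent of each other, so they parallelize trivially.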
This may actually need a fancier structured data file format, or perhaps a
sensible directory structure and subfile structure -- ideally one that's
friendly to being updated via simple tools like rsync.
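One rsync-friendly layout (a guess at what "sensible directory structure" could mean, not a proposal anyone has implemented): bucket pages into many small files keyed by page id, so a bucket whose pages haven't changed stays byte-identical between dumps and rsync skips it entirely.

```python
# Sketch: map a page id to a small per-bucket file so unchanged buckets
# stay byte-identical across dumps. Layout is hypothetical.
def bucket_path(page_id, pages_per_bucket=1000):
    bucket = page_id // pages_per_bucket
    return f"pages/{bucket // 1000:03d}/{bucket % 1000:03d}.xml"
```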
I'm probably stating the obvious here...
Breaking the dump up by article namespace might be a starting point --
have one controller process for each namespace. That leaves 85% of the
work in the default namespace, which could then be segmented by any
combination of factors, maybe as simple as block batches of X number
of articles.
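The simplest version of that segmentation is fixed-size batches of page ids, each of which an independent worker could dump on its own (hypothetical helper, just to show the shape):

```python
# Sketch: carve a namespace's page ids into fixed-size, independent
# batches that separate worker processes could dump in parallel.
def batches(page_ids, batch_size=10000):
    ids = sorted(page_ids)
    return [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]
```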
When I'm importing the XML dump to MySQL, I have one process that
reads the XML file, and X processes (10 usually) working in parallel
to parse each article block on a first-available queue system. My
current implementation is a bit cumbersome, but maybe the idea could
be used for building the dump as well?
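The one-reader / N-workers pattern described above can be sketched with a queue and a thread pool; the actual import implementation may look quite different, and the "parsing" here is a trivial stand-in.

```python
# Sketch of one reader feeding N workers through a first-available
# queue. The real import code may differ; block parsing is faked.
import queue
import threading

def run(article_blocks, num_workers=4):
    q = queue.Queue(maxsize=num_workers * 2)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            block = q.get()
            if block is None:  # poison pill: no more work
                break
            parsed = block.upper()  # stand-in for real parse/insert work
            with lock:
                results.append(parsed)
            q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for block in article_blocks:  # the single reader feeds the queue
        q.put(block)
    for _ in threads:
        q.put(None)  # one poison pill per worker
    for t in threads:
        t.join()
    return sorted(results)
```

The bounded queue gives natural backpressure: the reader blocks when the workers fall behind instead of buffering the whole file.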
In general, I'm interested in pitching in some effort on anything
related to the dump/import processes.
--------------------------------------
James Linden
kodekrash(a)gmail.com
--------------------------------------