So, thoughts
on this? Is 'Move Dumping Process to another language' a
good idea at all?
I'd worry a lot less about what languages are used than whether the process
itself is scalable.
I'm not a mediawiki / wikipedia developer, but as a developer / sys
admin, I'd think that adding another environment stack requirement (in
the case of C# or Java) to the overall architecture would be a bad
idea in general.
The current dump process (which I created in 2004-2005
when we had a LOT
less data, and a LOT fewer computers) is very linear, which makes it awkward
to scale up:
* pull a list of all page revisions, in page/rev order
* as they go through, pump page/rev data to a linear XML stream
* pull that linear XML stream back in again, along with the last run's
completed linear XML stream
* while going through those, merge in the original page text (from the last
XML dump where available, or from the current database) and spit out a
linear XML stream containing both page/rev data and rev text
* and also stick compression on the end
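The linear flow above can be sketched roughly as follows. This is a toy model for discussion, not the actual dump scripts; every name and structure here is illustrative.

```python
# Toy sketch of the linear dump pipeline; names are illustrative only.
import re

def pull_page_revisions(db):
    """Yield (page, rev) records in page/rev order."""
    for page in sorted(db, key=lambda p: p["page_id"]):
        for rev in sorted(page["revisions"], key=lambda r: r["rev_id"]):
            yield page, rev

def write_stub_stream(db):
    """Stage 1: pump page/rev metadata into a linear XML-ish stream."""
    for page, rev in pull_page_revisions(db):
        yield f'<rev page="{page["page_id"]}" id="{rev["rev_id"]}"/>'

def merge_text(stub_stream, last_dump_text, fetch_from_db):
    """Stage 2: re-read the stub stream, taking each revision's text from
    the previous dump when present, otherwise from the database."""
    for line in stub_stream:
        rev_id = int(re.search(r'id="(\d+)"', line).group(1))
        text = last_dump_text.get(rev_id) or fetch_from_db(rev_id)
        yield line.replace("/>", f">{text}</rev>")

# Stand-ins for the wiki database and the previous completed dump:
db = [{"page_id": 1, "revisions": [{"rev_id": 10}, {"rev_id": 11}]}]
last = {10: "old text"}
full = list(merge_text(write_stub_stream(db), last, lambda r: f"db text {r}"))
```

Note how each stage consumes the previous stage's whole stream in order, which is exactly why any one slow step stalls the entire pipeline.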
About the only way we can scale it beyond a couple of CPUs
(compression/decompression as separate processes from the main PHP stream
handler) is to break it into smaller linear pieces and either reassemble
them, or require users to reassemble the pieces for linear processing.
Within each of those linear processes, any bottleneck will slow everything
down whether that's bzip2 or 7zip compression/decompression, fetching
revisions from the wiki's complex storage systems, the XML parsing, or
something in the middle.
What I'd recommend looking at is ways to actually rearrange the data so a)
there's less work that needs to be done to create a new dump and b) most of
that work can be done independently of other work that's going on, so it's
highly scalable.
Ideally, anything that hasn't changed since the last dump shouldn't need
*any* new data processing (right now it'll go through several stages of
slurping from a DB, decompression and recompression, XML parsing and
re-structuring, etc.). A new dump should consist basically of running
through, appending new data and removing deleted data, without touching the
things that haven't changed.
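The "don't reprocess the unchanged" idea might look something like this sketch: compare each page's latest revision id against the previous dump and only re-fetch pages that are new or changed. The data structures are hypothetical, chosen just to make the comparison concrete.

```python
# Sketch of an incremental update: only pages whose latest revision id
# changed since the previous dump get re-fetched; everything else is
# carried over untouched. Hypothetical structures, not existing code.
def incremental_dump(prev_dump, current_latest, fetch_page):
    """prev_dump: {page_id: (latest_rev_id, dumped_blob)}
    current_latest: {page_id: latest_rev_id} from the live database
    fetch_page: the expensive call, made only for changed/new pages."""
    new_dump = {}
    for page_id, latest in current_latest.items():
        prev = prev_dump.get(page_id)
        if prev and prev[0] == latest:
            new_dump[page_id] = prev  # unchanged: zero reprocessing
        else:
            new_dump[page_id] = (latest, fetch_page(page_id))
    return new_dump  # deleted pages simply drop out

prev = {1: (10, "blob-1"), 2: (20, "blob-2")}
latest = {1: 10, 2: 21, 3: 30}  # page 2 was edited, page 3 is new
result = incremental_dump(prev, latest, lambda p: f"fresh-{p}")
```

The per-page comparisons are also independent of each other, so they parallelize trivially.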
This may actually need a fancier structured data file format, or perhaps a
sensible directory structure and subfile structure -- ideally one that's
friendly to being updated via simple tools like rsync.
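One rsync-friendly layout (a guess at what "sensible directory structure" could mean, not a proposal anyone has implemented): bucket pages into many small files keyed by page id, so a bucket whose pages haven't changed stays byte-identical between dumps and rsync skips it entirely.

```python
# Sketch: map a page id to a small per-bucket file so unchanged buckets
# stay byte-identical across dumps. Layout is hypothetical.
def bucket_path(page_id, pages_per_bucket=1000):
    bucket = page_id // pages_per_bucket
    return f"pages/{bucket // 1000:03d}/{bucket % 1000:03d}.xml"
```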
I'm probably stating the obvious here...
Breaking the dump up by article namespace might be a starting point --
have one controller process for each namespace. That leaves 85% of the
work in the default namespace, which could then be segmented by any
combination of factors, maybe as simple as block batches of X number
of articles.
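The simplest version of that segmentation is fixed-size batches of page ids, each of which an independent worker could dump on its own (hypothetical helper, just to show the shape):

```python
# Sketch: carve a namespace's page ids into fixed-size, independent
# batches that separate worker processes could dump in parallel.
def batches(page_ids, batch_size=10000):
    ids = sorted(page_ids)
    return [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]
```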
When I'm importing the XML dump to MySQL, I have one process that
reads the XML file, and X processes (10 usually) working in parallel
to parse each article block on a first-available queue system. My
current implementation is a bit cumbersome, but maybe the idea could
be used for building the dump as well?
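The one-reader / N-workers pattern described above can be sketched with a queue and a thread pool; the actual import implementation may look quite different, and the "parsing" here is a trivial stand-in.

```python
# Sketch of one reader feeding N workers through a first-available
# queue. The real import code may differ; block parsing is faked.
import queue
import threading

def run(article_blocks, num_workers=4):
    q = queue.Queue(maxsize=num_workers * 2)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            block = q.get()
            if block is None:  # poison pill: no more work
                break
            parsed = block.upper()  # stand-in for real parse/insert work
            with lock:
                results.append(parsed)
            q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for block in article_blocks:  # the single reader feeds the queue
        q.put(block)
    for _ in threads:
        q.put(None)  # one poison pill per worker
    for t in threads:
        t.join()
    return sorted(results)
```

The bounded queue gives natural backpressure: the reader blocks when the workers fall behind instead of buffering the whole file.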
In general, I'm interested in pitching in some effort on anything
related to the dump/import processes.
--------------------------------------
James Linden
kodekrash(a)gmail.com
--------------------------------------