So, thoughts on this? Is 'Move Dumping Process to another language' a good idea at all?
I'd worry a lot less about what languages are used than whether the process itself is scalable.
I'm not a mediawiki / wikipedia developer, but as a developer / sys admin, I'd think that adding another environment stack requirement (in the case of C# or Java) to the overall architecture would be a bad idea in general.
The current dump process (which I created in 2004-2005 when we had a LOT less data, and a LOT fewer computers) is very linear, which makes it awkward to scale up:
- pull a list of all page revisions, in page/rev order
  * as they go through, pump page/rev data to a linear XML stream
- pull that linear XML stream back in again, as well as the last time's completed linear XML stream
  * while going through those, combine the original page text from the last XML dump, or from the current database, and spit out a linear XML stream containing both page/rev data and rev text
  * and also stick compression on the end
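For illustration, that two-pass shape looks roughly like the Python sketch below. The function names (iter_page_revisions, text_from_previous_dump, text_from_database) are invented placeholders standing in for the real PHP dump scripts and their data sources:

    import bz2
    from xml.sax.saxutils import escape

    def iter_page_revisions():
        # Stand-in for the DB query listing every revision in page/rev order.
        yield {"page_id": 1, "rev_id": 10, "title": "Example"}
        yield {"page_id": 1, "rev_id": 11, "title": "Example"}

    def text_from_previous_dump(rev_id):
        # Stand-in for prefetching text from the previous completed dump;
        # returns None when the revision is new since last time.
        return None

    def text_from_database(rev_id):
        # Stand-in for fetching text from the wiki's storage backends.
        return "text of revision %d" % rev_id

    def write_stub_pass(stub_path):
        # Pass 1: page/rev metadata only, as a linear "stub" stream.
        with open(stub_path, "w") as stubs:
            for rev in iter_page_revisions():
                stubs.write('<revision page="%d" id="%d" title="%s"/>\n'
                            % (rev["page_id"], rev["rev_id"], escape(rev["title"])))

    def write_text_pass(stub_path, out_path):
        # Pass 2: re-read the stubs, attach text, compress on the way out.
        with open(stub_path) as stubs, bz2.open(out_path, "wt") as out:
            for line in stubs:
                rev_id = int(line.split(' id="')[1].split('"')[0])  # crude, sketch only
                text = text_from_previous_dump(rev_id) or text_from_database(rev_id)
                out.write(line.rstrip("\n").replace("/>", ">")
                          + "<text>%s</text></revision>\n" % escape(text))

    write_stub_pass("stub.xml")
    write_text_pass("stub.xml", "pages-full.xml.bz2")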
About the only way we can scale it beyond a couple of CPUs (compression/decompression as separate processes from the main PHP stream handler) is to break it into smaller linear pieces and either reassemble them, or require users to reassemble the pieces for linear processing.
Within each of those linear processes, any bottleneck will slow everything down, whether that's bzip2 or 7zip compression/decompression, fetching revisions from the wiki's complex storage systems, the XML parsing, or something in the middle.
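As a rough illustration of the piece-wise approach, here is a Python sketch that dumps page-ID ranges in parallel and reassembles them by plain concatenation; concatenated bzip2 streams still decompress as a single multi-stream file, so "reassembly" can be nothing more than cat. The chunk sizes and the dump_chunk() body are made up:

    import bz2
    import shutil
    from concurrent.futures import ProcessPoolExecutor

    def dump_chunk(first_page, last_page):
        # Placeholder: a real worker would stream <page>/<revision> XML for
        # this page-ID range; here it just emits one line per page.
        path = "chunk-%09d-%09d.xml.bz2" % (first_page, last_page)
        with bz2.open(path, "wt") as out:
            for page_id in range(first_page, last_page + 1):
                out.write("<page id='%d'/>\n" % page_id)
        return path

    def dump_in_chunks(max_page_id, chunk_size=1000, workers=4):
        ranges = [(lo, min(lo + chunk_size - 1, max_page_id))
                  for lo in range(1, max_page_id + 1, chunk_size)]
        with ProcessPoolExecutor(max_workers=workers) as pool:
            pieces = list(pool.map(dump_chunk, *zip(*ranges)))
        # Reassemble for users who still want one linear file to process.
        with open("pages-full.xml.bz2", "wb") as combined:
            for piece in pieces:
                with open(piece, "rb") as part:
                    shutil.copyfileobj(part, combined)

    if __name__ == "__main__":
        dump_in_chunks(max_page_id=5000)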
What I'd recommend looking at is ways to actually rearrange the data so a) there's less work that needs to be done to create a new dump and b) most of that work can be done independently of other work that's going on, so it's highly scalable.
Ideally, anything that hasn't changed since the last dump shouldn't need *any* new data processing (right now it'll go through several stages of slurping from a DB, decompression and recompression, XML parsing and re-structuring, etc). A new dump should basically consist of running through, appending new data and removing deleted data, without touching the things that haven't changed.
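A hand-wavy sketch of that idea, assuming a per-page file layout and a hypothetical changed_pages() feed (in practice it would be driven by something like page_touched or the recentchanges table):

    import os
    import shutil

    def changed_pages(since):
        # Placeholder for "which pages were edited or deleted since the last run".
        return {42: "edited", 99: "deleted"}

    def regenerate_page(page_id, new_dir):
        # Placeholder: fetch the page's current revisions and write a fresh
        # compressed piece into new_dir.
        pass

    def rebuild_dump(old_dir, new_dir, last_run_time):
        os.makedirs(new_dir, exist_ok=True)
        changes = changed_pages(last_run_time)      # page_id -> "edited" / "deleted"
        for name in os.listdir(old_dir):            # pieces named page-<id>.xml.bz2
            page_id = int(name.split("-")[1].split(".")[0])
            if page_id not in changes:
                # Untouched page: carry the compressed piece over verbatim
                # (or hard-link it) -- no decompress/parse/recompress cycle.
                shutil.copy2(os.path.join(old_dir, name), os.path.join(new_dir, name))
        for page_id, change in changes.items():
            if change == "edited":
                regenerate_page(page_id, new_dir)   # only changed pages get rebuilt
            # deleted pages are simply not copied into the new dump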
This may actually need a fancier structured data file format, or perhaps a sensible directory structure and subfile structure -- ideally one that's friendly to being updated via simple things like rsync.
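Purely as an example of what an rsync-friendly layout could look like (the layout below is made up): give every page a stable path derived from its ID, so a new dump only rewrites the files for pages that actually changed.

    import os

    def page_path(root, namespace, page_id):
        # e.g. dump/ns-0/00/00/page-42.xml.bz2 -- two levels of fan-out keeps
        # directories a manageable size even with millions of pages.
        bucket = "%08d" % page_id
        return os.path.join(root, "ns-%d" % namespace, bucket[:2], bucket[2:4],
                            "page-%d.xml.bz2" % page_id)

    print(page_path("dump", 0, 42))       # dump/ns-0/00/00/page-42.xml.bz2
    print(page_path("dump", 0, 1234567))  # dump/ns-0/01/23/page-1234567.xml.bz2

A mirror could then stay reasonably current with a plain rsync -a --delete against the dump tree, since unchanged files keep their old contents and timestamps.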
I'm probably stating the obvious here...
Breaking the dump up by article namespace might be a starting point -- have 1 controller process for each namespace. That leaves 85% of the work in the default namespace, which could then be segmented by any combination of factors, maybe as simple as blocks of X articles.
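Something like this for the planning side, with made-up article counts, so each namespace controller hands out fixed-size batches to whatever workers are free:

    # ns -> article count (invented numbers, just for the sketch)
    NAMESPACES = {0: 850000, 1: 60000, 2: 50000, 10: 25000, 14: 15000}

    def plan_batches(article_counts, batch_size=10000):
        """Yield (namespace, first_article, last_article) work units."""
        for ns, count in sorted(article_counts.items()):
            for lo in range(1, count + 1, batch_size):
                yield ns, lo, min(lo + batch_size - 1, count)

    for ns, lo, hi in plan_batches(NAMESPACES):
        print("namespace %d: dump articles %d-%d" % (ns, lo, hi))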
When I'm importing the XML dump to MySQL, I have one process that reads the XML file, and X processes (10 usually) working in parallel to parse each article block on a first-available queue system. My current implementation is a bit cumbersome, but maybe the idea could be used for building the dump as well?
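That queue shape, as a generic Python sketch (this isn't the actual importer code; the MySQL insert is left as a stub): one reader process streams the XML and pushes whole <page> blocks onto a bounded queue, and N workers take blocks first-come-first-served and parse them.

    import multiprocessing as mp
    import xml.etree.ElementTree as ET

    NUM_WORKERS = 10

    def reader(dump_path, queue):
        # Stream the dump and hand each complete <page> element to the workers.
        for _, elem in ET.iterparse(dump_path):
            if elem.tag.endswith("page"):
                queue.put(ET.tostring(elem))
                elem.clear()                  # free the element after queueing it
        for _ in range(NUM_WORKERS):
            queue.put(None)                   # one stop marker per worker

    def worker(queue):
        while True:
            block = queue.get()
            if block is None:
                break
            page = ET.fromstring(block)
            # Placeholder for the parse-and-INSERT step against MySQL.
            # "{*}title" matches the title whether or not the dump uses namespaces.
            print("imported page:", page.findtext("{*}title"))

    if __name__ == "__main__":
        q = mp.Queue(maxsize=100)             # bounded so the reader can't race ahead
        workers = [mp.Process(target=worker, args=(q,)) for _ in range(NUM_WORKERS)]
        for w in workers:
            w.start()
        reader("pages-full.xml", q)           # hypothetical dump file name
        for w in workers:
            w.join()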
In general, I'm interested in pitching in some effort on anything related to the dump/import processes.
--------------------------------------
James Linden
kodekrash@gmail.com
--------------------------------------