On Thu, Mar 24, 2011 at 1:05 PM, Yuvi Panda yuvipanda@gmail.com wrote:
Hi, I'm Yuvi, a student looking forward to working with MediaWiki via this year's GSoC.
I want to work on something dump related, and have been bugging apergos (Ariel) for a while now. One of the things that popped into my head is moving the dump process to another language (say, C# or Java, or being very macho, C++ or C). This would give the dump process quite a speed boost (the profiling I did[1] seems to indicate that the DB is not the bottleneck, though I might be wrong), and it could also be done in a way that makes running distributed dumps easier/more elegant.
So, thoughts on this? Is 'Move Dumping Process to another language' a good idea at all?
I'd worry a lot less about what languages are used than whether the process itself is scalable.
The current dump process (which I created in 2004-2005 when we had a LOT less data, and a LOT fewer computers) is very linear, which makes it awkward to scale up:
* pull a list of all page revisions, in page/rev order
* as they go through, pump page/rev data to a linear XML stream
* pull that linear XML stream back in again, as well as the last time's completed linear XML stream
* while going through those, combine the original page text from the last XML dump, or from the current database, and spit out a linear XML stream containing both page/rev data and rev text
* and also stick compression on the end
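To make that concrete, here's a rough sketch of that single linear pass (in Python for brevity; the real code is PHP, and helpers like last_dump_text and db_text are made up here):

    # Sketch of the single linear pass described above -- illustrative only.
    import bz2

    def dump_pass(revisions, last_dump_text, db_text, out_path):
        """revisions: iterable of rev dicts in page/rev order.
        last_dump_text: dict rev_id -> text from the previous dump.
        db_text: callable rev_id -> text, hitting the current database."""
        with bz2.open(out_path, "wt", encoding="utf-8") as out:
            out.write("<mediawiki>\n")
            for rev in revisions:                    # one long, strictly ordered pass
                text = last_dump_text.get(rev["id"])
                if text is None:                     # not in the old dump: fetch fresh
                    text = db_text(rev["id"])
                out.write("  <revision id='%d'>%s</revision>\n" % (rev["id"], text))
            out.write("</mediawiki>\n")

Every stage -- the revision fetch, the old-dump lookup, the XML writing, the compression -- sits on the same critical path.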
About the only way we can scale it beyond a couple of CPUs (compression/decompression as separate processes from the main PHP stream handler) is to break it into smaller linear pieces and either reassemble them, or require users to reassemble the pieces for linear processing.
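As an illustration of the "smaller linear pieces" idea (again only a sketch; splitting by page-id range is just one possible way to cut it):

    # Sketch: run the same linear pass over page-id ranges in parallel,
    # producing pieces that are later concatenated or consumed separately.
    # dump_chunk's body is elided; all names here are hypothetical.
    from concurrent.futures import ProcessPoolExecutor

    def dump_chunk(start_id, end_id):
        out_path = "pages-%09d-%09d.xml.bz2" % (start_id, end_id)
        # ... same linear pass as above, limited to pages in [start_id, end_id] ...
        return out_path

    def dump_all(max_page_id, chunk_size=500000, workers=8):
        starts = range(1, max_page_id + 1, chunk_size)
        ends = [min(s + chunk_size - 1, max_page_id) for s in starts]
        with ProcessPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(dump_chunk, starts, ends))

Each piece is still fully linear inside, so this only buys a constant factor and pushes the reassembly cost onto someone.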
Within each of those linear processes, any bottleneck will slow everything down, whether that's bzip2 or 7zip compression/decompression, fetching revisions from the wiki's complex storage systems, the XML parsing, or something in the middle.
What I'd recommend looking at is ways to actually rearrange the data so a) there's less work that needs to be done to create a new dump and b) most of that work can be done independently of other work that's going on, so it's highly scalable.
Ideally, anything that hasn't changed since the last dump shouldn't need *any* new data processing (right now it'll go through several stages of slurping from a DB, decompression and recompression, XML parsing and re-structuring, etc). A new dump should consist basically of running through appending new data and removing deleted data, without touching the things that haven't changed.
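A sketch of what that could look like (hypothetical helpers; the point is only that unchanged pages become a cheap copy rather than a re-parse):

    # Sketch of an incremental pass: pages whose latest revision hasn't changed
    # since the previous dump are byte-copied; only new/changed pages do real work.
    # prev_index, prev_dump, new_dump and fetch_page are hypothetical objects.

    def incremental_dump(current_pages, prev_index, prev_dump, new_dump, fetch_page):
        """current_pages: iterable of (page_id, latest_rev_id) from the database.
        prev_index: dict page_id -> (latest_rev_id, offset into previous dump)."""
        for page_id, latest_rev in current_pages:
            prev = prev_index.get(page_id)
            if prev is not None and prev[0] == latest_rev:
                new_dump.copy_from(prev_dump, prev[1])    # unchanged: no decompress/re-parse
            else:
                new_dump.write_page(fetch_page(page_id))  # new or changed: full processing
        # deleted pages simply never get copied into the new dump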
This may actually need a fancier structured data file format, or perhaps a sensible directory and subfile structure -- ideally one that's friendly to being updated via simple things like rsync.
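For instance, a purely hypothetical on-disk layout along those lines:

    dumps/enwiki/
        pages/000/000/123/          # pages bucketed by id; a bucket is rewritten
            revisions.xml.bz2       #   only when something in it actually changed
            text.7z
        index/page-index.db         # page id -> bucket, latest rev id
        MANIFEST                    # per-file checksums, so rsync (or a client)
                                    #   only transfers the buckets that changed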
-- brion