On Thu, Mar 24, 2011 at 1:05 PM, Yuvi Panda yuvipanda@gmail.com wrote:
Hi, I'm Yuvi, a student looking forward to working with MediaWiki via this year's GSoC.
I want to work on something dump related, and have been bugging apergos (Ariel) for a while now. One of the things that popped into my head is moving the dump process to another language (say, C# or Java, or being very macho, C++ or C). This would give the dump process quite a speed boost (the profiling I did[1] seems to indicate that the DB is not the bottleneck, though I might be wrong), and it could also be done in a way that makes running distributed dumps easier/more elegant.
So, thoughts on this? Is 'Move Dumping Process to another language' a good idea at all?
I'd worry a lot less about what languages are used than whether the process itself is scalable.
The current dump process (which I created in 2004-2005 when we had a LOT less data, and a LOT fewer computers) is very linear, which makes it awkward to scale up:
* pull a list of all page revisions, in page/rev order
* as they go through, pump page/rev data to a linear XML stream
* pull that linear XML stream back in again, as well as the last time's completed linear XML stream
* while going through those, combine the original page text from the last XML dump, or from the current database, and spit out a linear XML stream containing both page/rev data and rev text
* and also stick compression on the end
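To make that concrete, here's a rough sketch of that single linear pass (in Python for brevity; the real code is PHP, and helpers like last_dump_text and db_text are made up here):

    # Sketch of the single linear pass described above -- illustrative only.
    import bz2

    def dump_pass(revisions, last_dump_text, db_text, out_path):
        """revisions: iterable of rev dicts in page/rev order.
        last_dump_text: dict rev_id -> text from the previous dump.
        db_text: callable rev_id -> text, hitting the current database."""
        with bz2.open(out_path, "wt", encoding="utf-8") as out:
            out.write("<mediawiki>\n")
            for rev in revisions:                    # one long, strictly ordered pass
                text = last_dump_text.get(rev["id"])
                if text is None:                     # not in the old dump: fetch fresh
                    text = db_text(rev["id"])
                out.write("  <revision id='%d'>%s</revision>\n" % (rev["id"], text))
            out.write("</mediawiki>\n")

Every stage -- the revision fetch, the old-dump lookup, the XML writing, the compression -- sits on the same critical path.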
About the only way we can scale it beyond a couple of CPUs (compression/decompression as separate processes from the main PHP stream handler) is to break it into smaller linear pieces and either reassemble them, or require users to reassemble the pieces for linear processing.
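As an illustration of the "smaller linear pieces" idea (again only a sketch; splitting by page-id range is just one possible way to cut it):

    # Sketch: run the same linear pass over page-id ranges in parallel,
    # producing pieces that are later concatenated or consumed separately.
    # dump_chunk's body is elided; all names here are hypothetical.
    from concurrent.futures import ProcessPoolExecutor

    def dump_chunk(start_id, end_id):
        out_path = "pages-%09d-%09d.xml.bz2" % (start_id, end_id)
        # ... same linear pass as above, limited to pages in [start_id, end_id] ...
        return out_path

    def dump_all(max_page_id, chunk_size=500000, workers=8):
        starts = range(1, max_page_id + 1, chunk_size)
        ends = [min(s + chunk_size - 1, max_page_id) for s in starts]
        with ProcessPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(dump_chunk, starts, ends))

Each piece is still fully linear inside, so this only buys a constant factor and pushes the reassembly cost onto someone.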
Within each of those linear processes, any bottleneck will slow everything down, whether that's bzip2 or 7zip compression/decompression, fetching revisions from the wiki's complex storage systems, the XML parsing, or something in the middle.
What I'd recommend looking at is ways to actually rearrange the data so a) there's less work that needs to be done to create a new dump and b) most of that work can be done independently of other work that's going on, so it's highly scalable.
Ideally, anything that hasn't changed since the last dump shouldn't need *any* new data processing (right now it'll go through several stages of slurping from a DB, decompression and recompression, XML parsing and re-structuring, etc). A new dump should consist basically of running through appending new data and removing deleted data, without touching the things that haven't changed.
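A sketch of what that could look like (hypothetical helpers; the point is only that unchanged pages become a cheap copy rather than a re-parse):

    # Sketch of an incremental pass: pages whose latest revision hasn't changed
    # since the previous dump are byte-copied; only new/changed pages do real work.
    # prev_index, prev_dump, new_dump and fetch_page are hypothetical objects.

    def incremental_dump(current_pages, prev_index, prev_dump, new_dump, fetch_page):
        """current_pages: iterable of (page_id, latest_rev_id) from the database.
        prev_index: dict page_id -> (latest_rev_id, offset into previous dump)."""
        for page_id, latest_rev in current_pages:
            prev = prev_index.get(page_id)
            if prev is not None and prev[0] == latest_rev:
                new_dump.copy_from(prev_dump, prev[1])    # unchanged: no decompress/re-parse
            else:
                new_dump.write_page(fetch_page(page_id))  # new or changed: full processing
        # deleted pages simply never get copied into the new dump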
This may actually need a fancier structured data file format, or perhaps a sensible directory and subfile structure -- ideally one that's friendly to being updated via simple things like rsync.
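For instance, a purely hypothetical on-disk layout along those lines:

    dumps/enwiki/
        pages/000/000/123/          # pages bucketed by id; a bucket is rewritten
            revisions.xml.bz2       #   only when something in it actually changed
            text.7z
        index/page-index.db         # page id -> bucket, latest rev id
        MANIFEST                    # per-file checksums, so rsync (or a client)
                                    #   only transfers the buckets that changed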
-- brion