On Thursday, 24-03-2011, at 20:29 -0400, James Linden wrote:
So, thoughts on this? Is 'Move Dumping Process to another language' a good idea at all?
I'd worry a lot less about what languages are used than whether the process itself is scalable.
I'm not a mediawiki / wikipedia developer, but as a developer / sys admin, I'd think that adding another environment stack requirement (in the case of C# or Java) to the overall architecture would be a bad idea in general.
The current dump process (which I created in 2004-2005 when we had a LOT less data, and a LOT fewer computers) is very linear, which makes it awkward to scale up:
- pull a list of all page revisions, in page/rev order
- as they go through, pump page/rev data to a linear XML stream
- pull that linear XML stream back in again, as well as the last time's completed linear XML stream
- while going through those, combine the original page text from the last XML dump, or from the current database, and spit out a linear XML stream containing both page/rev data and rev text
- and also stick compression on the end
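To make the shape of that pipeline concrete, here's a rough Python sketch of the flow; the data sources, field names, and XML details are stand-ins, not the real PHP dump code:

    import bz2

    def dump_stubs(revisions, stubs_path):
        # Pass 1: write page/rev metadata only, in page/rev order.
        with bz2.open(stubs_path, "wt") as out:
            for rev in revisions:              # e.g. rows pulled from the revision table
                out.write("<rev page_id='%d' rev_id='%d' />\n"
                          % (rev["page_id"], rev["rev_id"]))

    def dump_text(stub_revs, prev_texts, fetch_from_db, full_path):
        # Pass 2: re-read the stubs, attach each revision's text from the
        # previous dump if we have it, otherwise from the database, recompress.
        with bz2.open(full_path, "wt") as out:
            for rev in stub_revs:
                text = prev_texts.get(rev["rev_id"])      # "prefetch" from the old dump
                if text is None:
                    text = fetch_from_db(rev["rev_id"])   # fall back to the DB
                out.write("<rev id='%d'>%s</rev>\n" % (rev["rev_id"], text))

Each stage consumes and produces one long stream, start to finish.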
About the only way we can scale it beyond a couple of CPUs (compression/decompression as separate processes from the main PHP stream handler) is to break it into smaller linear pieces and either reassemble them, or require users to reassemble the pieces for linear processing.
TBH I don't think users would have to reassemble the pieces; they might be annoyed at having 400 little (or not so little) files lying around, but any processing they meant to do could, I would think, easily be wrapped in a loop that feeds in each piece in order as input.
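Something like this, say, with made-up file names (the user's own processing goes where handle() is):

    import bz2
    import glob

    def handle(line):
        pass   # stand-in for whatever per-line or per-page work the user does

    # Feed the pieces in order as if they were one big dump; the naming and
    # numbering scheme here is hypothetical.
    for piece in sorted(glob.glob("enwiki-pages-meta-history-part*.xml.bz2")):
        with bz2.open(piece, "rt") as stream:
            for line in stream:
                handle(line)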
Within each of those linear processes, any bottleneck will slow everything down whether that's bzip2 or 7zip compression/decompression, fetching revisions from the wiki's complex storage systems, the XML parsing, or something in the middle.
What I'd recommend looking at is ways to actually rearrange the data so a) there's less work that needs to be done to create a new dump and b) most of that work can be done independently of other work that's going on, so it's highly scalable.
Ideally, anything that hasn't changed since the last dump shouldn't need *any* new data processing (right now it'll go through several stages of slurping from a DB, decompression and recompression, XML parsing and re-structuring, etc). A new dump should consist basically of running through appending new data and removing deleted data, without touching the things that haven't changed.
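As a very rough illustration of what "without touching the things that haven't changed" could mean in practice (the manifest idea below is invented, we have nothing like it today):

    # Decide which pages actually need new processing, given a record of what
    # the previous dump contains.  Everything else would be carried over as is.
    def pages_to_redo(prev_manifest, current_state):
        # Both arguments: dict of page_id -> latest rev_id.
        redo = []
        for page_id, rev_id in current_state.items():
            if prev_manifest.get(page_id) != rev_id:
                redo.append(page_id)        # new page, or new revisions since last time
        return redo

    # Pages in prev_manifest but missing from current_state were deleted and
    # just need to be dropped, again without reprocessing anything else.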
One assumption here is that there is a previous dump to work from; that's not always true, and we should be able to run a dump "from scratch" without it needing to take 3 months for en wiki.
A second assumption is that the previous dump data is sound; we've also seen that fail to be true. This means that we need to be able to check the contents against the database contents in some fashion. Currently we look at revision length for each revision, but that's not foolproof (and it's also still too slow).
However, if verification meant just that, verification instead of rewriting a new file with the additional costs that compression imposes on us, we would see some gains immediately.
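For instance, something along these lines: checking a whole batch of revision lengths against the database in one query instead of one lookup per revision. The connection handling and column names here are illustrative, not how the dump code actually talks to the DB:

    # dump_lengths: dict of rev_id -> byte length recorded in the existing dump piece.
    # conn: any DB-API style connection.
    def verify_batch(conn, dump_lengths):
        ids = list(dump_lengths)
        if not ids:
            return []
        cur = conn.cursor()
        cur.execute(
            "SELECT rev_id, rev_len FROM revision WHERE rev_id IN (%s)"
            % ",".join(["%s"] * len(ids)),
            ids,
        )
        # Revisions whose stored length doesn't match what the dump claims.
        return [rev_id for rev_id, rev_len in cur.fetchall()
                if dump_lengths[rev_id] != rev_len]

If a piece passes, it gets carried forward as is: we read it once but never rewrite or recompress it.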
This may actually need a fancier structured data file format, or perhaps a sensible directory structure and subfile structure -- ideally one that's friendly to being updated via simple things like rsync.
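For example, purely as a strawman: a tree keyed by namespace and page-id range, where an incremental run only rewrites the subfiles whose pages changed and rsync picks up just those.

    # Map a page to the subfile it lives in, so unchanged subfiles keep the same
    # path and content from run to run.  Bucket size and naming are made up.
    BUCKET = 100000   # pages per subfile

    def subfile_path(namespace, page_id):
        lo = (page_id // BUCKET) * BUCKET
        return "ns%d/pages-%09d-%09d.xml.bz2" % (namespace, lo, lo + BUCKET - 1)

    # subfile_path(0, 1234567) -> 'ns0/pages-001200000-001299999.xml.bz2'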
I'm probably stating the obvious here...
Breaking the dump up by article namespace might be a starting point -- have 1 controller process for each namespace. That leaves 85% of the work in the default namespace, which could then be segmented by any combination of factors, maybe as simple as block batches of X number of articles.
We already have the mechanism for running batches of arbitrary numbers of articles. That's what the en history dumps do now.
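Roughly, the batching amounts to carving the page id space into chunks and handing each chunk to a worker; the chunk size and the wrapper below are made up, not the actual job runner:

    # Carve the page id space into fixed-size chunks, one dump job per chunk.
    def page_ranges(max_page_id, chunk=500000):
        start = 1
        while start <= max_page_id:
            end = min(start + chunk - 1, max_page_id)
            yield (start, end)
            start = end + 1

    # for lo, hi in page_ranges(max_page_id):
    #     run_dump_job(lo, hi)   # hypothetical wrapper around the existing scripts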
What we don't have is:
* a way to run easily over multiple hosts
* a way to recombine small pieces into larger files for download that isn't serial, *or* alternatively a format that relies on multiple small pieces so we can skip recombining
* a way to check previous content for integrity *quickly* before folding it into the current dumps (we check each revision separately, much too slow)
* a way to "fold previous content into the current dumps" that consists of making a straight copy of what's on disk with no processing. (What do we do if something has been deleted or moved, or is corrupt? The existing format isn't friendly to those cases.)
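One strawman for the last two points (nothing like this exists today): keep a small manifest next to each piece recording what it covers and a checksum, so the next run can decide per piece whether a straight copy is enough.

    import hashlib
    import json
    import os

    # Strawman per-piece manifest: enough to decide "copy forward" vs "redo"
    # without decompressing the piece itself.
    def write_manifest(piece_path, first_page, last_page, max_rev_id):
        sha1 = hashlib.sha1()
        with open(piece_path, "rb") as f:
            for block in iter(lambda: f.read(1 << 20), b""):
                sha1.update(block)
        manifest = {
            "piece": os.path.basename(piece_path),
            "first_page": first_page,
            "last_page": last_page,
            "max_rev_id": max_rev_id,
            "sha1": sha1.hexdigest(),
        }
        with open(piece_path + ".json", "w") as f:
            json.dump(manifest, f)

A piece whose checksum still matches and whose page range has seen no new, moved, or deleted pages could then be copied (or hardlinked) into the new dump untouched.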
When I'm importing the XML dump to MySQL, I have one process that reads the XML file, and X processes (10 usually) working in parallel to parse each article block on a first-available queue system. My current implementation is a bit cumbersome, but maybe the idea could be used for building the dump as well?
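I don't know the details of your importer, but the general shape you describe, one reader feeding N parsers through a first-available queue, would look something like this (the parse-and-insert step is a stub):

    import multiprocessing as mp

    def parse_and_insert(block):
        pass                                   # stand-in for the real XML parse + SQL insert

    def worker(q):
        # Each worker takes the next available <page> block, first come first served.
        for block in iter(q.get, None):
            parse_and_insert(block)

    def run(page_blocks, nworkers=10):
        q = mp.Queue(maxsize=100)              # bounded, so the reader can't race too far ahead
        procs = [mp.Process(target=worker, args=(q,)) for _ in range(nworkers)]
        for p in procs:
            p.start()
        for block in page_blocks:              # the single process reading the XML file
            q.put(block)
        for _ in procs:
            q.put(None)                        # one sentinel per worker to shut it down
        for p in procs:
            p.join()

The same pattern could in principle drive dump generation too, with the queue carrying page-id ranges instead of parsed blocks.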
In general, I'm interested in pitching in some effort on anything related to the dump/import processes.
Glad to hear it! Drop by irc please, I'm in the usual channels. :-)
Ariel