On 25 March 2011 18:21, Ariel T. Glenn <ariel(a)wikimedia.org> wrote:
On 24-03-2011, Thursday, at 20:29 -0400, James Linden wrote:
So, thoughts on this? Is 'Move Dumping Process to another language' a
good idea at all?
I'd worry a lot less about what languages are used than whether the process
itself is scalable.
I'm not a mediawiki / wikipedia developer, but as a developer / sys
admin, I'd think that adding another environment stack requirement (in
the case of C# or Java) to the overall architecture would be a bad
idea in general.
> The current dump process (which I created in 2004-2005 when we had a LOT
> less data, and a LOT fewer computers) is very linear, which makes it awkward
> to scale up:
>
> * pull a list of all page revisions, in page/rev order
> * as they go through, pump page/rev data to a linear XML stream
> * pull that linear XML stream back in again, as well as the last time's
> completed linear XML stream
> * while going through those, combine the original page text from the last
> XML dump, or from the current database, and spit out a linear XML stream
> containing both page/rev data and rev text
> * and also stick compression on the end
>
> About the only way we can scale it beyond a couple of CPUs
> (compression/decompression as separate processes from the main PHP stream
> handler) is to break it into smaller linear pieces and either reassemble
> them, or require users to reassemble the pieces for linear processing.
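The stub-then-text merge described in those steps could be sketched roughly like this. This is a minimal illustration, not the actual PHP dump code; `merge_text`, `prev_texts`, and `fetch_from_db` are hypothetical names standing in for the real stub stream, the previous dump's text store, and the wiki's storage backend.

```python
def merge_text(stubs, prev_texts, fetch_from_db):
    """Combine a stub stream of (page_id, rev_id) metadata with revision
    text, preferring text already present in the previous dump and only
    falling back to the database for revisions the old dump lacks."""
    for page_id, rev_id in stubs:
        text = prev_texts.get(rev_id)
        if text is None:
            # only new (or missing) revisions cost a database fetch
            text = fetch_from_db(rev_id)
        yield page_id, rev_id, text
```

The point of the structure is visible even in the sketch: the expensive path (the DB fetch) is only taken for revisions that changed since the last dump.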
TBH, I don't think users would have to reassemble the pieces; they
might be annoyed at having 400 little (or not so little) files lying
around, but any processing they meant to do could, I would think, easily
be wrapped in a loop that tossed in each piece in order as input.
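That loop really is about this small. A sketch, assuming the pieces are bzip2-compressed and sort correctly by filename (the pattern and function name here are made up for illustration):

```python
import bz2
import glob

def stream_pieces(pattern):
    """Yield lines from every dump piece in filename order, so 400 small
    files read like one linear XML stream to downstream processing."""
    for path in sorted(glob.glob(pattern)):
        with bz2.open(path, "rt", encoding="utf-8") as piece:
            yield from piece
```

Any existing line-oriented consumer can take `stream_pieces("enwiki-part*.xml.bz2")` in place of a single file handle.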
> Within each of those linear processes, any bottleneck will slow everything
> down whether that's bzip2 or 7zip compression/decompression, fetching
> revisions from the wiki's complex storage systems, the XML parsing, or
> something in the middle.
>
> What I'd recommend looking at is ways to actually rearrange the data so a)
> there's less work that needs to be done to create a new dump and b) most of
> that work can be done independently of other work that's going on, so it's
> highly scalable.
>
> Ideally, anything that hasn't changed since the last dump shouldn't need
> *any* new data processing (right now it'll go through several stages of
> slurping from a DB, decompression and recompression, XML parsing and
> re-structuring, etc). A new dump should consist basically of running through
> appending new data and removing deleted data, without touching the things
> that haven't changed.
One assumption here is that there is a previous dump to work from;
that's not always true, and we should be able to run a dump "from
scratch" without it needing to take 3 months for en wiki.
A second assumption is that the previous dump data is sound; we've also
seen that fail to be true. This means that we need to be able to check
the contents against the database contents in some fashion. Currently
we look at revision length for each revision, but that's not foolproof
(and it's also still too slow).
However, if verification meant just that, verification instead of
rewriting a new file with the additional costs that compression imposes
on us, we would see some gains immediately.
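One cheap way to verify without rewriting would be to compare a single digest per batch against the same digest computed from database metadata, rather than checking every revision individually. This is only a sketch of the idea; the pair format fed into the digest is an assumption, not the current checker's behaviour.

```python
import hashlib

def batch_digest(revisions):
    """Fold a batch of (rev_id, rev_sha1) pairs into one digest.
    Comparing one value per batch between dump and database is far
    cheaper than a per-revision length check."""
    h = hashlib.sha1()
    for rev_id, rev_sha1 in revisions:
        h.update(f"{rev_id}:{rev_sha1};".encode("utf-8"))
    return h.hexdigest()
```

If the digests match, the whole batch can be folded into the new dump untouched; only mismatching batches need per-revision inspection.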
This may
actually need a fancier structured data file format, or perhaps a
sensible directory structure and subfile structure -- ideally one that's
friendly to being updated via simple things like rsync.
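An rsync-friendly layout mostly means stable paths: a page that hasn't changed should live at the same path in every dump run, so the delta transfer skips it. A hypothetical sharding scheme (names and shard size invented for illustration):

```python
def piece_path(page_id, shard_size=1000):
    """Map a page to a stable file location. Unchanged pages keep the
    same path across dump runs, so rsync only transfers changed pieces."""
    shard = page_id // shard_size
    return f"pages/{shard:06d}/{page_id:09d}.xml.bz2"
```

Keeping roughly a thousand files per directory also keeps directory listings manageable for both filesystems and mirrors.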
I'm probably stating the obvious here...
Breaking the dump up by article namespace might be a starting point --
have 1 controller process for each namespace. That leaves 85% of the
work in the default namespace, which could then be segmented by any
combination of factors, maybe as simple as block batches of X number
of articles.
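The "block batches of X articles" segmentation is the simplest possible splitter. A sketch (function name is illustrative):

```python
def batch_pages(page_ids, batch_size):
    """Split a namespace's page list into fixed-size blocks that
    independent dump workers can process in parallel."""
    for start in range(0, len(page_ids), batch_size):
        yield page_ids[start:start + batch_size]
```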
We already have the mechanism for running batches of arbitrary numbers
of articles. That's what the en history dumps do now.
What we don't have is:
* a way to run easily over multiple hosts
* a way to recombine small pieces into larger files for download that
isn't serial, *or* alternatively a format that relies on multiple small
pieces so we can skip recombining
* a way to check previous content for integrity *quickly* before folding
it into the current dumps (we check each revision separately, much too
slow)
* a way to "fold previous content into the current dumps" that consists
of making a straight copy of what's on disk with no processing. (What
do we do if something has been deleted or moved, or is corrupt? The
existing format isn't friendly to those cases.)
When I'm importing the XML dump to MySQL, I
have one process that
reads the XML file, and X processes (10 usually) working in parallel
to parse each article block on a first-available queue system. My
current implementation is a bit cumbersome, but maybe the idea could
be used for building the dump as well?
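The first-available queue pattern described there looks roughly like this in outline. This is not the poster's implementation, just a sketch; threads are used here to keep the example self-contained, though CPU-bound XML parsing would want worker processes instead, and `parse_block` is a stand-in for real parsing.

```python
from concurrent.futures import ThreadPoolExecutor

def parse_block(block):
    # stand-in for real XML parsing: count <revision> open tags
    return block.count("<revision>")

def parallel_parse(blocks, workers=10):
    """One reader feeds article blocks to a pool of workers; each block
    goes to the first worker that becomes free."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(parse_block, blocks))
```

The same shape would apply in reverse for building a dump: one writer draining a results queue that many fetch/serialize workers feed.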
In general, I'm interested in pitching in some effort on anything
related to the dump/import processes.
Glad to hear it! Drop by irc please, I'm in the usual channels. :-)
Just a thought: wouldn't it be easier to generate dumps in parallel if
we did away with the assumption that the dump would be in database
order? The metadata in the dump provides the ordering info for the
people who require it.
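Consumers who do need database order could restore it from that metadata in one pass. A minimal sketch, assuming each record carries its page and revision IDs (the dict keys here are hypothetical):

```python
def restore_page_order(records):
    """Records arrive in whatever order parallel writers produced them;
    the (page_id, rev_id) metadata in each record is enough to restore
    database order for consumers that need it."""
    return sorted(records, key=lambda r: (r["page_id"], r["rev_id"]))
```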
Andrew Dunbar (hippietrail)
Ariel
--------------------------------------
James Linden
kodekrash(a)gmail.com
--------------------------------------
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l