On 25 March 2011 18:21, Ariel T. Glenn ariel@wikimedia.org wrote:
On 24-03-2011, Thu, at 20:29 -0400, James Linden wrote:
So, thoughts on this? Is 'Move Dumping Process to another language' a good idea at all?
I'd worry a lot less about what languages are used than whether the process itself is scalable.
I'm not a mediawiki / wikipedia developer, but as a developer / sys admin, I'd think that adding another environment stack requirement (in the case of C# or Java) to the overall architecture would be a bad idea in general.
The current dump process (which I created in 2004-2005 when we had a LOT less data, and a LOT fewer computers) is very linear, which makes it awkward to scale up:
- pull a list of all page revisions, in page/rev order
  * as they go through, pump page/rev data to a linear XML stream
- pull that linear XML stream back in again, as well as the last time's completed linear XML stream
  * while going through those, combine the original page text from the last XML dump, or from the current database, and spit out a linear XML stream containing both page/rev data and rev text
  * and also stick compression on the end
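In pseudo-Python, the second pass is roughly this shape (a simplified sketch, not the actual PHP code; the element names are made up and the real schema has many more fields):

import bz2
from xml.sax.saxutils import escape

def merge_pass(stub_pages, prev_texts, fetch_from_db, out_path):
    # stub_pages: iterable of (title, [rev_id, ...]) in page/rev order
    # prev_texts: dict of rev_id -> text recovered from the previous dump
    # fetch_from_db: callback for revisions the previous dump doesn't have
    with bz2.open(out_path, "wt", encoding="utf-8") as out:
        out.write("<mediawiki>\n")
        for title, rev_ids in stub_pages:
            out.write("  <page><title>%s</title>\n" % escape(title))
            for rev_id in rev_ids:
                text = prev_texts.get(rev_id)
                if text is None:
                    text = fetch_from_db(rev_id)   # slow path: hit the database
                out.write("    <revision><id>%d</id><text>%s</text></revision>\n"
                          % (rev_id, escape(text)))
            out.write("  </page>\n")
        out.write("</mediawiki>\n")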
About the only way we can scale it beyond a couple of CPUs (compression/decompression as separate processes from the main PHP stream handler) is to break it into smaller linear pieces and either reassemble them, or require users to reassemble the pieces for linear processing.
TBH I don't think users would have to reassemble the pieces; they might be annoyed at having 400 little (or not so little) files lying around, but any processing they meant to do could, I would think, easily be wrapped in a loop that tossed in each piece in order as input.
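Something like the following would do it (a Python sketch; the piece naming is hypothetical, sort however the real files are numbered):

import bz2, glob

def handle(line):
    pass  # whatever per-line / per-page processing the user meant to do

# Treat the pieces of a split dump as one logical stream by walking them
# in order.
for piece in sorted(glob.glob("enwiki-pages-meta-history-part*.xml.bz2")):
    with bz2.open(piece, "rt", encoding="utf-8") as stream:
        for line in stream:
            handle(line)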
Within each of those linear processes, any bottleneck will slow everything down, whether that's bzip2 or 7zip compression/decompression, fetching revisions from the wiki's complex storage systems, the XML parsing, or something in the middle.
What I'd recommend looking at is ways to actually rearrange the data so a) there's less work that needs to be done to create a new dump and b) most of that work can be done independently of other work that's going on, so it's highly scalable.
Ideally, anything that hasn't changed since the last dump shouldn't need *any* new data processing (right now it'll go through several stages of slurping from a DB, decompression and recompression, XML parsing and re-structuring, etc). A new dump should consist basically of running through appending new data and removing deleted data, without touching the things that haven't changed.
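To make that concrete, here is a rough sketch, assuming the dump were stored as per-page-range chunks and we had a list of pages changed since the last run (neither of which exists today):

import shutil

def refresh_chunk(chunk_path, chunk_page_ids, changed_page_ids, rebuild, out_dir):
    """If none of the pages in this chunk changed since the last dump, copy it
    straight through; otherwise regenerate just this chunk from the database.
    changed_page_ids is assumed to be a set of page ids touched since last time."""
    if changed_page_ids.isdisjoint(chunk_page_ids):
        shutil.copy(chunk_path, out_dir)   # no DB reads, no recompression
    else:
        rebuild(chunk_path, out_dir)       # only the changed chunks pay the full cost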
One assumption here is that there is a previous dump to work from; that's not always true, and we should be able to run a dump "from scratch" without it needing to take 3 months for en wiki.
A second assumption is that the previous dump data is sound; we've also seen that fail to be true. This means that we need to be able to check the contents against the database contents in some fashion. Currently we look at revision length for each revision, but that's not foolproof (and it's also still too slow).
However, if verification meant just that, verification rather than rewriting a new file with the additional costs that compression imposes on us, we would see some gains immediately.
This may actually need a fancier structured data file format, or perhaps a sensible directory structure and subfile structure -- ideally one that's friendly to being updated via simple things like rsync.
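For example, something along these lines (a purely hypothetical layout, just to illustrate the idea: many small fixed-range files whose path and bytes stay identical when their pages don't change, so rsync can skip them entirely):

def chunk_path(dump_date, namespace, page_id, pages_per_chunk=500):
    """Hypothetical rsync-friendly layout: lots of small fixed-range files."""
    lo = (page_id - 1) // pages_per_chunk * pages_per_chunk + 1
    hi = lo + pages_per_chunk - 1
    return "enwiki/%s/ns%d/pages_%09d-%09d.xml.bz2" % (dump_date, namespace, lo, hi)

# chunk_path("20110301", 0, 1234)
#   -> 'enwiki/20110301/ns0/pages_000001001-000001500.xml.bz2'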
I'm probably stating the obvious here...
Breaking the dump up by article namespace might be a starting point -- have 1 controller process for each namespace. That leaves 85% of the work in the default namespace, which could then be segmented by any combination of factors, maybe as simple as block batches of X number of articles.
We already have the mechanism for running batches of arbitrary numbers of articles. That's what the en history dumps do now.
What we don't have is:
- a way to run easily over multiple hosts
- a way to recombine small pieces into larger files for download that isn't serial, *or* alternatively a format that relies on multiple small pieces so we can skip recombining
- a way to check previous content for integrity *quickly* before folding it into the current dumps (we check each revision separately, much too slow; one possible shape for a chunk-level check is sketched below)
- a way to "fold previous content into the current dumps" that consists of making a straight copy of what's on disk with no processing. (What do we do if something has been deleted or moved, or is corrupt? The existing format isn't friendly to those cases.)
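For the quick integrity check, one possible shape is chunk-level checksums recorded at dump time (sketch only; the manifest format is invented here, and this catches on-disk corruption but not deletions or moves in the database, which would still need a separate pass):

import hashlib, json

def verify_chunks(manifest_path):
    """Check previous dump chunks against recorded sha1s instead of walking
    every revision.  Assumes a JSON manifest of chunk filename -> expected
    sha1, written when the chunk was built."""
    with open(manifest_path) as f:
        expected = json.load(f)
    bad = []
    for name, want in expected.items():
        sha = hashlib.sha1()
        with open(name, "rb") as chunk:
            for block in iter(lambda: chunk.read(1 << 20), b""):
                sha.update(block)
        if sha.hexdigest() != want:
            bad.append(name)
    return bad   # only these chunks need to be regenerated from the database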
When I'm importing the XML dump to MySQL, I have one process that reads the XML file, and X processes (10 usually) working in parallel to parse each article block on a first-available queue system. My current implementation is a bit cumbersome, but maybe the idea could be used for building the dump as well?
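For reference, the shape of that reader/worker split looks roughly like this in Python (a stripped-down sketch of the first-available queue idea, not the actual implementation):

import multiprocessing as mp

def parse_and_insert(page_xml):
    pass  # placeholder for the real parsing + MySQL insert

def worker(queue):
    # Each worker grabs the next available <page> block, first come first served.
    while True:
        page_xml = queue.get()
        if page_xml is None:      # sentinel: reader is done
            break
        parse_and_insert(page_xml)

def run(page_blocks, num_workers=10):
    queue = mp.Queue(maxsize=100)   # bounded so the reader can't run too far ahead
    workers = [mp.Process(target=worker, args=(queue,)) for _ in range(num_workers)]
    for w in workers:
        w.start()
    for block in page_blocks:       # one process walks the XML file and feeds the queue
        queue.put(block)
    for _ in workers:
        queue.put(None)
    for w in workers:
        w.join()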
In general, I'm interested in pitching in some effort on anything related to the dump/import processes.
Glad to hear it! Drop by irc please, I'm in the usual channels. :-)
Just a thought: wouldn't it be easier to generate dumps in parallel if we did away with the assumption that the dump would be in database order? The metadata in the dump provides the ordering info for the people who require it.
Andrew Dunbar (hippietrail)
Ariel
James Linden kodekrash@gmail.com
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l