On Fri, 25-03-2011, at 21:49 +0100, Platonides wrote:
Andrew Dunbar wrote:
Just a thought: wouldn't it be easier to generate dumps in parallel if we did away with the assumption that the dump would be in database order? The metadata in the dump provides the ordering info for the people who require it.
Andrew Dunbar (hippietrail)
I don't see how doing the dumps in a different order gives you greater parallelism. You can already launch several processes at different points of the set. Handing each process one of every N articles would give more balanced pieces, but that's not important. You would also skip the work of reading the old dump up to the offset, although that's reasonably fast. The important point of keeping this order is that pages stay in the same order as in the previous dump.
I'm pretty sure there are a lot of folks out there who, like me, have tools which rely on exactly this property (new/changed stuff shows up at the end).
Amusingly, splitting based on some number of articles doesn't really balance out the pieces, at least for history dumps, once the project has been around long enough with enough activity. Splitting by number of revisions is what we really want, and the older pages have far more revisions than the newer ones.
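To make that concrete, here's a rough sketch (Python; the (page_id, rev_count) input and the numbers are made up for illustration, not taken from the dump code) of cutting page ranges by cumulative revision count instead of by a fixed number of pages per chunk:

    # Rough sketch: group pages into ranges of roughly equal total revision
    # count, instead of a fixed number of pages per range.
    # revision_counts is assumed to be a list of (page_id, rev_count) pairs
    # in page-id order, e.g. pulled from the page/revision tables.
    def chunk_by_revisions(revision_counts, target_revs_per_chunk):
        chunks = []            # list of (first_page_id, last_page_id) ranges
        start_id = None
        last_id = None
        revs_in_chunk = 0
        for page_id, rev_count in revision_counts:
            if start_id is None:
                start_id = page_id
            revs_in_chunk += rev_count
            last_id = page_id
            if revs_in_chunk >= target_revs_per_chunk:
                chunks.append((start_id, last_id))
                start_id = None
                revs_in_chunk = 0
        if start_id is not None:
            chunks.append((start_id, last_id))
        return chunks

    # Older pages carry most of the revisions, so they end up in narrow page
    # ranges while newer pages get lumped into wide ones.
    counts = [(1, 5000), (2, 3000), (3, 200), (4, 150), (5, 100), (6, 80)]
    print(chunk_by_revisions(counts, 3000))
    # -> [(1, 1), (2, 2), (3, 6)]

Each range then represents roughly the same amount of dump work, which is what you want when handing ranges to parallel worker processes.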
Ariel