Ariel T. Glenn wrote:
Amusingly, splitting based on some number of articles doesn't really balance out the pieces, at least for history dumps, after the project has been around long enough with enough activity. Splitting by number of revisions is what we really want, and the older pages have many many more revs than later pages.
Right. That would only work for pages-articles, not for pages-history. But splitting a page's revisions across different files makes no sense. You could however get an approximation if, instead of giving out pages in strict order, they are given to the workers as soon as the workers are ready. Workers with pages holding many revisions will take longer, while those with few revisions will come back again shortly. I think it would correlate quite well with the number of revisions. You would be balancing the time needed across workers (which is what we really care about).
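To make the idea concrete, here is a rough simulation in Python, not anything taken from the actual dump scripts. It compares handing each worker a fixed contiguous block of pages with handing the next page to whichever worker frees up first. The function names and the revision counts are made up for illustration, and it assumes the time spent on a page is proportional to its number of revisions.

```python
import heapq

def static_split(rev_counts, n_workers):
    """Give each worker a contiguous block with the same number of pages.

    Returns the total revisions (a stand-in for time) handled per worker.
    """
    chunk = -(-len(rev_counts) // n_workers)  # ceiling division
    return [sum(rev_counts[i * chunk:(i + 1) * chunk])
            for i in range(n_workers)]

def dynamic_dispatch(rev_counts, n_workers):
    """Hand the next page to whichever worker becomes free first.

    Assumption: processing time is proportional to the revision count.
    """
    # Heap of (time at which the worker becomes free, worker id).
    free_at = [(0, w) for w in range(n_workers)]
    heapq.heapify(free_at)
    loads = [0] * n_workers
    for revs in rev_counts:
        t, w = heapq.heappop(free_at)
        loads[w] += revs
        heapq.heappush(free_at, (t + revs, w))
    return loads

# Hypothetical history dump: old pages first, with many more revisions.
rev_counts = [1000, 800, 600, 400, 50, 40, 30, 20, 10, 5, 5, 5]

if __name__ == "__main__":
    print("static :", static_split(rev_counts, 3))
    print("dynamic:", dynamic_dispatch(rev_counts, 3))
```

With these made-up numbers the static split leaves one worker with nearly all the revisions, while the dynamic version ends up with roughly equal loads without anyone needing to know the revision counts in advance.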