Ariel T. Glenn wrote:
Amusingly, splitting based on some number of articles doesn't really
balance out the pieces, at least for history dumps, after the project
has been around long enough with enough activity. Splitting by number
of revisions is what we really want, and the older pages have many many
more revs than later pages.
Right. That would only work for pages-articles, not for pages-history.
But splitting the revisions across different files makes no sense.
You could however get an approximation if instead of giving out pages in
strict order, they are given to the workers as soon as they are ready.
Workers with pages holding many revisions will take longer, while those
with few will come back again shortly. I think it would correlate quite well
with the number of revisions. You would be balancing the time needed
across workers (which is what we really care about).
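
To make the idea concrete, here is a rough Python sketch, not anything
taken from the dump scripts. It compares a fixed split by page count with
handing out pages in order to whichever worker is free first. The function
names and the revision counts are made up for illustration, and it assumes
the time to process a page is proportional to its number of revisions.

```python
import heapq

def static_split(revs, workers):
    """Contiguous chunks with equal page counts; returns total revisions per chunk."""
    size = -(-len(revs) // workers)  # ceiling division
    return [sum(revs[i:i + size]) for i in range(0, len(revs), size)]

def dynamic_dispatch(revs, workers):
    """Give the next page, in order, to whichever worker becomes free first."""
    # Heap of (time at which the worker is free, worker id).
    free_at = [(0, w) for w in range(workers)]
    heapq.heapify(free_at)
    load = [0] * workers
    for r in revs:
        t, w = heapq.heappop(free_at)
        load[w] += r
        # Assumption: processing time is proportional to the revision count.
        heapq.heappush(free_at, (t + r, w))
    return load

# Hypothetical revision counts: older pages first, with many more revisions.
revs = [900, 700, 500, 300, 40, 30, 20, 10, 5, 5, 3, 2]
static_loads = static_split(revs, 3)
dynamic_loads = dynamic_dispatch(revs, 3)
print("static: ", static_loads)
print("dynamic:", dynamic_loads)
```

With these made-up numbers the busiest worker handles 2400 revisions under
the fixed split and 900 when pages are handed out on demand, without the
dispatcher ever needing to know the revision counts in advance.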