It is good that we will have new disks and it likely won't get stuck; but that doesn't address the primary problem of the length of time these things take. Let me try to be more constructive.
The first thing is that the projects are hugely different in size. This causes a fundamental queuing problem: with n threads and more than n huge tasks in the queue, the threads will all end up working on those. (We recently saw a number of days in which it was working on enwiki, frwiki, dewiki, and jawiki and nothing else.) This can be fixed with a thread that is restricted to smaller tasks, like the express lane at a market or a bank. (My bank has one teller only for deposits and withdrawals in 500s and 1000s notes, no other transactions.)
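To sketch the express-lane idea (a minimal Python sketch; the size threshold, thread counts, and job shape are made up, and the real dump scheduler obviously differs):

    import queue
    import threading

    SMALL_JOB_LIMIT = 3600          # hypothetical cutoff: "small" = under an hour

    general_q = queue.Queue()       # huge jobs (enwiki, dewiki, ...) land here
    express_q = queue.Queue()       # only small jobs are ever put here

    def submit(size_estimate, work_fn):
        if size_estimate <= SMALL_JOB_LIMIT:
            express_q.put(work_fn)  # small jobs take the express lane
        else:
            general_q.put(work_fn)  # huge jobs can only occupy general workers

    def general_worker():
        while True:
            try:                    # drain small jobs first so they never wait long
                job = express_q.get_nowait()
            except queue.Empty:
                job = general_q.get()
            job()

    def express_worker():
        # Never touches the general queue, so n huge tasks
        # can never tie up every thread at once.
        while True:
            express_q.get()()

    workers = [threading.Thread(target=general_worker, daemon=True) for _ in range(3)]
    workers.append(threading.Thread(target=express_worker, daemon=True))
    for w in workers:
        w.start()

However it is actually done, the point is simply that at least one thread can never be occupied by an enwiki-sized task.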
However, there are other problems and opportunities. Each project does a number of minor tasks, and then four larger ones:
* main articles, current versions
* all pages, current
* all-history of all pages, bz2 compressed
* all-history, re-compressed in 7z
For enwiki (7/14 numbers), main articles took 10 hours 30 min, and all pages 16 hours 10 min. The all-history bz2 was estimated at 67 days when it got stuck (it would have been shorter, as that estimate was made right at the start). For jawiki (7/24): main articles 48 min, all pages 65 min, all-history bz2 3 days 18 hours, 7z 2 days 12 hours.
Some observations then:
* the main articles dump is a subset of all pages. The latter might usefully be only a dump of all the pages *not* in the first.
* alternatively, the process could dump the first, then copy the file and continue with the others for the second (yes, one has to be careful with the bz2 compression state)
* or it could write both at the same time, saving the DB access time if not the compression time (see the sketch after this list)
* the all-history dump might be only the 7z. Yes, it takes longer than the bz2, but going direct to 7z takes much less total time.
* alternatively, write both bz2 and 7z at the same time (if we must have the bz2, but I don't see why; methinks anyone would want the 7z)
* make the all-history dump(s) separate tasks, in separate queue(s); without them the rest will go very well
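To make the "write both at the same time" point concrete, here is a rough sketch of a single pass feeding two compressed streams (the page iterator, to_xml(), and the namespace test are placeholders, not the real dump code):

    import bz2

    def dump_both(pages, articles_path, all_pages_path):
        # One trip through the DB, two output files.
        with bz2.open(articles_path, "wt", encoding="utf-8") as articles, \
             bz2.open(all_pages_path, "wt", encoding="utf-8") as all_pages:
            for page in pages:
                xml = page.to_xml()        # hypothetical serializer
                all_pages.write(xml)       # every page goes into the full dump
                if page.namespace == 0:    # main-namespace articles only...
                    articles.write(xml)    # ...also land in the smaller dump

The compression work is still done twice, but the pages are read from the DB and serialized only once.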
Note that the all-history dumps are cumulative: each contains everything that was in the previous one, plus all the new versions. We might reconsider whether we want those at all, or make each one an incremental. (I'm not sure what these are for, exactly.) A dump taken over a period of several months is also hardly a snapshot; from a DB integrity POV it is nearly useless. But no matter.
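For what it's worth, if incrementals were wanted, the core of one could be as small as this sketch (the revision objects and to_xml() are placeholders, not the real dump code):

    def incremental_dump(revisions, previous_cutoff, out):
        # Emit only the revisions created after the previous dump's cutoff;
        # everything older is already in an earlier dump.
        for rev in revisions:
            if rev.timestamp > previous_cutoff:
                out.write(rev.to_xml())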
* the format of the all-history dump could be changed to store only differences (going backward from current) in each XML record (see the sketch below)
* or a variant of the 7z compressor could be used that knows where to search for the matching strings, rather than doing a general search; it would then be *much* faster. (As it is an LZ77-class method, this doesn't change the decompressor logic.)
Either of these last two would make the all-history dumps at least a couple of orders of magnitude faster.
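As a rough illustration of the reverse-delta idea above (keep the current text in full, store older revisions only as diffs), using plain unified diffs; a real format would want something much more compact than difflib output:

    import difflib

    def to_reverse_deltas(revision_texts):
        # revision_texts: the texts of one page's revisions, oldest first.
        # Returns (current_text, deltas), where deltas[i] turns revision i+1
        # back into revision i (walking backward from current).
        deltas = []
        for older, newer in zip(revision_texts, revision_texts[1:]):
            diff = difflib.unified_diff(newer.splitlines(keepends=True),
                                        older.splitlines(keepends=True))
            deltas.append("".join(diff))
        return revision_texts[-1], deltas

Because consecutive revisions of a page are usually near-identical, the deltas are tiny, which is where the large size and speed win would come from.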
best regards,
Robert