Robert Ullmann wrote:
It is good that we will have new disks and it likely won't get stuck; but that doesn't address the primary problem, which is the length of time these things take. Let me try to be more constructive.
It just parallelizes it ;)
The first thing is that the projects are hugely different in size. This causes a fundamental queuing problem: with n threads, and more than n huge tasks in the queue, the threads will all end up doing those. (We recently saw a number of days during which it was working on enwiki, frwiki, dewiki, and jawiki and nothing else.) This can be fixed with a thread that is restricted to smaller tasks, like the express lane in a market or a bank. (My bank has one teller only for deposits and withdrawals in 500s and 1000s notes, no other transactions.)
Seems reasonable.
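To make that concrete, here is a minimal sketch of such an express lane; the wiki list, the queue split and the run_dump stub are all made up for illustration, not the actual dump scheduler:

    import queue
    import threading

    BIG_WIKIS = {"enwiki", "dewiki", "frwiki", "jawiki"}   # hypothetical "huge task" list

    def run_dump(wiki):
        print("dumping", wiki)                             # stand-in for the real dump job

    def worker(q):
        while True:
            wiki = q.get()
            if wiki is None:                               # sentinel: stop this worker
                break
            run_dump(wiki)

    big_queue = queue.Queue()       # the huge wikis can monopolise these workers...
    express_queue = queue.Queue()   # ...while this lane keeps the small wikis moving

    threads = [threading.Thread(target=worker, args=(big_queue,)) for _ in range(2)]
    threads.append(threading.Thread(target=worker, args=(express_queue,)))
    for t in threads:
        t.start()

    for wiki in ["enwiki", "aawiki", "dewiki", "abwiki", "frwiki"]:
        (big_queue if wiki in BIG_WIKIS else express_queue).put(wiki)

    for q, n in ((big_queue, 2), (express_queue, 1)):      # one sentinel per worker
        for _ in range(n):
            q.put(None)
    for t in threads:
        t.join()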
Some observations then:
- the main articles dump is a subset of all pages. The latter might usefully
be only a dump of all the pages *not* in the first.
- alternatively, the process could dump the first, then copy the file and
continue with the others for the second (yes, one has to be careful with the bz2 compression state)
If you mean what I think you mean, it won't work.
- or it could write both at the same time, saving the DB access time if not
the compression time
There's one snapshot of the DB for the article content. All the metadata is extracted at one point (the stub-* files). Then it is filled in with content from the last full dump, fetching only the new revisions from the db.
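Roughly, the fill step works like the toy sketch below; the data structures are simplified stand-ins (dicts of revision id -> text), not the real stub/dump formats:

    # Metadata (the stubs) is fixed at one point; text comes from the previous
    # full dump when possible, and only newer revisions are read from the db.
    def fill_content(stub_revisions, previous_dump_text, db_text):
        output = []
        for rev_id in stub_revisions:
            text = previous_dump_text.get(rev_id)   # prefetch from the last full dump
            if text is None:                        # revision added since that dump
                text = db_text[rev_id]              # only these touch the database
            output.append((rev_id, text))
        return output

    # Revisions 1 and 2 come from the old dump; only 3 is fetched "from the db".
    pages = fill_content([1, 2, 3], {1: "old text", 2: "older text"}, {3: "new text"})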
- the all-history dump might be only the 7z. Yes, it takes longer than the
bz2, but direct to 7z will be much less total time.
- alternatively, write both bz2 and 7z at the same time (if we must have the
bz2, but I don't see why; methinks anyone would want the 7z)
AFAIK the 7z is produced by reading the bz2. It's much easier to recompress into a different format than to recreate the XML, and it puts much less load on the db servers.
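In other words, something along these lines, assuming the 7za binary is available (the filenames are illustrative and this is not the actual dump script):

    import bz2
    import subprocess

    def bz2_to_7z(bz2_path, sevenzip_path):
        # "a" adds to an archive; "-si" makes 7za read the data from stdin.
        proc = subprocess.Popen(["7za", "a", "-si", sevenzip_path],
                                stdin=subprocess.PIPE)
        with bz2.open(bz2_path, "rb") as src:
            while True:
                chunk = src.read(1 << 20)      # decompress and stream 1 MiB at a time
                if not chunk:
                    break
                proc.stdin.write(chunk)
        proc.stdin.close()
        proc.wait()

    # bz2_to_7z("pages-meta-history.xml.bz2", "pages-meta-history.xml.7z")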
- make the all-history dump(s) separate tasks, in separate queue(s); without
them the rest will go very well
That could work. But note that the time gap between the content and metadata such as templatelinks will be even greater.
Note that the all-history dumps are cumulative: each contains everything that was in the previous, plus all the new versions. We might reconsider whether we want those at all, or make each an incremental. (I'm not sure what these are for exactly)
So you would need all dumps since January (the first full one, then the incrementals) to get the status at August? That may be better or worse depending on what you'll do with the data.
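A toy illustration of that replay cost, with dump contents reduced to dicts of page title -> list of revisions:

    def replay(full_dump, incrementals):
        # Start from the last full dump and apply each incremental in order.
        state = {title: list(revs) for title, revs in full_dump.items()}
        for inc in incrementals:
            for title, new_revs in inc.items():
                state.setdefault(title, []).extend(new_revs)
        return state

    january = {"Foo": ["r1"]}                       # the full dump
    later = [{"Foo": ["r2"]}, {"Bar": ["r3"]}]      # monthly incrementals since then
    state_now = replay(january, later)              # {'Foo': ['r1', 'r2'], 'Bar': ['r3']}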
A dump that is taken over a several-month period is also hardly a snapshot; from a DB integrity POV it is nearly useless. But no matter.
See above. The history dump reflects the status at the beginning: over the course of a month you're getting the contents of the history as of that point. There is a discrepancy with the additional metadata, such as template and image usage. That's not easy to fix even if you wanted to, because even if you dumped those tables in the same transaction as the revision table, they would contain outdated information still waiting to be updated by the job queue.
- the format of the all-history dump could be changed to store only
differences (going backward from current) in each XML record
That has been proposed before for the db store. It was determined that there was little gain over just compressing. Moreover, it would make the process slower, as you would also need to diff the revisions. The worst case would be a history merge, where there are new intermediate revisions, so you would need to recover the full contents of each revision (from the db, or by undiffing the last dump) and diff them again.
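For reference, the reverse-delta scheme being discussed looks roughly like the sketch below; difflib is only a stand-in for whatever diff format would actually be used, and the comment at the end is the merge worst case mentioned above:

    import difflib

    def reverse_deltas(revisions):
        # revisions: oldest-to-newest list of texts.
        # Keep the newest text in full; store each older revision as a diff
        # against the next newer one.
        current = revisions[-1]
        deltas = []
        for older, newer in zip(revisions[:-1], revisions[1:]):
            delta = list(difflib.unified_diff(newer.splitlines(),
                                              older.splitlines(), lineterm=""))
            deltas.append(delta)        # delta i turns revision i+1 back into i
        return current, deltas

    texts = ["first draft", "first draft, expanded", "first draft, expanded and fixed"]
    current, deltas = reverse_deltas(texts)
    # A history merge that inserts revisions in the middle invalidates every
    # delta from that point back, so those revisions must be reconstructed in
    # full and re-diffed -- the slow path described above.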
- or a variant of the 7z compressor used that knows where to search for the
matching strings, rather than a general search; it would then be *much* faster. (as it is an LZ77-class method, this doesn't change the decompressor logic)
Could work. Are you volunteering to write it?
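Not volunteering the 7z patch either, but the principle can be illustrated with zlib preset dictionaries: prime the compressor with the previous revision so its strings are immediately available as match targets. zlib's 32 KB window makes this a toy for long pages; it only shows the idea, not the proposed 7z change:

    import zlib

    def compress_with_previous(revisions):
        out = []
        prev = b""
        for text in revisions:
            data = text.encode("utf-8")
            if prev:
                # Use (the tail of) the previous revision as a preset dictionary,
                # so the new revision can back-reference it from the first byte.
                c = zlib.compressobj(zdict=prev[-32768:])
            else:
                c = zlib.compressobj()
            out.append(c.compress(data) + c.flush())
            prev = data
        return out

    revs = ["Some article text.", "Some article text. Plus a new sentence."]
    blobs = compress_with_previous(revs)
    # Most of the second revision is encoded as matches against the dictionary.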
Either of these last two would make the all-history dumps at least a couple of orders of magnitude faster.
best regards, Robert