Robert Ullmann wrote:
It is good that we will have new disks and it likely won't get stuck; but
that doesn't address the primary problem of the length of time these things
take. Let me try to be more constructive.
It just parallelizes it ;)
First thing is that the projects are hugely different in size. This causes a
fundamental queuing problem: with n threads, and more than n huge tasks in
the queue, the threads will all end up doing those. (we recently saw a
number of days in which it was working on enwiki, frwiki, dewiki, and jawiki
and nothing else). This can be fixed with a thread that is restricted to
smaller tasks. Like in a market or a bank, with an express lane. (My bank
has one teller only for deposits and withdrawals in 500s and 1000s notes, no
other transactions.)
Seems reasonable.
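
A minimal sketch of the express-lane idea, as a toy simulation rather than anything the dump scheduler actually does. The function name, the job durations and the `express_limit` parameter are all made up for illustration; worker 0 plays the teller who only takes small transactions.

```python
import heapq

def simulate(jobs, n_workers, express_limit=None):
    """Return {name: finish_time} for jobs taken in queue order.

    jobs: list of (name, duration) in queue order (hypothetical numbers).
    express_limit: if set, worker 0 only takes jobs with
    duration <= express_limit; the other workers take anything.
    """
    pending = list(jobs)
    # Heap of (time at which the worker becomes free, worker id).
    free_at = [(0, w) for w in range(n_workers)]
    heapq.heapify(free_at)
    finished = {}
    while pending and free_at:
        now, worker = heapq.heappop(free_at)
        express = express_limit is not None and worker == 0
        # First queued job this worker is allowed to take.
        idx = next((i for i, (_, d) in enumerate(pending)
                    if not express or d <= express_limit), None)
        if idx is None:
            continue  # express worker has no small job left; it goes idle
        name, duration = pending.pop(idx)
        finished[name] = now + duration
        heapq.heappush(free_at, (now + duration, worker))
    return finished

jobs = [("enwiki", 100), ("frwiki", 60), ("dewiki", 70), ("jawiki", 50),
        ("tiny1", 1), ("tiny2", 1)]
plain = simulate(jobs, n_workers=3)
lane = simulate(jobs, n_workers=3, express_limit=5)
print(plain["tiny1"], lane["tiny1"])
```

With these invented durations the small wikis stop waiting behind the big ones, at the price of the big ones finishing later because they share one fewer worker.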
Some observations then:
* the main articles dump is a subset of all pages. The latter might usefully
be only a dump of all the pages *not* in the first.
* alternatively, the process could dump the first, then copy the file and
continue with the others for the second (yes, one has to be careful with the
bz2 compression state)
If you mean what I think you mean, it won't work.
* or it could write both at the same time, saving the DB access time if not
the compression time
There's one snapshot of the DB for the articles content. All the metadata
is extracted at one point (the stub-* files). Then it is filled in with
content from the last full dump, fetching only the new revisions from the db.
* the all-history dump might be only the 7z. Yes, it takes longer than the
bz2, but direct to 7z will be much less total time.
* alternatively, write both bz2 and 7z at the same time (if we must have the
bz2, but I don't see why; methinks anyone would want the 7z)
AFAIK the 7z step reads the bz2. It's much easier to recompress into a
different format than to recreate the XML. Plus it's much less load on
the db servers.
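
For comparison, the "write both at the same time" proposal could be sketched as below: the XML is generated once and each chunk is fed to two compressors. This is only an illustration, not how the dump scripts work. `write_both` is a hypothetical name, and Python's `lzma` module writes the .xz container rather than a .7z archive, so it stands in for the 7z side only in the sense that both use LZMA.

```python
import bz2
import lzma

def write_both(chunks, bz2_sink, xz_sink):
    """Compress one stream of byte chunks into two sinks in a single pass.

    bz2_sink and xz_sink are binary file objects opened for writing.
    They are left open; only the compressor wrappers are closed.
    """
    with bz2.BZ2File(bz2_sink, "wb") as bz, \
         lzma.LZMAFile(xz_sink, "wb") as xz:
        for chunk in chunks:
            bz.write(chunk)  # same bytes go to both compressors,
            xz.write(chunk)  # so the source is read only once
```

The pass is still limited by the slower of the two compressors, which is the "if not the compression time" caveat above.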
* make the all-history dump(s) separate tasks, in separate queue(s); without
them the rest will go very well
That could work. But note that the difference with metadata such as
templatelinks will be even greater.
Note that the all-history dumps are cumulative: each contains everything
that was in the previous, plus all the new versions. We might reconsider
whether we want those at all, or make each an incremental. (I'm not sure
what these are for exactly)
So you would need all the dumps since January (the first full one, then
the incrementals) to get the status as of August?
It may be better or worse depending on what you'll do with the data.
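
To make the cost concrete, here is a rough sketch of what a consumer of incremental dumps would have to do. The data layout (a dict of page title to a list of `(rev_id, text)` pairs) and the name `state_at` are invented for the example; real dumps are XML.

```python
def state_at(full_dump, incrementals):
    """Rebuild page histories from a full dump plus ordered incrementals.

    Each dump maps page title -> list of (rev_id, text) pairs.
    Every incremental must be present and applied in order.
    """
    # Copy the lists so the caller's full dump is not modified.
    state = {title: list(revs) for title, revs in full_dump.items()}
    for inc in incrementals:
        for title, revs in inc.items():
            state.setdefault(title, []).extend(revs)
    return state

january = {"A": [(1, "a1")]}
monthly = [
    {"A": [(2, "a2")], "B": [(3, "b1")]},  # hypothetical February delta
    {"A": [(4, "a3")]},                    # hypothetical March delta
]
print(state_at(january, monthly))
```

Dropping any one incremental silently loses the revisions (and new pages) it carried, which is the trade-off against cumulative dumps.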
A dump that is taken over a several-month period is also hardly a snapshot;
from a DB integrity POV it is nearly useless. But no matter.
See above. The history dump reflects the status at the beginning. You're
just spending a month fetching the contents of that history.
There is a difference with the additional metadata, such as template and
image usage. Not easy to fix if you wanted to, because even if you
dumped them in the same transaction as the revision table, it will
contain outdated information to be updated by the job queue.
* the format of the all-history dump could be changed to store only
differences (going backward from current) in each XML record
Has been proposed before for the db store. It was determined that there
was little difference compared with just compressing.
Moreover, it would make the process slower, as you would also need to
diff the revisions. The worst case would be a history merge, where
there are new intermediate revisions, so you need to recover the full
contents of each revision (from the db or by undiffing the last dump) and
diff it again.
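
A small sketch of the backward-diff storage being discussed, to show where the extra work comes from. It uses `difflib.SequenceMatcher` on lists of lines; the delta format and all function names are invented here and are not a proposed dump format.

```python
from difflib import SequenceMatcher

def make_delta(new, old):
    """Describe `old` as copies from `new` plus literal old lines."""
    delta = []
    matcher = SequenceMatcher(None, new, old, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))       # reuse new[i1:i2]
        elif j2 > j1:
            delta.append(("data", old[j1:j2]))   # lines only in `old`
        # 'delete' opcodes are lines only in `new`: nothing to store
    return delta

def apply_delta(new, delta):
    """Rebuild the older revision from the newer one and its delta."""
    out = []
    for op in delta:
        if op[0] == "copy":
            out.extend(new[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out

def encode_history(revisions):
    """revisions: oldest-to-newest list of line lists (non-empty).

    Returns (current, deltas); deltas[k] rebuilds revision k from k+1.
    """
    deltas = [make_delta(revisions[k + 1], revisions[k])
              for k in range(len(revisions) - 1)]
    return revisions[-1], deltas

def decode_history(current, deltas):
    """Walk backward from the current revision, undoing one delta at a time."""
    revs = [current]
    for delta in reversed(deltas):
        revs.append(apply_delta(revs[-1], delta))
    revs.reverse()
    return revs
```

Note that reading any old revision means applying every delta between it and the current one, and a history merge that inserts intermediate revisions forces the neighbouring deltas to be recomputed.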
* or use a variant of the 7z compressor that knows where to search for the
matching strings, rather than doing a general search; it would then be *much*
faster. (as it is an LZ77-class method, this doesn't change the decompressor
logic)
Could work. Are you volunteering to write it?
Either of these last two would make the all-history dumps at least a couple
of orders of magnitude faster.
best regards,
Robert