Robert Ullmann wrote:
It is good that we will have new disks and it likely won't get stuck; but that doesn't address the primary problem, which is the length of time these things take. Let me try to be more constructive.
It just parallelizes it ;)
The first thing is that the projects are hugely different in size. This causes a fundamental queuing problem: with n threads, and more than n huge tasks in the queue, the threads will all end up doing those. (We recently saw a number of days during which it was working on enwiki, frwiki, dewiki, and jawiki and nothing else.) This can be fixed with a thread that is restricted to smaller tasks, like the express lane in a market or a bank. (My bank has one teller only for deposits and withdrawals in 500s and 1000s notes, no other transactions.)
Seems reasonable.
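To make that concrete, here is a minimal sketch of such an express lane; the wiki list, the queue split and the run_dump stub are all made up for illustration, not the actual dump scheduler:

    import queue
    import threading

    BIG_WIKIS = {"enwiki", "dewiki", "frwiki", "jawiki"}   # hypothetical "huge task" list

    def run_dump(wiki):
        print("dumping", wiki)                             # stand-in for the real dump job

    def worker(q):
        while True:
            wiki = q.get()
            if wiki is None:                               # sentinel: stop this worker
                break
            run_dump(wiki)

    big_queue = queue.Queue()       # the huge wikis can monopolise these workers...
    express_queue = queue.Queue()   # ...while this lane keeps the small wikis moving

    threads = [threading.Thread(target=worker, args=(big_queue,)) for _ in range(2)]
    threads.append(threading.Thread(target=worker, args=(express_queue,)))
    for t in threads:
        t.start()

    for wiki in ["enwiki", "aawiki", "dewiki", "abwiki", "frwiki"]:
        (big_queue if wiki in BIG_WIKIS else express_queue).put(wiki)

    for q, n in ((big_queue, 2), (express_queue, 1)):      # one sentinel per worker
        for _ in range(n):
            q.put(None)
    for t in threads:
        t.join()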
Some observations then:
- the main articles dump is a subset of all pages. The latter might usefully
be only a dump of all the pages *not* in the first.
- alternatively, the process could dump the first, then copy the file and
continue with the others for the second (yes, one has to be careful with the bz2 compression state)
If you mean what I think you mean, it won't work.
- or it could write both at the same time, saving the DB access time if not
the compression time
There's one snapshot of the DB for the article content. All the metadata is extracted at one point (the stub-* files). Then it is filled in with content from the last full dump, fetching only the new revisions from the db.
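Roughly, the fill step works like the toy sketch below; the data structures are simplified stand-ins (dicts of revision id -> text), not the real stub/dump formats:

    # Metadata (the stubs) is fixed at one point; text comes from the previous
    # full dump when possible, and only newer revisions are read from the db.
    def fill_content(stub_revisions, previous_dump_text, db_text):
        output = []
        for rev_id in stub_revisions:
            text = previous_dump_text.get(rev_id)   # prefetch from the last full dump
            if text is None:                        # revision added since that dump
                text = db_text[rev_id]              # only these touch the database
            output.append((rev_id, text))
        return output

    # Revisions 1 and 2 come from the old dump; only 3 is fetched "from the db".
    pages = fill_content([1, 2, 3], {1: "old text", 2: "older text"}, {3: "new text"})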
- the all-history dump might be only the 7z. Yes, it takes longer than the
bz2, but direct to 7z will be much less total time.
- alternatively, write both bz2 and 7z at the same time (if we must have the
bz2, but I don't see why; methinks anyone would want the 7z)
AFAIK the 7z is produced by reading the bz2. It's much easier to recompress into a different format than to recreate the XML, and it puts much less load on the db servers.
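In other words, something along these lines, assuming the 7za binary is available (the filenames are illustrative and this is not the actual dump script):

    import bz2
    import subprocess

    def bz2_to_7z(bz2_path, sevenzip_path):
        # "a" adds to an archive; "-si" makes 7za read the data from stdin.
        proc = subprocess.Popen(["7za", "a", "-si", sevenzip_path],
                                stdin=subprocess.PIPE)
        with bz2.open(bz2_path, "rb") as src:
            while True:
                chunk = src.read(1 << 20)      # decompress and stream 1 MiB at a time
                if not chunk:
                    break
                proc.stdin.write(chunk)
        proc.stdin.close()
        proc.wait()

    # bz2_to_7z("pages-meta-history.xml.bz2", "pages-meta-history.xml.7z")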
- make the all-history dump(s) separate tasks, in separate queue(s); without
them the rest will go very well
That could work. But note that the time gap between the content and metadata such as templatelinks will be even greater.
Note that the all-history dumps are cumulative: each contains everything that was in the previous, plus all the new versions. We might reconsider whether we want those at all, or make each an incremental. (I'm not sure what these are for exactly)
So you would need all dumps since January (the first full one, then the incrementals) to get the status at August? That may be better or worse depending on what you'll do with the data.
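A toy illustration of that replay cost, with dump contents reduced to dicts of page title -> list of revisions:

    def replay(full_dump, incrementals):
        # Start from the last full dump and apply each incremental in order.
        state = {title: list(revs) for title, revs in full_dump.items()}
        for inc in incrementals:
            for title, new_revs in inc.items():
                state.setdefault(title, []).extend(new_revs)
        return state

    january = {"Foo": ["r1"]}                       # the full dump
    later = [{"Foo": ["r2"]}, {"Bar": ["r3"]}]      # monthly incrementals since then
    state_now = replay(january, later)              # {'Foo': ['r1', 'r2'], 'Bar': ['r3']}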
A dump that is taken over a several-month period is also hardly a snapshot; from a DB integrity POV it is nearly useless. But no matter.
See above. The history dump reflects the status at the beginning: over the course of a month you're getting the contents of the history as of that point. There is a discrepancy with the additional metadata, such as template and image usage. That's not easy to fix even if you wanted to, because even if you dumped those tables in the same transaction as the revision table, they would contain outdated information still waiting to be updated by the job queue.
- the format of the all-history dump could be changed to store only
differences (going backward from current) in each XML record
That has been proposed before for the db store. It was determined that there was little gain over just compressing. Moreover, it would make the process slower, as you would also need to diff the revisions. The worst case would be a history merge, where there are new intermediate revisions, so you would need to recover the full contents of each revision (from the db, or by undiffing the last dump) and diff them again.
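For reference, the reverse-delta scheme being discussed looks roughly like the sketch below; difflib is only a stand-in for whatever diff format would actually be used, and the comment at the end is the merge worst case mentioned above:

    import difflib

    def reverse_deltas(revisions):
        # revisions: oldest-to-newest list of texts.
        # Keep the newest text in full; store each older revision as a diff
        # against the next newer one.
        current = revisions[-1]
        deltas = []
        for older, newer in zip(revisions[:-1], revisions[1:]):
            delta = list(difflib.unified_diff(newer.splitlines(),
                                              older.splitlines(), lineterm=""))
            deltas.append(delta)        # delta i turns revision i+1 back into i
        return current, deltas

    texts = ["first draft", "first draft, expanded", "first draft, expanded and fixed"]
    current, deltas = reverse_deltas(texts)
    # A history merge that inserts revisions in the middle invalidates every
    # delta from that point back, so those revisions must be reconstructed in
    # full and re-diffed -- the slow path described above.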
- or a variant of the 7z compressor used that knows where to search for the
matching strings, rather than a general search; it would then be *much* faster. (as it is an LZ77-class method, this doesn't change the decompressor logic)
Could work. Are you volunteering to write it?
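Not volunteering the 7z patch either, but the principle can be illustrated with zlib preset dictionaries: prime the compressor with the previous revision so its strings are immediately available as match targets. zlib's 32 KB window makes this a toy for long pages; it only shows the idea, not the proposed 7z change:

    import zlib

    def compress_with_previous(revisions):
        out = []
        prev = b""
        for text in revisions:
            data = text.encode("utf-8")
            if prev:
                # Use (the tail of) the previous revision as a preset dictionary,
                # so the new revision can back-reference it from the first byte.
                c = zlib.compressobj(zdict=prev[-32768:])
            else:
                c = zlib.compressobj()
            out.append(c.compress(data) + c.flush())
            prev = data
        return out

    revs = ["Some article text.", "Some article text. Plus a new sentence."]
    blobs = compress_with_previous(revs)
    # Most of the second revision is encoded as matches against the dictionary.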
Either of these last two would make the all-history dumps at least a couple of orders of magnitude faster.
best regards, Robert