Re: [Wikitech-l] Dumps Still Stuck (since 7/1)?

2 Aug 2008

Robert Ullmann wrote:
...
 It is good that we will have new disks and it likely won't get stuck; but
 that doesn't address the primary problem of the length of time these things
 take. Let me try to be more constructive.
It just parallelizes it ;)

...
 First thing is that the projects are hugely different in size. This causes a
 fundamental queuing problem: with n threads, and more than n huge tasks in
 the queue, the threads will all end up doing those. (we recently saw a
 number of days in which it was working on enwiki, frwiki, dewiki, and jawiki
 and nothing else). This can be fixed with a thread that is restricted to
 smaller tasks. Like in a market or a bank, with an express lane. (My bank
 has one teller only for deposits and withdrawals in 500s and 1000s notes, no
 other transactions.) 
Seems reasonable.
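
A minimal sketch of the express-lane idea, as a toy simulation rather than anything resembling the real dump scheduler. The job durations, the `express_limit` threshold and the function name `simulate` are all made up for illustration.

```python
import heapq

def simulate(jobs, n_workers, express_limit=None):
    """Return {job name: finish time} for a FIFO queue served by n_workers.

    If express_limit is set, worker 0 only accepts jobs whose duration is
    at most express_limit (the "express lane"); other workers take anything.
    """
    pending = list(jobs)                       # FIFO order
    free = [(0, w) for w in range(n_workers)]  # (time the worker is free, id)
    heapq.heapify(free)
    finish = {}
    while pending and free:
        now, w = heapq.heappop(free)
        restricted = express_limit is not None and w == 0
        for i, (name, dur) in enumerate(pending):
            if not restricted or dur <= express_limit:
                pending.pop(i)
                finish[name] = now + dur
                heapq.heappush(free, (now + dur, w))
                break
        # A worker that finds no eligible job is not pushed back (it retires).
    return finish

# Hypothetical durations in arbitrary units, huge wikis first in the queue.
jobs = [("enwiki", 100), ("frwiki", 80), ("dewiki", 90), ("jawiki", 70),
        ("smallwiki1", 1), ("smallwiki2", 1)]

plain = simulate(jobs, n_workers=3)
express = simulate(jobs, n_workers=3, express_limit=10)
print(plain["smallwiki1"], express["smallwiki1"])
```

With these numbers the small wikis finish almost immediately in the express run instead of waiting behind the big ones, while the last big wiki finishes later because one worker no longer helps with them. That is the trade-off the restricted thread makes.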

...
  Some observations then:

 * the main articles dump is a subset of all pages. The latter might usefully
 be only a dump of all the pages *not* in the first.
 * alternatively, the process could dump the first, then copy the file and
 continue with the others for the second (yes, one has to be careful with the
 bz2 compression state)
If you mean what I think you mean, it won't work.

...
 * or it could write both at the same time, saving the DB access time if not
 the compression time
There's one snapshot of the DB for the article content. All the metadata is
extracted at one point (the stub-* files). Then it is filled with content
from the last full dump, with new revisions fetched from the db.
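
The fill step described above can be sketched roughly as follows. This is only an illustration of the idea: `fill_stubs`, `previous_dump_text` and `fetch_from_db` are hypothetical names, and the real dump code works on XML streams rather than dictionaries.

```python
def fill_stubs(stub_rev_ids, previous_dump_text, fetch_from_db):
    """Attach text to every revision id listed in the stub file.

    Text is taken from the previous full dump when available; only
    revisions missing there are fetched from the database.
    """
    filled = {}
    db_fetches = 0
    for rev_id in stub_rev_ids:
        text = previous_dump_text.get(rev_id)
        if text is None:
            text = fetch_from_db(rev_id)   # hypothetical DB accessor
            db_fetches += 1
        filled[rev_id] = text
    return filled, db_fetches
```

The point of the design is that the number of database reads depends on how many revisions are new since the last dump, not on the total size of the history.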

...
 * the all-history dump might be only the 7z. Yes, it takes longer than the
 bz2, but direct to 7z will be much less total time.
 * alternatively, write both bz2 and 7z at the same time (if we must have the
 bz2, but I don't see why; methinks anyone would want the 7z)
AFAIK the 7z is produced by reading the bz2. It's much easier to recompress
into a different format than to recreate the XML. Plus it's much less load on
the db servers.
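
To illustrate recompressing an existing dump instead of regenerating the XML, here is a small sketch using only the Python standard library. Note the assumption: Python's `lzma` module writes `.xz` files, not `.7z` archives, so this shows the streaming pattern with a related LZMA-based format, not the actual 7z step. The function name and paths are illustrative.

```python
import bz2
import lzma
import shutil

def recompress_bz2_to_xz(src_path, dst_path, chunk_size=1 << 20):
    """Stream-decompress a .bz2 file and recompress it as .xz.

    Nothing but the existing bz2 file is read, so the database is
    never touched, and memory use stays bounded by chunk_size.
    """
    with bz2.open(src_path, "rb") as src, lzma.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)
```

A call such as `recompress_bz2_to_xz("pages.xml.bz2", "pages.xml.xz")` would run entirely on the machine holding the file.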

...
 * make the all-history dump(s) separate tasks, in separate queue(s); without
 them the rest will go very well
That could work. But note that the difference with metadata such as
templatelinks will be even greater.

...
 Note that the all-history dumps are cumulative: each contains everything
 that was in the previous, plus all the new versions. We might reconsider
 whether we want those at all, or make each an incremental. (I'm not sure
 what these are for exactly)
So you would need all dumps since January (the first full one, then the
incrementals) to get the status at August?
It may be better or worse depending on what you'll do with the data.
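
A toy sketch of what a consumer of incremental dumps would have to do, assuming each dump is reduced to a mapping from page title to a list of revisions. The data model and the name `rebuild_history` are invented for the example.

```python
def rebuild_history(full_dump, incrementals):
    """Rebuild the complete history from one full dump plus incrementals.

    full_dump and each incremental map page title -> list of revisions.
    Incrementals must be supplied in chronological order, and none may
    be missing, or the result silently lacks revisions.
    """
    history = {page: list(revs) for page, revs in full_dump.items()}
    for inc in incrementals:
        for page, new_revs in inc.items():
            history.setdefault(page, []).extend(new_revs)
    return history
```

With cumulative dumps the latest file alone is enough; with incrementals, every file since the last full dump has to be downloaded and applied in order.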

...
 A dump that is taken over a several month
 period is also hardly a snapshot, from a DB integrity POV it is nearly
 useless. But no matter.
See above. The history dump reflects the status at the beginning. It just
takes a month to get through the contents of that history.
There is a difference with the additional metadata, such as template and
image usage. It would not be easy to fix even if you wanted to, because even
if you dumped them in the same transaction as the revision table, they would
contain outdated information still waiting to be updated by the job queue.

...
 * the format of the all-history dump could be changed to store only
 differences (going backward from current) in each XML record
This has been proposed before for the db store. It was determined that there
was little difference compared with just compressing.
Moreover, it would make the process slower, as you would also need to
diff the revisions. The worst case would be a history merge, where
there are new intermediate revisions, so you need to recover the full
contents of each revision (from the db or by undiffing the last dump) and
diff it again.
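
For concreteness, a small sketch of backward-delta storage using Python's `difflib`. It is an illustration of the general technique under simple assumptions (revisions as lists of lines, a home-made copy/insert delta format) and not a proposal for the actual dump format. All function names are made up.

```python
import difflib

def make_delta(newer, older):
    """Encode `older` as copy/insert operations against `newer`."""
    ops = []
    sm = difflib.SequenceMatcher(a=newer, b=older, autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))          # reuse lines of `newer`
        elif tag in ("replace", "insert"):
            ops.append(("insert", older[j1:j2]))  # lines only in `older`
        # "delete": lines only in `newer` are simply skipped
    return ops

def apply_delta(newer, ops):
    """Rebuild the older revision from the newer one and its delta."""
    older = []
    for op in ops:
        if op[0] == "copy":
            older.extend(newer[op[1]:op[2]])
        else:
            older.extend(op[1])
    return older

def encode_history(revisions):
    """revisions: oldest first. Returns (current text, backward deltas)."""
    current = revisions[-1]
    deltas = [make_delta(revisions[i + 1], revisions[i])
              for i in range(len(revisions) - 2, -1, -1)]
    return current, deltas

def decode_history(current, deltas):
    """Inverse of encode_history: returns all revisions, oldest first."""
    revs = [current]
    for ops in deltas:
        revs.append(apply_delta(revs[-1], ops))
    revs.reverse()
    return revs
```

The history-merge problem shows up directly here: each delta only makes sense relative to its newer neighbour, so inserting a revision in the middle means decoding the chain back to that point and recomputing the deltas on both sides of the new revision.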

...
 * or a variant of the 7z compressor used that knows where to search for the
 matching strings, rather than a general search; it would then be *much*
 faster. (as it is an LZ77-class method, this doesn't change the decompressor
 logic) 
Could work. Are you volunteering to write it?

...
 Either of these last two would make the all-history dumps at least a couple
 of orders of magnitude faster.

 best regards,
 Robert 
