OK, this topic has already been discussed in a previous thread. Actually, Brion has
already put down some thoughts about possible solutions on his blog:
http://leuksman.com/log/2007/10/02/wiki-data-dumps/
Hopefully, we'll finally reach a consensus on a solution.
BTW:
1. Many people do research using the info in the whole meta-history dumps. This is why it
is so critical.
2. We should guarantee that we have a valid copy of the whole revision history of every
page, firstly for backup purposes in case (I hope it never happens) something someday
goes wrong with the databases, and secondly to allow
anyone interested in looking up any of those revisions to do so.
Incremental backups are, in my view, a good idea, but as Gregory has pointed out, it is
difficult to provide permanent access to the big initial dump, or to the complete
collection of fragments. However, I think that is more a matter of convenience in the
creation and recovery process. The problem is that, with a single huge file, there are
more chances for an error in the DB access to occur.
Regards,
Felipe.
Gregory Maxwell <gmaxwell(a)gmail.com> wrote: Bleh. Someone pulling increments
couldn't build a point-in-time
snapshot; they would need to always pull the full dump. And we want people
using point-in-time versions of the site, not mangled mixes.
Also, I expect that once 7zipped, the increments will not be much
smaller than the full dump, especially if partitioned by revid.
On 10/19/07, Platonides wrote:
Lars Aronsson wrote:
Or is it already done this way, behind the scenes, only that it
isn't visible from the outside?
No.
AFAIK it is done as follows:
Precondition: The last full dump (if not present, treat as empty).
1- Take a snapshot of the wiki status (page table?) and create
stub-meta-history.
2- Read stub-meta-history and fill in the page content using the last dump's
page contents. If a page's content is not in the previous dump, get it from
external storage in a blocking way.
Result: a bzip2-compressed full-history dump.
The bzip2 dump is then uncompressed and 7zipped.
If there's an error in a call to external storage, the process can't
be resumed and the dump fails.
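As a rough sketch, the current flow described above might look like this (all names here are hypothetical, made up for illustration — the real dumper is not structured this way):

```python
# Hypothetical sketch of the current dump flow: fill stub-meta-history
# with revision text, reusing the previous full dump where possible,
# and write a bzip2-compressed result. All names are illustrative.
import bz2
from dataclasses import dataclass

@dataclass
class Rev:
    revid: int
    page: str

class ExternalStorage:
    """Stands in for the external text storage; a failed fetch
    aborts the whole dump, since the process can't be resumed."""
    def __init__(self, texts):
        self.texts = texts

    def fetch(self, revid):
        return self.texts[revid]  # raises KeyError on a storage error

def build_full_history_dump(stubs, prev_dump, ext_storage, out_path):
    """Step 2: fill the stubs with text and write a bzip2 dump."""
    with bz2.open(out_path, "wt", encoding="utf-8") as out:
        for rev in stubs:                     # from stub-meta-history
            text = prev_dump.get(rev.revid)   # reuse the last full dump
            if text is None:                  # otherwise block on storage
                text = ext_storage.fetch(rev.revid)
            out.write(f"{rev.page}\t{rev.revid}\t{text}\n")
    # The bzip2 dump would then be decompressed and recompressed with 7zip.
```

Note how a single KeyError in `ExternalStorage.fetch` kills the entire run — that is the failure mode being discussed.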
I have recently been thinking about this, and I think it could be done like this:
Precondition: The last full dump (if not present, treat as empty) and
its greatest revid.
1a- Take a snapshot of the wiki status (page table?) and create
stub-meta-history.
1b- While reading the revisions, if a revid is greater than the
last dump's greatest revid (LDGR), add it to one of N files (one file per M revisions).
2- Run N processes grabbing these page contents. Store them in a
new-format dump (the external storage equivalent), one per revid list
file. If one fails, just rerun it.
3- Read stub-meta-history and fill in the page content using the last dump's
page contents. If a page's text is not in the previous dump, grab it from the
list file if revid > LDGR; otherwise, get it from external storage, saving
it to a different file.
Revisions present in neither the last dump nor the incremental dumps will
occur on restored pages, and can still block the process, but since there
are far fewer of them, failures become much less likely.
4- Save the new dump's LDGR along with the new bzipped dump.
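The incremental steps above (1b through 3) could be sketched roughly as follows. Everything here is a hypothetical illustration of the scheme, not real dumper code; M and N, the list files, and the fallback order are as described in the steps:

```python
# Minimal sketch of the proposed incremental scheme (steps 1b-3).
# All names are hypothetical. Revids newer than the last dump's
# greatest revid (LDGR) are split into list files of M revisions each;
# each file is fetched independently and can simply be rerun on failure.
def partition_new_revids(revids, ldgr, m):
    """Step 1b: group revids greater than LDGR into files of M each."""
    new = sorted(r for r in revids if r > ldgr)
    return [new[i:i + m] for i in range(0, len(new), m)]

def fetch_list_file(revids, ext_storage):
    """Step 2: one worker per list file; if it fails, just rerun it."""
    return {rid: ext_storage[rid] for rid in revids}

def fill_revision(revid, prev_dump, list_files, ext_storage, ldgr):
    """Step 3: prefer the last dump, then the incremental list files,
    and only fall back to external storage (e.g. for restored pages)."""
    if revid in prev_dump:
        return prev_dump[revid]
    if revid > ldgr:
        for lf in list_files:
            if revid in lf:
                return lf[revid]
    return ext_storage[revid]  # rare case: revision in neither dump
```

The point of the design is that the blocking external-storage call in `fill_revision` is only hit for the rare restored-page case, so a failure there is much less likely to kill the run.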
By making the M+1 incremental dumps available, along with the smaller
stub-meta-history, the latest dump can be recreated from the previous one
(= less download size).
Wikimedia would still provide the full dumps, but you would only need
them the first time.
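The recreation step on the downloader's side could then look roughly like this (again purely hypothetical names, under the assumption that the incremental files map revids to text):

```python
# Hypothetical sketch of recreating the latest full dump from the
# previous full dump plus the incremental list files, driven by the
# new (smaller) stub-meta-history snapshot.
def recreate_full_dump(new_stub_revids, prev_dump, incremental_files):
    """Merge the previous dump with the incrementals, then keep only
    the revids listed in the new stub snapshot."""
    texts = dict(prev_dump)            # start from the previous full dump
    for inc in incremental_files:      # apply each incremental list file
        texts.update(inc)
    return {rid: texts[rid] for rid in new_stub_revids if rid in texts}
```

So a downloader pulls only the new stubs and the incrementals, not the whole full dump again.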
Comments?
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
http://lists.wikimedia.org/mailman/listinfo/wikitech-l