It already works that way on the backend, pretty much.
We can't make the old increments available forever because of content
we are obligated to stop distributing, so incremental dumps would not
be that useful to users.
On 10/19/07, Lars Aronsson <lars(a)aronsson.se> wrote:
In recent weeks I have been following the database dumps of
some languages of Wikipedia. I download and analyze a dump, make
various improvements, and then wait for the next dump to become
available for a new analysis. There are 2 or 3 weeks between
dumps. There appear to be two parallel dump processes running
continuously,
http://download.wikimedia.org/backup-index.html
What takes most time in each dump is the large file with complete
version history, pages-meta-history.xml.bz2 and
pages-meta-history.xml.7z
This is the largest file in compressed format, but since it
contains every version of every article it is also very highly
compressed, and expands to become enormous. I guess that very few
people find use for this file. In addition, only a very small
portion of its contents is changed between two dumps. So we spend
a lot of time and effort (and delay of other things) in order to
create very little for very few users.
I think that this dump should be made incremental. Every week,
only that week's additional versions need to be dumped. This can
then be added to the dump of the previous week, the week before
that, etc., which hasn't really changed. This way, the dump
process could be made much faster, and the two parallel dump
processes would complete the cycle in less time, so new dumps of
the same project could be made available more frequently.
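A minimal sketch of how such a weekly cycle might look, assuming
revisions carry a timestamp we can filter on (the function and file
names below are made up for illustration; this is not the actual
dump code):

import shutil

def dump_new_revisions(revisions, since, out_path):
    """Write only the revisions newer than `since`, the previous dump's cut-off."""
    with open(out_path, "w", encoding="utf-8") as out:
        for rev in revisions:  # each rev: {'timestamp': ..., 'xml': ...}
            if rev["timestamp"] > since:
                out.write(rev["xml"])

def append_increment(cumulative_path, increment_path):
    """Concatenate this week's slice onto the running full-history file."""
    with open(cumulative_path, "ab") as cum, open(increment_path, "rb") as inc:
        shutil.copyfileobj(inc, cum)

if __name__ == "__main__":
    # Toy data standing in for the revision table.
    revisions = [
        {"timestamp": "2007-10-10T00:00:00Z", "xml": "<revision>old</revision>\n"},
        {"timestamp": "2007-10-18T00:00:00Z", "xml": "<revision>new</revision>\n"},
    ]
    dump_new_revisions(revisions, since="2007-10-12T00:00:00Z",
                       out_path="increment-2007-10-19.xml")
    append_increment("pages-meta-history-cumulative.xml",
                     "increment-2007-10-19.xml")

One practical wrinkle is compression: concatenated bzip2 files are
still valid multi-stream archives, so in principle the cumulative
.bz2 could be extended the same way, while the .7z would have to be
rebuilt each time.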
Or is it already done this way, behind the scenes, only that it
isn't visible from the outside?
--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik -
http://aronsson.se