Lars Aronsson wrote:
> Or is it already done this way, behind the scenes, only that it
> isn't visible from the outside?
No.
AFAIK it is done as follows:
Precondition: The last full dump (if not present, treat as empty).
1- Take a snapshot of the wiki state (page table?) and create
stub-meta-history.
2- Read stub-meta-history and fill in each revision's text from the
last dump's contents. If a revision's text is not in the previous dump,
fetch it from external storage with a blocking call.
Result: a bzip2-compressed full-history dump.
The bzip2 dump is then decompressed and recompressed with 7zip.
If a call to external storage fails, the process cannot be resumed and
the whole dump fails.
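
In rough terms, the text pass looks like this (a minimal Python sketch
only; the revision record, fetch_from_external_storage and the
tab-separated output are stand-ins I made up, not the actual dump
scripts):

import bz2
from dataclasses import dataclass

@dataclass
class StubRevision:
    rev_id: int
    page_title: str

def fetch_from_external_storage(rev_id):
    # Stand-in for the blocking call to external storage.  In the real
    # process an error here cannot be skipped, so the whole dump fails.
    return "text of revision %d" % rev_id

def fill_text_pass(stub_revisions, prev_dump_texts, out_path):
    # Walk stub-meta-history in order and attach a text to every revision.
    with bz2.open(out_path, "wt") as out:
        for rev in stub_revisions:
            text = prev_dump_texts.get(rev.rev_id)              # reuse last full dump
            if text is None:
                text = fetch_from_external_storage(rev.rev_id)  # blocking
            out.write("%d\t%s\t%s\n" % (rev.rev_id, rev.page_title, text))

# Toy usage: two revisions already in the previous dump, one new.
stubs = [StubRevision(1, "Foo"), StubRevision(2, "Foo"), StubRevision(3, "Bar")]
fill_text_pass(stubs, {1: "text 1", 2: "text 2"}, "pages-meta-history.bz2")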
I have been thinking about this recently, and I think it could be done
like this:
Precondition: The last full dump (if not present, treat as empty) and
its greatest revid.
1a- Take a snapshot of the wiki state (page table?) and create
stub-meta-history.
1b- While reading the revisions, if a revid is greater than the last
dump's greatest revid (LDGR), add it to one of N list files (one file
per M revisions).
2- Run N processes that fetch those page texts and store them in a
new-format dump (the external-storage equivalent), one output file per
revid list file. If one process fails, just rerun it.
3- Read stub-meta-history and fill in each revision's text from the
last dump's contents. If a text is not in the previous dump and its
revid > LDGR, take it from the corresponding list-file dump; otherwise
fetch it from external storage, saving it to a separate file.
Revisions present in neither the last dump nor the incremental dumps
should only occur for restored pages. They can still block the process,
but since there are far fewer of them, a failure is much less likely.
4- Save the new dump's LDGR together with the new bzip2 dump.
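
To make the scheme concrete, here is a rough Python sketch of steps
1b-3. Everything in it is illustrative: M, LDGR, the list-file layout
and fetch_from_external_storage are invented names, and in practice
step 2 would be N independent, rerunnable processes rather than a
single loop.

import bz2, json

M = 1000             # revisions per incremental list file (assumption)
LDGR = 1_000_000     # greatest rev_id covered by the previous full dump

def fetch_from_external_storage(rev_id):
    return "text of revision %d" % rev_id      # stand-in

def write_revid_lists(new_rev_ids, prefix):
    # Step 1b: split the revids newer than LDGR into files of M each.
    paths = []
    for i in range(0, len(new_rev_ids), M):
        path = "%s-%05d.json" % (prefix, i // M)
        with open(path, "w") as f:
            json.dump(new_rev_ids[i:i + M], f)
        paths.append(path)
    return paths

def fetch_list(list_path, out_path):
    # Step 2: fetch the texts for one list file.  Each list is handled by
    # its own process; if it fails, only this one file has to be rerun.
    with open(list_path) as f:
        rev_ids = json.load(f)
    with bz2.open(out_path, "wt") as out:
        for rev_id in rev_ids:
            out.write("%d\t%s\n" % (rev_id, fetch_from_external_storage(rev_id)))

def merge_pass(stub_rev_ids, prev_dump_texts, incr_texts, out_path, extra_path):
    # Step 3: previous dump first, then the incremental files, and only as
    # a last resort (restored pages, etc.) a blocking external-storage call,
    # whose result is saved separately so it never has to be fetched twice.
    with bz2.open(out_path, "wt") as out, bz2.open(extra_path, "wt") as extra:
        for rev_id in stub_rev_ids:
            text = prev_dump_texts.get(rev_id)
            if text is None and rev_id > LDGR:
                text = incr_texts.get(rev_id)
            if text is None:
                text = fetch_from_external_storage(rev_id)
                extra.write("%d\t%s\n" % (rev_id, text))
            out.write("%d\t%s\n" % (rev_id, text))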
By making the N incremental dumps available, together with the much
smaller stub-meta-history, the latest full dump can be recreated from
the previous one (= less download size).
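
From the downloader's side that would look roughly like this (again
only a sketch, with invented file names and the toy tab-separated
format from the snippets above):

import bz2

def load_texts(path):
    # Read rev_id -> text pairs from one of the sketch-format dumps.
    texts = {}
    with bz2.open(path, "rt") as f:
        for line in f:
            rev_id, text = line.rstrip("\n").split("\t", 1)
            texts[int(rev_id)] = text
    return texts

def rebuild_texts(prev_dump_path, incr_paths):
    # Combine the previous full dump (the one big download) with the small
    # incremental downloads; the result is then fed to a step-3 style merge
    # pass over the new stub-meta-history, as sketched above.
    texts = load_texts(prev_dump_path)
    for path in incr_paths:
        texts.update(load_texts(path))
    return texts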
Wikimedia would still provide the full dumps, but you would only need
to download one the first time.
Comments?