Switching the dump format from SQL to XML opens the possibility of
very easy incremental dumps.
I ran some tests on Polish Wikiquote database dumps (most
wikis have only a single XML dump so far; pl.q has two).
=== Raw data ===
Full dumps, compressed:
20050713_pages_current.xml.gz 1 105 966
20050814_pages_current.xml.gz 1 187 754
20050713_pages_full.xml.gz 3 796 056
20050814_pages_full.xml.gz 4 213 036
Full dumps, uncompressed:
20050713_pages_current.xml 3 874 225
20050814_pages_current.xml 4 158 544
20050713_pages_full.xml 30 882 758
20050814_pages_full.xml 33 708 595
Diffs - slow, and their compressed size saves hardly anything over the compressed dumps:
current.diff 524 191
current.diff.gz 134 817 (11.5%)
full.diff 28 371 190
full.diff.gz 3 859 489 (91.6%) - I suspect this terrible result is due to
the data being reordered between the two dumps
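The reordering suspicion is easy to illustrate (with made-up data, not the actual dumps): a line-based diff of a tiny real edit stays tiny, but the same content merely moved around produces a diff nearly as large as the file itself.

```python
# Toy illustration of why reordered data wrecks line diffs:
# a single inserted line yields a tiny diff, while rotating the
# same 1000 lines yields a diff on the order of the whole file.
import difflib

lines = [f"<revision>{i}</revision>\n" for i in range(1000)]
edited = lines[:500] + ["<revision>new</revision>\n"] + lines[500:]  # one real change
reordered = lines[500:] + lines[:500]                                # same content, moved

small = list(difflib.unified_diff(lines, edited))
big = list(difflib.unified_diff(lines, reordered))
print(len(small), len(big))  # the reorder diff is vastly larger
```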
Xdelta (xdeltas are automatically gzipped):
current.xdelta 92 074 (8.3%)
full.xdelta 163 110 (4.2%)
gzip and xdelta were called with zlib compression level parameter -9
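In transfer terms (my arithmetic from the byte counts above): a mirror that already holds the previous dump can fetch the xdelta instead of the new compressed dump.

```python
# Bytes to transfer per update: new compressed dump vs. xdelta.
# Sizes are the byte counts reported above for the 20050814 dumps.
new_current_gz = 1_187_754
new_full_gz    = 4_213_036
current_xdelta =    92_074
full_xdelta    =   163_110

for name, dump, delta in [("current", new_current_gz, current_xdelta),
                          ("full", new_full_gz, full_xdelta)]:
    saving = 100 * (1 - delta / dump)
    print(f"{name}: fetch {delta} instead of {dump} bytes "
          f"({saving:.1f}% less to transfer)")
```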
$ time xdelta delta -9 20050713_pages_current.xml.gz 20050814_pages_current.xml.gz current.delta
real 0m1.023s
user 0m0.548s
sys 0m0.124s
$ time xdelta delta -9 20050713_pages_full.xml 20050814_pages_full.xml full.delta
real 0m3.749s
user 0m1.731s
sys 0m0.217s
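The principle xdelta relies on can be sketched as a copy/insert delta: the delta stores references to byte ranges of the old file plus the literal bytes that are new, and the receiver reconstructs the new file from its old copy. This is a toy stand-in built on Python's difflib, not xdelta's actual binary format:

```python
# Toy copy/insert delta in the spirit of xdelta: "copy" ops reference
# ranges of the old file, "insert" ops carry the literal new bytes.
from difflib import SequenceMatcher

def make_delta(old: bytes, new: bytes):
    ops = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, old, new).get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))        # reuse a range of the old dump
        else:
            ops.append(("insert", new[j1:j2]))  # literal new data only
    return ops

def apply_delta(old: bytes, ops) -> bytes:
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += old[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)

old = b"<page><text>Old quote</text></page>"
new = b"<page><text>Old quote, plus an edit</text></page>"
delta = make_delta(old, new)
assert apply_delta(old, delta) == new
```

Since only the changed bytes are stored literally, the delta stays small whenever consecutive dumps share most of their content.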
=== Conclusion ===
We should simply generate an xdelta every time a dump is generated.
It cuts the amount of data that needs to be transferred, is easy to
add to the backup script, is extremely fast, and xdelta is available
everywhere and standard enough.
I doubt other compression methods (DTD-aware XML deltas or whatever)
will be significantly better, and if they are, they will certainly
cost a lot more effort than simply using xdelta.