Switching the dump format from SQL to XML opens the possibility of
very easy incremental dumps.
I ran some tests on Polish Wikiquote database dumps (most
wikis have only a single XML dump so far; pl.q has two).
=== Raw data ===
Full dumps, compressed:
20050713_pages_current.xml.gz 1 105 966
20050814_pages_current.xml.gz 1 187 754
20050713_pages_full.xml.gz 3 796 056
20050814_pages_full.xml.gz 4 213 036
Full dumps, uncompressed:
20050713_pages_current.xml 3 874 225
20050814_pages_current.xml 4 158 544
20050713_pages_full.xml 30 882 758
20050814_pages_full.xml 33 708 595
Diffs - slow, and their compressed size saves hardly anything over the compressed dumps:
current.diff 524 191
current.diff.gz 134 817 (11.5%)
full.diff 28 371 190
full.diff.gz 3 859 489 (91.6%) - I suspect this terrible result is due to
the data being reordered between the two dumps
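The reordering suspicion is easy to illustrate (with made-up data, not the actual dumps): a line-based diff of a tiny real edit stays tiny, but the same content merely moved around produces a diff nearly as large as the file itself.

```python
# Toy illustration of why reordered data wrecks line diffs:
# a single inserted line yields a tiny diff, while rotating the
# same 1000 lines yields a diff on the order of the whole file.
import difflib

lines = [f"<revision>{i}</revision>\n" for i in range(1000)]
edited = lines[:500] + ["<revision>new</revision>\n"] + lines[500:]  # one real change
reordered = lines[500:] + lines[:500]                                # same content, moved

small = list(difflib.unified_diff(lines, edited))
big = list(difflib.unified_diff(lines, reordered))
print(len(small), len(big))  # the reorder diff is vastly larger
```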
Xdelta (xdeltas are automatically gzipped):
current.xdelta 92 074 (8.3%)
full.xdelta 163 110 (4.2%)
gzip and xdelta were called with zlib compression level parameter -9
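In transfer terms (my arithmetic from the byte counts above): a mirror that already holds the previous dump can fetch the xdelta instead of the new compressed dump.

```python
# Bytes to transfer per update: new compressed dump vs. xdelta.
# Sizes are the byte counts reported above for the 20050814 dumps.
new_current_gz = 1_187_754
new_full_gz    = 4_213_036
current_xdelta =    92_074
full_xdelta    =   163_110

for name, dump, delta in [("current", new_current_gz, current_xdelta),
                          ("full", new_full_gz, full_xdelta)]:
    saving = 100 * (1 - delta / dump)
    print(f"{name}: fetch {delta} instead of {dump} bytes "
          f"({saving:.1f}% less to transfer)")
```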
$ time xdelta delta -9 20050713_pages_current.xml.gz 20050814_pages_current.xml.gz current.delta
real 0m1.023s
user 0m0.548s
sys 0m0.124s
$ time xdelta delta -9 20050713_pages_full.xml 20050814_pages_full.xml full.delta
real 0m3.749s
user 0m1.731s
sys 0m0.217s
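The principle xdelta relies on can be sketched as a copy/insert delta: the delta stores references to byte ranges of the old file plus the literal bytes that are new, and the receiver reconstructs the new file from its old copy. This is a toy stand-in built on Python's difflib, not xdelta's actual binary format:

```python
# Toy copy/insert delta in the spirit of xdelta: "copy" ops reference
# ranges of the old file, "insert" ops carry the literal new bytes.
from difflib import SequenceMatcher

def make_delta(old: bytes, new: bytes):
    ops = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, old, new).get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))        # reuse a range of the old dump
        else:
            ops.append(("insert", new[j1:j2]))  # literal new data only
    return ops

def apply_delta(old: bytes, ops) -> bytes:
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += old[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)

old = b"<page><text>Old quote</text></page>"
new = b"<page><text>Old quote, plus an edit</text></page>"
delta = make_delta(old, new)
assert apply_delta(old, delta) == new
```

Since only the changed bytes are stored literally, the delta stays small whenever consecutive dumps share most of their content.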
=== Conclusion ===
We should simply generate an xdelta every time a dump is generated.
It cuts the amount of data that needs to be transferred, is easy to
add to the backup script, is extremely fast, and xdelta is available
everywhere and standard enough.
I doubt other compression methods (DTD-aware XML deltas or whatever)
will be significantly better, and if they are, they will certainly
cost a lot more effort than simply using xdelta.