Switching the dump format from SQL to XML makes very easy incremental dumps possible.
I did some tests on Polish Wikiquote database dumps (most wikis have only a single XML dump; pl.q has two).
=== Raw data ===
Full dumps, compressed:

 20050713_pages_current.xml.gz   1 105 966
 20050814_pages_current.xml.gz   1 187 754
 20050713_pages_full.xml.gz      3 796 056
 20050814_pages_full.xml.gz      4 213 036

Full dumps, uncompressed:

 20050713_pages_current.xml      3 874 225
 20050814_pages_current.xml      4 158 544
 20050713_pages_full.xml        30 882 758
 20050814_pages_full.xml        33 708 595
Diffs are slow and hardly compress anything:

 current.diff        524 191
 current.diff.gz     134 817 (11.5%)
 full.diff        28 371 190
 full.diff.gz      3 859 489 (91.6%)

I suspect this horrible result is due to the data being reordered between dumps.
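For reference, a minimal sketch of how these diff numbers can be reproduced; the exact options used in the test are not recorded, so the invocation below is an assumption:

 # assumed: plain line-based diff of the two uncompressed dumps
 $ diff 20050713_pages_current.xml 20050814_pages_current.xml > current.diff
 # compress the result at maximum zlib level, as in the figures above
 $ gzip -9 current.diff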
Xdelta (xdeltas are automatically gzipped):

 current.xdelta      92 074 (8.3%)
 full.xdelta        163 110 (4.2%)
Both gzip and xdelta were called with zlib compression level -9.
 $ time xdelta delta -9 20050713_pages_current.xml.gz 20050814_pages_current.xml.gz current.delta
 real    0m1.023s
 user    0m0.548s
 sys     0m0.124s

 $ time xdelta delta -9 20050713_pages_full.xml 20050814_pages_full.xml full.delta
 real    0m3.749s
 user    0m1.731s
 sys     0m0.217s
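The delta is only useful if the receiving side can apply it. A minimal sketch of reconstructing the new dump from the old one plus the delta, assuming xdelta 1.x argument order (patchfile, fromfile, tofile):

 # rebuild the 20050814 dump from the 20050713 dump and the delta
 $ xdelta patch current.delta 20050713_pages_current.xml.gz 20050814_pages_current.xml.gz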
=== Conclusion ===
We should simply generate an xdelta every time a dump is generated. It cuts the amount of data that needs to be transferred, is easy to add to the backup script, is extremely fast, is available everywhere, and is standard enough.
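As a minimal sketch of what adding it to the backup script could look like; the variable names and file layout here are hypothetical, and only the xdelta invocation itself matches the tests above:

 # hypothetical paths: PREV is the previous dump, CUR the dump just generated
 PREV=20050713_pages_current.xml.gz
 CUR=20050814_pages_current.xml.gz
 # emit a binary delta (automatically gzipped by xdelta) next to the new dump
 xdelta delta -9 "$PREV" "$CUR" "$CUR.xdelta"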
I doubt other compression methods (DTD-aware XML deltas or whatever) would be significantly better than that, and if they were, they would certainly cost a lot more effort than simply using xdelta.