On Tue, Aug 23, 2011 at 5:35 PM, Brion Vibber <brion@pobox.com> wrote: <snip>
Broadly speaking some sort of diff storage makes a lot of sense; especially if it doesn't require reproducing those diffs all the time. :)
But be warned that there are different needs and different ways of processing data; diffs again interfere with random access, as you need to be able to fetch adjacent items to reproduce the text. If you're just trundling along through the entire dump and applying diffs as you go to reconstruct the text, then you're basically doing what you already do when doing on-the-fly decompression of the .xml.bz2 or .xml.7z -- it may, or may not, actually save you anything for this case.
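To make the sequential-reconstruction point concrete, here is a minimal Python sketch of diff-based revision storage. The function names and the diff encoding (copy/insert opcodes from difflib) are illustrative, not how any dump tool actually encodes things; the point is just that reaching revision N means replaying every diff from the base text up to N.

```python
import difflib

def make_diff(old, new):
    """Record only the operations needed to turn `old` into `new`."""
    ops = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, old, new).get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))        # reuse a slice of the old text
        else:
            ops.append(("insert", new[j1:j2]))  # store only the new material

    return ops

def apply_diff(old, ops):
    """Replay a diff against the previous revision to rebuild the text."""
    parts = []
    for op in ops:
        if op[0] == "copy":
            parts.append(old[op[1]:op[2]])
        else:
            parts.append(op[1])
    return "".join(parts)

# A page history stored as a base text plus per-revision diffs.
revisions = ["Hello world.", "Hello, world!", "Hello, diff world!"]
base = revisions[0]
diffs = [make_diff(a, b) for a, b in zip(revisions, revisions[1:])]

# There is no random access: to get the latest revision you must
# replay every intermediate diff in order, just like streaming
# decompression of the .xml.bz2 dump.
text = base
for d in diffs:
    text = apply_diff(text, d)
assert text == revisions[-1]
```

That final loop is exactly the "trundling along through the entire dump" case: if you are walking the whole history anyway, the replay cost is comparable to on-the-fly decompression.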
Of course if all you really wanted was the diff, then obviously that's going to help you. :)
I've found that diff representations of the full history can knock off about 95% of the uncompressed size. Stacked with generic compressors such as bz2 and 7z, an intelligent differencing scheme still yields an improvement: .diff.7z is about 10-50% smaller than .xml.7z while representing the same content. As you note though, the trade-off is that you have to replay many diffs to reconstruct a page's content. Given that hard disks are cheap, the biggest advantage is probably for people whose main object of study is the diffs themselves.
-Robert Rohde