I attached a Python script that compresses/decompresses files in 10MB chunks, and stores info about block boundaries so you can read random parts of the file. It's set up to use rzip or xdelta3 (a different package from xdelta) for compression, so you'll want one or both. It's public domain, with no warranty. No docs either; the command-line syntax tries to be like gzip's.
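Roughly, the idea is: compress fixed-size chunks independently and remember where each compressed block starts. This isn't the attached script, just a minimal sketch of that chunk-plus-boundary-list idea, with zlib standing in for rzip/xdelta3 so it runs with nothing extra installed:

import zlib

BLOCK_SIZE = 10 * 1024 * 1024  # 10MB of uncompressed data per block

def compress_blocks(in_path, out_path):
    """Compress in_path one fixed-size chunk at a time and return the
    offset of each compressed block in out_path, so a reader can later
    jump straight to the block covering any uncompressed position."""
    offsets = []
    with open(in_path, 'rb') as inf, open(out_path, 'wb') as outf:
        while True:
            chunk = inf.read(BLOCK_SIZE)
            if not chunk:
                break
            offsets.append(outf.tell())
            outf.write(zlib.compress(chunk))
    return offsets

However the boundary info ends up stored, it's tiny next to the data itself, and it's what makes the random reads shown below cheap.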
Some sample invocations of the script with timing and output-size info are below my signature.
The list of caveats could be about as long as the script--performance, hackability/readability, ease of installation, and flexibility are all suboptimal. At a minimum, it's not safe for exchanging files without at least:

1) making the way it reads/writes binary numbers CPU-architecture-independent (Python's array.array('l').tostring() is not; see the struct sketch after this list),
2) adding a magic number, a file-format version, and a compression-type tag, so the format and algorithm can be upgraded gracefully,
3) handling errors better, like the ones you get when rzip or xdelta3 isn't installed, and
4) testing.

The -blksdev filename suffix it uses reflects that it's a development format, not production-ready.
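On caveat 1, the usual fix (not something the script does today) is to write the boundary numbers with an explicit size and byte order, e.g. via struct, rather than the platform's native long:

import struct

offsets = [0, 3141592, 6535897]  # made-up compressed-block offsets

# array.array('l').tostring() writes native-endian, native-sized longs,
# so the result only reads back correctly on a matching architecture.
# '<Q' pins the on-disk format to 8-byte little-endian unsigned ints.
packed = struct.pack('<%dQ' % len(offsets), *offsets)
assert struct.unpack('<%dQ' % len(offsets), packed) == tuple(offsets)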
Last post I said xdelta3 and rzip compressed histories pretty well pretty quickly, but didn't expand (ha!) on that at all. Both programs have a first stage that quickly compresses long repetitions like the ones you'll see in history files, at the cost of completely missing short-range redundancy. Then rzip uses bzip2, and xdelta3 can use its own compressor, to handle the short-range redundancy. Neither adds much value if your file doesn't have large long-range repetitions, which is why you don't often hear about them as general-purpose compressors.
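For a concrete sense of why that long-range first stage matters: conventional compressors only look back a short distance (deflate's window is 32KB), so a repetition even one megabyte away is invisible to them. A quick illustration:

import os, zlib

block = os.urandom(1024 * 1024)   # 1MB of incompressible data
doubled = block + block           # the same 1MB again, 1MB later

# deflate can only reference the previous 32KB, so it never notices that
# the second half is a byte-for-byte copy of the first, and the
# "compressed" output comes out about as big as the input.
print(len(zlib.compress(doubled)), len(doubled))

The first stage of rzip or xdelta3 would spot that copy even if the two halves were much farther apart, which is roughly what consecutive revisions in a history dump look like.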
Honestly, I don't know whether anything down this path will suit your needs. This exact script certainly doesn't--it just seemed like an interesting thing to mess around with.
Best,
Randall
The original file is enwiki-20130304-pages-meta-history22.xml-p018183504p018225000, 4.1G of history goodness.
File sizes and compression times--

  50M   rz-blks       2m17s  *
  76M   xd3djw-blks   1m47s  *
  89M   xd3-blks      1m38s

*'s are for the two options I find most interesting. Note you can stream data to/from blks.py, even though you can't stream to/from rzip.
For comparison, times and sizes without blocks:

  39M   7z    16m21s
  39M   rz    1m12s    (1m5s slower and 11M bigger with blocks)
  80M   xd3   1m       (38s slower and 9M bigger with blocks)
Specific command lines and timings:

# Compress using rzip
# -f = force overwrite of any existing file
# -k = keep original
$ time ~/blks.py -fk enwiki-20130304-pages-meta-history22.xml-p018183504p018225000
real    2m17.508s
user    0m58.584s
sys     0m55.983s
# Compress using xdelta3 -S djw
# -p picks your compressor
$ time ~/blks.py -pxd3djw -fk enwiki-20130304-pages-meta-history22.xml-p018183504p018225000
real    1m47.528s
user    0m39.402s
sys     0m44.051s
# Get some content crossing a 10M block boundary and make sure it matches the original
# --skip and --length say what part to read
# -d means to decompress
# -c means write to stdout
$ time ~/blks.py -dc --skip=9900000 --length=200000 enwiki-20130304-pages-meta-history22.xml-p018183504p018225000.rz-blks | md5sum
02122d6b4dc678ca138b680edd3b0067  -

real    0m0.358s

$ head --bytes=10100000 enwiki-20130304-pages-meta-history22.xml-p018183504p018225000 | tail --bytes=200000 | md5sum
02122d6b4dc678ca138b680edd3b0067  -
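Under the hood, that kind of partial read amounts to: figure out which blocks overlap the requested range, decompress only those, and slice. A sketch against the toy zlib format from earlier (again, not the attached script):

import zlib

BLOCK_SIZE = 10 * 1024 * 1024

def read_range(blk_path, offsets, skip, length):
    """Return `length` uncompressed bytes starting at `skip`,
    decompressing only the blocks that overlap that range."""
    first = skip // BLOCK_SIZE
    last = (skip + length - 1) // BLOCK_SIZE
    ends = offsets[1:] + [None]   # the last block runs to end of file
    pieces = []
    with open(blk_path, 'rb') as f:
        for i in range(first, last + 1):
            f.seek(offsets[i])
            if ends[i] is None:
                pieces.append(zlib.decompress(f.read()))
            else:
                pieces.append(zlib.decompress(f.read(ends[i] - offsets[i])))
    data = b''.join(pieces)
    return data[skip - first * BLOCK_SIZE:][:length]

That's why the boundary-crossing read above comes back in well under a second even though the whole file takes over a minute to decompress.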
# Decompress everything to stdout, and md5sum
$ time ~/blks.py -dc enwiki-20130304-pages-meta-history22.xml-p018183504p018225000.rz-blks | md5sum
10fa39684af636b55c4cb1649359ead5  -

real    1m22.579s

$ time md5sum enwiki-20130304-pages-meta-history22.xml-p018183504p018225000
10fa39684af636b55c4cb1649359ead5  enwiki-20130304-pages-meta-history22.xml-p018183504p018225000

real    0m36.323s