For storing updateable indexes, Berkeley DB 4-5, GDBM, and higher-level options like SQLite are widely used. LevelDB https://code.google.com/p/leveldb/ is pretty cool too.
I think that with the amount of data we're dealing with, it makes sense to have the file format under tight control. For example, saving a single byte on each revision adds up to a total saving of ~500 MB for enwiki.
In any case, at this point it would be more work to switch to one of those than to keep using the format I created.
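To put the one-byte example in numbers, here is a back-of-the-envelope sketch. The revision count is not an official statistic: it is only what the ~500 MB figure above implies, reading "MB" as 10^6 bytes (an assumption). The names are made up for illustration.

```python
# Rough arithmetic behind the "one byte per revision" example.
# Assumption: "~500 MB" means about 500 * 10**6 bytes.
BYTES_SAVED_PER_REVISION = 1
TOTAL_SAVINGS_BYTES = 500 * 10**6

# Revision count implied by the figure above (not a measured value).
implied_revisions = TOTAL_SAVINGS_BYTES // BYTES_SAVED_PER_REVISION


def total_savings_mb(bytes_per_revision: int, revisions: int = implied_revisions) -> float:
    """Total size change in MB (10**6 bytes) for a per-revision overhead change."""
    return bytes_per_revision * revisions / 10**6


if __name__ == "__main__":
    for overhead in (1, 4, 8):
        print(f"{overhead} byte(s) per revision: ~{total_savings_mb(overhead):.0f} MB")
```

The same scaling applies in the other direction: any fixed per-record overhead that a general-purpose store adds gets multiplied by the same revision count.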
For delta coding, there's xdelta3 http://xdelta.org/, open-vcdiff https://code.google.com/p/open-vcdiff/, and Git's http://stackoverflow.com/questions/9478023/is-the-git-binary-diff-algorithm-delta-storage-standardized delta https://github.com/git/git/blob/master/diff-delta.c code https://github.com/git/git/blob/master/patch-delta.c. (rzip http://rzip.samba.org/ and rsync are wicked awesome, but not as easy to just drop in as a library.)
I'm certainly going to try to use some library for delta compression, because they seem to do pretty much exactly what's needed here. Thanks for the suggestions.
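To illustrate the idea of storing a revision as a delta against the previous one, here is a minimal sketch. It does not use xdelta3, open-vcdiff, or Git's code; as a stand-in it uses zlib's preset-dictionary feature from the Python standard library, with the previous revision as the dictionary. The function names are hypothetical, and since deflate only looks back 32 KB, this would not be a replacement for a real delta library on large pages.

```python
import zlib


def encode_delta(base: bytes, target: bytes) -> bytes:
    """Compress `target`, using `base` (the previous revision) as a preset dictionary.

    Stand-in for a real delta coder: text shared with `base` is emitted as
    short back-references instead of being stored again.
    """
    comp = zlib.compressobj(level=9, zdict=base)
    return comp.compress(target) + comp.flush()


def decode_delta(base: bytes, delta: bytes) -> bytes:
    """Rebuild the revision from `base` and the output of encode_delta."""
    decomp = zlib.decompressobj(zdict=base)
    return decomp.decompress(delta) + decomp.flush()


if __name__ == "__main__":
    # Two made-up "revisions" that differ by one appended sentence.
    old_rev = " ".join(f"word{i}" for i in range(500)).encode()
    new_rev = old_rev + b" One more sentence added in the next revision."

    delta = encode_delta(old_rev, new_rev)
    assert decode_delta(old_rev, delta) == new_rev

    print("new revision:        ", len(new_rev), "bytes")
    print("compressed on its own:", len(zlib.compress(new_rev, 9)), "bytes")
    print("delta against old:    ", len(delta), "bytes")
```

Decoding needs exactly the same base bytes that were used for encoding, so whatever format stores the deltas also has to record which revision each one is relative to.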
Petr Onderka
wikitech-l@lists.wikimedia.org