For storing updateable indexes, Berkeley DB 4-5, GDBM, and higher-level
options like SQLite are widely used.
LevelDB <https://code.google.com/p/leveldb/> is pretty cool too.
I think that with the amount of data we're dealing with, it makes sense to
have the file format under tight control. For example, saving a single byte
on each revision means total savings of ~500 MB for enwiki.
In any case, at this point it would be more work to switch to one of those
than to keep using the format I created.
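The arithmetic behind the ~500 MB figure can be sketched as follows. The revision count is an assumption back-derived from the figure itself (one byte per revision giving ~500 MB implies roughly 500 million revisions); it is not a measured value, and the function name is purely illustrative.

```python
def total_savings_bytes(revision_count: int, bytes_saved_per_revision: int = 1) -> int:
    """Total bytes saved when every revision record shrinks by a fixed amount."""
    return revision_count * bytes_saved_per_revision


# Assumption: ~500 million revisions, implied by the ~500 MB estimate above.
ASSUMED_ENWIKI_REVISIONS = 500_000_000

savings_mb = total_savings_bytes(ASSUMED_ENWIKI_REVISIONS) / 10**6
print(f"~{savings_mb:.0f} MB saved per byte trimmed from each revision")
```

The same multiplication applies to any fixed-size field in the per-revision record, which is why small format decisions matter at this scale.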
For delta coding, there's xdelta3 <http://xdelta.org/>,
open-vcdiff <https://code.google.com/p/open-vcdiff/>, and
Git's <http://stackoverflow.com/questions/9478023/is-the-git-binary-d…
delta <https://github.com/git/git/blob/master/diff-delta.c>
code <https://github.com/git/git/blob/master/patch-delta.c>.
(rzip <http://rzip.samba.org/>/rsync are wicked awesome, but not as easy
to just drop in as a library.)
I'm certainly going to try to use some library for delta compression,
because they seem to do pretty much exactly what's needed here. Thanks for
the suggestions.
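As a rough illustration of what such a library does for consecutive revisions, here is a minimal sketch of copy/insert delta coding. It is not how xdelta3, open-vcdiff, or Git implement it: it uses Python's standard `difflib.SequenceMatcher` to find matching runs, and the op format and the names `make_delta` and `apply_delta` are hypothetical, chosen only for this example.

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes) -> list:
    """Describe `target` as a list of copy-from-base and insert-literal ops."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Bytes shared with the base: store only an offset and a length.
            ops.append(("copy", i1, i2 - i1))
        elif tag in ("replace", "insert"):
            # Bytes that are new in the target: store them literally.
            ops.append(("insert", target[j1:j2]))
        # "delete" needs no op: the deleted bytes are simply never copied.
    return ops


def apply_delta(base: bytes, ops: list) -> bytes:
    """Rebuild the target from the base and the ops produced by make_delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


old_revision = b"The quick brown fox jumps over the lazy dog."
new_revision = b"The quick red fox jumps over the very lazy dog."
delta = make_delta(old_revision, new_revision)
assert apply_delta(old_revision, delta) == new_revision
```

When two revisions differ by a small edit, most of the delta is short copy ops, so only the changed bytes need to be stored in full. A real library would also serialize the ops compactly and find matches much faster than this sketch does.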
Petr Onderka