Feel free to comment more. I am the one implementing the project, but that's all. Input from others is always welcome.

If I had one more (very late) comment, it would be to glance at existing libraries for building blocks.

For storing updateable indexes, Berkeley DB 4-5, GDBM, and higher-level options like SQLite are widely used. LevelDB is pretty cool too.
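
Just to make "building blocks" concrete, here's a rough sketch of what an updateable index could look like with Python's built-in sqlite3 module--the schema, table, and file names are all made up for illustration, not a suggestion for the actual format:

    import sqlite3

    # Hypothetical index: where each revision's (compressed) data lives in the
    # dump file.  Everything here is illustrative.
    db = sqlite3.connect("dump-index.sqlite")
    db.execute("""CREATE TABLE IF NOT EXISTS revisions (
                      rev_id  INTEGER PRIMARY KEY,
                      page_id INTEGER NOT NULL,
                      offset  INTEGER NOT NULL,
                      length  INTEGER NOT NULL)""")

    def add_revision(rev_id, page_id, offset, length):
        # INSERT OR REPLACE keeps the index updateable in place.
        db.execute("INSERT OR REPLACE INTO revisions VALUES (?, ?, ?, ?)",
                   (rev_id, page_id, offset, length))
        db.commit()

    def locate(rev_id):
        # Returns (offset, length), or None if the revision isn't indexed yet.
        return db.execute("SELECT offset, length FROM revisions WHERE rev_id = ?",
                          (rev_id,)).fetchone()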

For delta coding, there's xdelta3, open-vcdiff, and Git's delta code. (rzip/rsync are wicked awesome, but not as easy to just drop in as a library.) 
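
As a sketch of the kind of drop-in use I mean (file names invented; if the xdelta3 Python module builds for you, you could call it directly instead of shelling out), encoding one revision as a VCDIFF delta against the previous one looks roughly like:

    import subprocess

    # Encode: store rev 101 as a VCDIFF delta against rev 100.
    subprocess.check_call(
        ["xdelta3", "-e", "-s", "rev_100.txt", "rev_101.txt", "rev_101.vcdiff"])

    # Decode: reconstruct rev 101 from rev 100 plus the delta.
    subprocess.check_call(
        ["xdelta3", "-d", "-s", "rev_100.txt", "rev_101.vcdiff", "rev_101_restored.txt"])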

Some of the libraries are already used by a lot of projects, which helps make their formats a sort of de facto standard--i.e., we'll need tools to read BDB and SQLite files for a long time. That might help slightly with concerns some folks raised about relying on a new binary format for long-term archival. BDB 4, GDBM, and SQLite3 modules are part of the Python distribution, for example, and xdelta3 reads/writes VCDIFF (a format also used by Chrome) and is supposed to be buildable as a Python module too.

Hope this helps.

Randall


On Wed, Jul 10, 2013 at 6:34 AM, Petr Onderka <gsvick@gmail.com> wrote:
On Mon, Jul 8, 2013 at 6:53 AM, Randall Farmer <randall@wawd.com> wrote:
> Keeping the dumps in a text-based format doesn't make sense, because that can't be updated efficiently, which is the whole reason for the new dumps.

First, glad to see there's motion here.

It's definitely true that recompressing the entire history to .bz2 or .7z goes very, very slowly. Also, I don't know of an existing tool that lets you just insert new data here and there without compressing all of the unchanged data as well. Those point towards some sort of format change.

I'm not sure a new format has to be sparse or indexed to get around those two big problems.

For full-history dumps, delta coding (or the related idea of long-range redundancy compression) runs faster than bzip2 or 7z and still produces good compression ratios, based on some tests. (I'm going to focus mostly on full-history dumps here because they're the hard case and the one Ariel said is currently painful--not everything here will apply to latest-revs dumps.)
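
As a toy illustration of why, using only the standard library (the text and numbers here are made up, not from real dumps): consecutive revisions of a page are nearly identical, so a compressed diff per revision is a small fraction of the compressed full text:

    import difflib, zlib

    rev1 = "".join("Line %d of the article.\n" % i for i in range(2000))
    # rev2: one small edit somewhere in the middle, like a typical wiki edit.
    rev2 = rev1.replace("Line 1000 of the article.",
                        "Line 1000 of the article, now edited.")

    full_compressed = zlib.compress(rev2.encode("utf-8"))
    diff_text = "".join(difflib.unified_diff(rev1.splitlines(True),
                                             rev2.splitlines(True)))
    delta_compressed = zlib.compress(diff_text.encode("utf-8"))

    # The compressed delta is far smaller than the compressed full revision.
    print(len(full_compressed), len(delta_compressed))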

For inserting data, you do seemingly need to break the file up into independently-compressed sections containing just one page's revision history or a fragment of it, so you can add new diff(s) to a page's revision history without decompressing and recompressing the previous revisions. (Removing previously-dumped revisions is another story, but it's rarer.) You'd be in new territory just doing that; I don't know of any existing compression tool that really allows it.
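
To sketch the idea (everything here--the file names, the per-page chunk layout, the JSON index--is invented purely to illustrate, and gzip just stands in for whatever compressor you'd actually pick):

    import gzip, json, os

    DATA, INDEX = "history.chunks", "history.index.json"

    def append_chunk(page_id, revisions_xml):
        # Compress one page's new revisions as an independent chunk at the
        # end of the data file; existing compressed data is never touched.
        index = json.load(open(INDEX)) if os.path.exists(INDEX) else {}
        blob = gzip.compress(revisions_xml.encode("utf-8"))
        offset = os.path.getsize(DATA) if os.path.exists(DATA) else 0
        with open(DATA, "ab") as f:
            f.write(blob)
        index.setdefault(str(page_id), []).append([offset, len(blob)])
        with open(INDEX, "w") as f:
            json.dump(index, f)

    def read_page(page_id):
        # Reassemble one page's history by decompressing only its own chunks.
        index = json.load(open(INDEX))
        parts = []
        with open(DATA, "rb") as f:
            for offset, length in index.get(str(page_id), []):
                f.seek(offset)
                parts.append(gzip.decompress(f.read(length)).decode("utf-8"))
        return "".join(parts)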

You could do those two things, though, while still keeping full-history dumps a once-every-so-often batch process that produces a sorted file. The time to rewrite the file, stripped of the big compression steps, could be bearable--a disk can read or write about 100 MB/s, so just copying the roughly 70 GB of the .7z enwiki dumps is well under an hour; if the part bound by CPU and other steps is smallish, you're OK.
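
The back-of-the-envelope arithmetic, spelled out (rough numbers only):

    dump_size_mb = 70 * 1000   # ~70 GB of compressed full-history dumps
    throughput_mb_s = 100      # rough sequential HDD throughput
    print(dump_size_mb / throughput_mb_s / 60)   # ~11.7 minutes per sequential pass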

A format like the proposed one, with revisions inserted wherever there's free space when they come in, will also eventually fragment the revision history for one page (I think Ariel alluded to this in some early notes). Unlike sequential reads/writes, seeks are something HDDs are sadly pretty slow at (hence the excitement about solid-state disks); if thousands of revisions are coming in per day, it eventually becomes slow to read things back in the old page/revision order, and you either need fancy techniques to defrag (maybe a big external-memory sort) or you need to read the dump only on fast hardware that can handle the seeks. Doing occasional batch jobs that produce sorted files could help avoid the fragmentation question.
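
Continuing the same toy sketch from above (same invented file names; a real defrag over terabytes would want an external-memory sort rather than an in-memory index like this), an occasional batch pass could rewrite the chunks in page order so later reads become sequential again:

    import json, os

    DATA, INDEX = "history.chunks", "history.index.json"   # same invented names as above

    def defragment():
        # Batch rewrite: lay out every page's chunks contiguously, in page order.
        index = json.load(open(INDEX))
        new_index = {}
        with open(DATA, "rb") as old, open(DATA + ".tmp", "wb") as new:
            for page_id in sorted(index, key=int):
                for offset, length in index[page_id]:
                    old.seek(offset)
                    new_index.setdefault(page_id, []).append([new.tell(), length])
                    new.write(old.read(length))
        os.replace(DATA + ".tmp", DATA)
        with open(INDEX, "w") as f:
            json.dump(new_index, f)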

These are some interesting ideas.

You're right that copying the whole dump is fast enough (it would probably add about an hour to a process that currently takes several days). But it would also pretty much force the use of delta compression. And while I would like to use delta compression, I don't think it's a good idea to be forced into it, because I might not have the time for it, or it might not be good enough.

Because of that, I decided to stay with my indexed approach.
 
There's a great quote about the difficulty of "constructing a software design...to make it so simple that there are obviously no deficiencies." (Wikiquote came through with the full text/attribution, of course.) I admit it's tricky, and people can disagree about what's simple enough or even which of two approaches is the simpler one, but it's something to strive for.

Anyway, I'm wary of going into the technical weeds of other folks' projects, because, hey, it's your project! I'm just trying to map out the options in the hope that you end up with a product you're happier with, and maybe get more time in a tight three-month schedule to improve on your work rather than just complete it. Whatever you do, good luck, and I'm interested to see the results!

Feel free to comment more. I am the one implementing the project, but that's all. Input from others is always welcome.

Petr Onderka