> Feel free to comment more. I am the one implementing the project, but
> that's all. Input from others is always welcome.
If I had one more (very late) comment, it would be to glance at existing
libraries for building blocks.
For storing updateable indexes, Berkeley DB 4-5, GDBM, and higher-level
options like SQLite are widely used.
LevelDB <https://code.google.com/p/leveldb/> is pretty cool too.
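
(To make "updateable" concrete, here's a toy sketch using Python's
built-in sqlite3 module; the schema and every name in it are invented
purely for illustration, not a proposal:)

    # Toy sketch: an updateable page-offset index kept in SQLite.
    import sqlite3

    db = sqlite3.connect("dump-index.sqlite")
    db.execute("""CREATE TABLE IF NOT EXISTS page_index (
                      page_id INTEGER PRIMARY KEY,
                      offset  INTEGER NOT NULL,
                      length  INTEGER NOT NULL)""")

    # Insert or update in place -- no need to rewrite the whole index.
    db.execute("INSERT OR REPLACE INTO page_index VALUES (?, ?, ?)",
               (12345, 987654321, 4096))
    db.commit()

    offset, length = db.execute(
        "SELECT offset, length FROM page_index WHERE page_id = ?",
        (12345,)).fetchone()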
For delta coding, there's xdelta3 <http://xdelta.org/>,
open-vcdiff <https://code.google.com/p/open-vcdiff/>, and Git's
<http://stackoverflow.com/questions/9478023/is-the-git-binary-d…>
delta <https://github.com/git/git/blob/master/diff-delta.c>
code <https://github.com/git/git/blob/master/patch-delta.c>.
(rzip <http://rzip.samba.org/>/rsync are wicked awesome, but not as easy to
just drop in as a library.)
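
(For a feel for the tooling, xdelta3's command-line encode/decode can be
driven from a script like this -- the file names are placeholders:)

    # Rough sketch: producing and applying a VCDIFF delta with the
    # xdelta3 command-line tool. File names are placeholders.
    import subprocess

    # Encode: a delta that transforms the old revision into the new one.
    subprocess.check_call(
        ["xdelta3", "-e", "-f", "-s", "rev-old.txt", "rev-new.txt",
         "rev.vcdiff"])

    # Decode: reconstruct the new revision from the old one plus the delta.
    subprocess.check_call(
        ["xdelta3", "-d", "-f", "-s", "rev-old.txt", "rev.vcdiff",
         "rev-new-out.txt"])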
Some of the libraries are already used by a lot of projects, which helps
make their formats a sort of de-facto standard--i.e., we'll need tools to
read BDB and SQLite for a long time. That might help slightly with concerns
some folks raised about relying on a new binary format for long-term
archival. BDB 4, GDBM, and SQLite3 modules are part of the Python
distribution, for example, and xdelta3 reads/writes VCDIFF (a format
<http://tools.ietf.org/html/rfc3284> also used
<http://en.wikipedia.org/wiki/Shared_Dictionary_Compression_Over_HTT…>
by Chrome) and is supposed to be buildable
<https://code.google.com/p/xdelta/wiki/LanguageInterface> as a module too.
Hope this helps.
Randall
On Wed, Jul 10, 2013 at 6:34 AM, Petr Onderka <gsvick(a)gmail.com> wrote:
> On Mon, Jul 8, 2013 at 6:53 AM, Randall Farmer <randall(a)wawd.com> wrote:
>
>> > Keeping the dumps in a text-based format doesn't make sense, because
>> that can't be updated efficiently, which is the whole reason for the new
>> dumps.
>>
>> First, glad to see there's motion here.
>>
>> It's definitely true that recompressing the entire history to .bz2 or .7z
>> goes very, very slowly. Also, I don't know of an existing tool that lets
>> you just insert new data here and there without compressing all of the
>> unchanged data as well. Those point towards some sort of format change.
>>
>> I'm not sure a new format has to be sparse or indexed to get around those
>> two big problems.
>>
>> For full-history dumps, delta coding (or the related idea of long-range
>> redundancy compression) runs faster than bzip2 or 7z and produces good
>> compression ratios on full-history dumps, based on some tests
>> <https://www.mediawiki.org/wiki/Dbzip2#rzip_and_xdelta3>. (I'm going to
>> focus mostly on full-history dumps here because they're
>> the hard case and one Ariel said is currently painful--not everything here
>> will apply to latest-revs dumps.)
>>
>> For inserting data, you do seemingly need to break the file up into
>> independently-compressed sections containing just one page's revision
>> history or a fragment of it, so you can add new diff(s) to a page's
>> revision history without decompressing and recompressing the previous
>> revisions. (Removing previously-dumped revisions is another story, but it's
>> rarer.) You'd be in new territory just doing that; I don't know of
>> existing compression tools that really allow that.
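>>
>> (A toy sketch of the kind of layout I mean -- one independently
>> compressed bz2 block per chunk of a page's history, append-only;
>> everything here is invented purely for illustration:)
>>
>>     # Each chunk of a page's revision history is its own bz2 block,
>>     # so appending new revisions never recompresses old blocks.
>>     import bz2
>>
>>     def append_chunk(dump_path, index, page_id, revisions_text):
>>         block = bz2.compress(revisions_text.encode("utf-8"))
>>         with open(dump_path, "ab") as f:   # append-only writes
>>             offset = f.tell()
>>             f.write(block)
>>         index.setdefault(page_id, []).append((offset, len(block)))
>>
>>     def read_page(dump_path, index, page_id):
>>         parts = []
>>         with open(dump_path, "rb") as f:
>>             for offset, length in index.get(page_id, []):
>>                 f.seek(offset)
>>                 parts.append(bz2.decompress(f.read(length)).decode("utf-8"))
>>         return "".join(parts)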
>>
>> You could do those two things, though, while still keeping full-history
>> dumps a once-every-so-often batch process that produces a sorted file. The
>> time to rewrite the file, stripped of the big compression steps, could be
>> bearable--a disk can read or write about 100 MB/s, so just copying the 70G
>> of the .7z enwiki dumps is well under an hour; if the part bound by CPU and
>> other steps is smallish, you're OK.
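>> (Back-of-the-envelope: 70 GB at ~100 MB/s is about 700 seconds, i.e.,
>> roughly 12 minutes each for the read pass and the write pass.)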
>>
>> A format like the proposed one, with revisions inserted wherever there's
>> free space when they come in, will also eventually fragment the revision
>> history for one page (I think Ariel alluded to this in some early notes).
>> Unlike sequential read/writes, seeks are something HDDs are sadly pretty
>> slow at (hence the excitement about solid-state disks); if thousands of
>> revisions are coming in a day, it eventually becomes slow to read things in
>> the old page/revision order, and you need fancy techniques to defrag (maybe
>> a big external-memory sort <http://en.wikipedia.org/wiki/External_sorting>)
>> or you need to only read the dump on fast hardware that can handle the
>> seeks. Doing occasional batch jobs that produce sorted files could help
>> avoid the fragmentation question.
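>>
>> (The external-sort idea, very roughly: write sorted runs, then k-way
>> merge them -- Python's heapq.merge can do the merge step. The record
>> and run formats below are invented for illustration:)
>>
>>     # Merge presorted runs of (page_id, rev_id, data) records back
>>     # into page/revision order.
>>     import heapq, pickle
>>
>>     def read_run(path):
>>         with open(path, "rb") as f:
>>             while True:
>>                 try:
>>                     yield pickle.load(f)   # (page_id, rev_id, data)
>>                 except EOFError:
>>                     return
>>
>>     def merge_runs(run_paths, out_path):
>>         with open(out_path, "wb") as out:
>>             for record in heapq.merge(*(read_run(p) for p in run_paths)):
>>                 pickle.dump(record, out)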
>>
>
> These are some interesting ideas.
>
> You're right that copying the whole dump is fast enough (it would
> probably add about an hour to a process that currently takes several days).
> But it would also pretty much force the use of delta compression. And while
> I would like to use delta compression, I don't think it's a good idea to be
> forced to use it, because I might not have the time for it or it might not
> be good enough.
>
> Because of that, I decided to stay with my indexed approach.
>
>
>> There's a great quote about the difficulty of "constructing a software
>> design...to make it so simple that there are obviously no deficiencies."
>> (Wikiquote came through with the full text/attribution, of course
>> <http://en.wikiquote.org/wiki/C._A._R._Hoare>.)
>> I admit it's tricky and people can disagree about what's simple enough, or
>> even which of two approaches is simpler, but it's something to strive
>> for.
>>
>> Anyway, I'm wary about going into the technical weeds of other folks'
>> projects, because, hey, it's your project! I'm trying to map out the
>> options in the hope that you could get a product you're happier with and
>> maybe give you more time in a tight three-month schedule to improve on your
>> work and not just complete it. Whatever you do, good luck and I'm
>> interested to see the results!
>>
> Feel free to comment more. I am the one implementing the project, but
> that's all. Input from others is always welcome.
>
> Petr Onderka
>