Reviving an old thread--this is at a really early stage, but a new
(de)compressor by Google named Brotli could someday be useful for packing
history dumps:
Google released it after W3C ran into trouble trying to standardize a
super-compressed font format around LZMA. Brotli's based on flate, with
low-level encoding tune-ups and more effort spent predicting likely next
bytes based on context, and (notably for history dumps) its history window
is 4MB, not 32KB. There's a (draft) spec, and there'll be very well-vetted
library code in Chrome.
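To see why the history-window size matters for dumps, here's a small sketch. It doesn't use Brotli itself (no stdlib bindings); Python's zlib stands in for flate's 32KB window and lzma for a multi-MB dictionary, and the data is synthetic:

```python
import os, zlib, lzma

# Synthetic "history dump": the same 64 KB revision repeated, separated
# by ~1 MB of incompressible filler, so the repeat lies far outside
# DEFLATE's 32 KB window but well inside a multi-MB one.
revision = os.urandom(64 * 1024)
filler = os.urandom(1024 * 1024)
data = revision + filler + revision

small_window = len(zlib.compress(data, 9))   # 32 KB history window
big_window = len(lzma.compress(data))        # 8 MB dictionary (preset 6)

print(small_window, big_window)
# The LZMA output should be roughly one revision smaller, because the
# second copy of `revision` is coded as a single long-range match.
```

The random filler means neither compressor can shrink the bulk of the input; the whole difference comes from whether the second copy of the revision is within reach.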
Two unknowns are 1) whether there will be a fast compressor (don't know the
compression speed, and it just wasn't a priority for the Web-font use case)
and 2) whether brotli will get a standard framing format and widely-used
tools in the way that flate has gzip. But I thought it was interesting enough
to pass on even if it's too early to guess about all that right now.
On Sat, Mar 8, 2014 at 1:14 PM, Randall Farmer <randall(a)wawd.com> wrote:
I see you got more pointers there. :) Did you manage to explore them?
The blocker is that I didn't hear much interest from dump folks in a
non-7z archive format even if it boosted compression speed a lot. Of the
packers Bulat replied with (zpaq, exdupe, pcompress, his own srep), exdupe
and srep explicitly promise fast deduping and give numbers, so they'd be
the most obvious to look at if you've got a use case. Here are the crib
notes on how I'd look at them:
Long-range compressors are often set up to find kilobytes-long repeats
over 100MB+ distances. rzip and lrzip are like that, for example. That's
because that's what you need for, e.g., deduping copies of the same large
file across a backup. But when you support a much longer window than you
need, you pay with some combination of RAM use, inability to stream
input/output (because you need random reads from the history if it doesn't
fit in RAM), or compression ratio (because you miss shorter matches).
That's why the original rzip wasn't an ideal drop-in for 7zip for Wiki
full-history dumps, though it did very well on benchmarks that used small
pieces of a dump.
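For a mechanical sense of the RAM tradeoff above, here's a toy long-range dedup sketch. Real tools like rzip or srep use rolling hashes and variable-length matches; this fixed-block version is mine, and its point is just that the hash table has to index the *whole* history, not a small window:

```python
import hashlib

def dedupe(data: bytes, block: int = 4096):
    """Toy long-range dedup: replace repeated fixed-size blocks with
    back-references to their first occurrence. The `seen` table grows
    with the full input, which is where the RAM cost comes from."""
    seen = {}   # block hash -> offset of first occurrence
    out = []
    for off in range(0, len(data), block):
        chunk = data[off:off + block]
        h = hashlib.sha1(chunk).digest()
        if h in seen and len(chunk) == block:
            out.append(("ref", seen[h]))
        else:
            seen.setdefault(h, off)
            out.append(("lit", chunk))
    return out

chunks = dedupe(b"spamspam" * 1024)   # two identical 4 KB blocks
print(chunks[1])                      # -> ('ref', 0)
```

Cap the `seen` table at a fixed size and you get the other side of the tradeoff: bounded RAM, but repeats beyond the window are missed.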
So the things I'd look at re: other long-range compressors are whether
they can stream input/output (and so fit in existing dump/load flows) and
whether they do well fed an actual many-GBs chunk of dump (in output size,
ratio, and RAM/CPU use). Of course, you might have flexibility on some of
those axes, e.g., if you have no problem dropping input/output streaming.
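On the streaming question, the test is just whether the tool can consume input and produce output incrementally, without buffering gigabytes. A minimal illustration using Python's incremental LZMA interface as a stand-in for whatever compressor is under test:

```python
import lzma

# Feed "pages" to the compressor one batch at a time, the way a dump
# pipeline would, instead of materializing the whole input first.
comp = lzma.LZMACompressor()
out = bytearray()
for _ in range(100):                       # pretend these are dump pages
    out += comp.compress(b"<page>...</page>\n" * 1000)
out += comp.flush()

decompressed = lzma.decompress(bytes(out))
print(len(decompressed), "bytes round-tripped")
```

A compressor that only offers a whole-file API (or needs random reads back into its history) can't slot into a pipe like this, whatever its ratio.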
Hope this helps,
Randall
On Sat, Mar 8, 2014 at 6:53 AM, Federico Leva (Nemo) <nemowiki(a)gmail.com> wrote:
Randall Farmer, 21/01/2014 23:26:
Trying to get quick-and-dirty long-range matching into LZMA isn't
feasible for me personally and there may be inherent technical
difficulties. Still, I left a note on the 7-Zip boards as folks
suggested; feel free to add anything there:
https://sourceforge.net/p/sevenzip/discussion/45797/thread/73ed3ad7/
I see you got more pointers there. :) Did you manage to explore them?
Nemo