Reviving an old thread--this is at a really early stage, but a new
(de)compressor by Google named Brotli could someday be useful for packing
history dumps:
Google released it after W3C ran into trouble trying to standardize a
super-compressed font format around LZMA. Brotli's based on flate, with
low-level encoding tune-ups and more effort spent predicting likely next
bytes based on context, and (notably for history dumps) its history window
is 4MB, not 32KB. There's a (draft) spec, and there'll be very well-vetted
library code in Chrome.
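To see why the history-window size matters for dumps, here's a small sketch. It doesn't use Brotli itself (no stdlib bindings); Python's zlib stands in for flate's 32KB window and lzma for a multi-MB dictionary, and the data is synthetic:

```python
import os, zlib, lzma

# Synthetic "history dump": the same 64 KB revision repeated, separated
# by ~1 MB of incompressible filler, so the repeat lies far outside
# DEFLATE's 32 KB window but well inside a multi-MB one.
revision = os.urandom(64 * 1024)
filler = os.urandom(1024 * 1024)
data = revision + filler + revision

small_window = len(zlib.compress(data, 9))   # 32 KB history window
big_window = len(lzma.compress(data))        # 8 MB dictionary (preset 6)

print(small_window, big_window)
# The LZMA output should be roughly one revision smaller, because the
# second copy of `revision` is coded as a single long-range match.
```

The random filler means neither compressor can shrink the bulk of the input; the whole difference comes from whether the second copy of the revision is within reach.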
Two unknowns are 1) whether there will be a fast compressor (don't know the
compression speed, and it just wasn't a priority for the Web-font use case)
and 2) whether brotli will get a standard framing format and widely-used
tools in the way that flate has gzip. But I thought it was interesting enough
to pass on even if it's too early to guess about all that right now.
On Sat, Mar 8, 2014 at 1:14 PM, Randall Farmer <randall(a)wawd.com> wrote:
I see you got more pointers there. :) Did you manage to explore them?
The blocker is that I didn't hear much interest from dump folks in a
non-7z archive format even if it boosted compression speed a lot. Of the
packers Bulat replied with (zpaq, exdupe, pcompress, his own srep), exdupe
and srep explicitly promise fast deduping and give numbers, so they'd be
the most obvious to look at if you've got a use case. Here are the crib
notes on how I'd look at them:
Long-range compressors are often set up to find kilobytes-long repeats
over 100MB+ distances. rzip and lrzip are like that, for example. That's
because that's what you need for, e.g., deduping copies of the same large
file across a backup. But when you support a much longer window than you
need, you pay with some combination of RAM use, inability to stream
input/output (because you need random reads from the history if it doesn't
fit in RAM), or compression ratio (because you miss shorter matches).
That's why the original rzip wasn't an ideal drop-in for 7zip for Wiki
full-history dumps, though it did very well on benchmarks that used small
pieces of a dump.
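For a mechanical sense of the RAM tradeoff above, here's a toy long-range dedup sketch. Real tools like rzip or srep use rolling hashes and variable-length matches; this fixed-block version is mine, and its point is just that the hash table has to index the *whole* history, not a small window:

```python
import hashlib

def dedupe(data: bytes, block: int = 4096):
    """Toy long-range dedup: replace repeated fixed-size blocks with
    back-references to their first occurrence. The `seen` table grows
    with the full input, which is where the RAM cost comes from."""
    seen = {}   # block hash -> offset of first occurrence
    out = []
    for off in range(0, len(data), block):
        chunk = data[off:off + block]
        h = hashlib.sha1(chunk).digest()
        if h in seen and len(chunk) == block:
            out.append(("ref", seen[h]))
        else:
            seen.setdefault(h, off)
            out.append(("lit", chunk))
    return out

chunks = dedupe(b"spamspam" * 1024)   # two identical 4 KB blocks
print(chunks[1])                      # -> ('ref', 0)
```

Cap the `seen` table at a fixed size and you get the other side of the tradeoff: bounded RAM, but repeats beyond the window are missed.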
So the things I'd look at re: other long-range compressors are whether
they can stream input/output (and so fit in existing dump/load flows) and
whether they do well fed an actual many-GBs chunk of dump (in output size,
ratio, and RAM/CPU use). Of course, you might have flexibility on some of
those axes, e.g., if you have no problem dropping input/output streaming.
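On the streaming question, the test is just whether the tool can consume input and produce output incrementally, without buffering gigabytes. A minimal illustration using Python's incremental LZMA interface as a stand-in for whatever compressor is under test:

```python
import lzma

# Feed "pages" to the compressor one batch at a time, the way a dump
# pipeline would, instead of materializing the whole input first.
comp = lzma.LZMACompressor()
out = bytearray()
for _ in range(100):                       # pretend these are dump pages
    out += comp.compress(b"<page>...</page>\n" * 1000)
out += comp.flush()

decompressed = lzma.decompress(bytes(out))
print(len(decompressed), "bytes round-tripped")
```

A compressor that only offers a whole-file API (or needs random reads back into its history) can't slot into a pipe like this, whatever its ratio.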
Hope this helps,
Randall
On Sat, Mar 8, 2014 at 6:53 AM, Federico Leva (Nemo) <nemowiki(a)gmail.com> wrote:
Randall Farmer, 21/01/2014 23:26:
Trying to get quick-and-dirty long-range matching into LZMA isn't
feasible for me personally and there may be inherent technical
difficulties. Still, I left a note on the 7-Zip boards as folks
suggested; feel free to add anything there:
https://sourceforge.net/p/sevenzip/discussion/45797/thread/73ed3ad7/
I see you got more pointers there. :) Did you manage to explore them?
Nemo