Ack, sorry for the (no subject); again in the right thread:
For external uses like XML dumps integrating the compression strategy into LZMA would however be very attractive. This would also benefit other users of LZMA compression like HBase.
For dumps or other uses, 7za -mx=3 / xz -3 is your best bet.
That has a 4 MB buffer, compression ratios within 15-25% of current 7zip (or histzip), and goes at 30MB/s on my box, which is still 8x faster than the status quo (going by a 1GB benchmark).
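If you want to check numbers like these on your own box, here is a minimal sketch using Python's standard `lzma` module, which exposes the same presets as `xz`. The `measure` helper and the synthetic "revisions" sample are made up for illustration; they are not real dump data, so expect different ratios and speeds on an actual dump.

```python
import lzma
import time


def measure(data: bytes, preset: int) -> dict:
    """Compress data at the given xz preset and report ratio and speed."""
    start = time.perf_counter()
    packed = lzma.compress(data, format=lzma.FORMAT_XZ, preset=preset)
    elapsed = time.perf_counter() - start
    return {
        "preset": preset,
        "ratio": len(packed) / len(data),  # smaller is better
        "mb_per_s": len(data) / 1e6 / elapsed if elapsed > 0 else float("inf"),
        "packed": packed,
    }


# Hypothetical stand-in for page history: many near-identical revisions.
revision = (
    b"<revision><text>"
    + b"Some article text that changes only slightly. " * 200
    + b"</text></revision>\n"
)
sample = b"".join(
    revision.replace(b"slightly", str(i).encode()) for i in range(100)
)

if __name__ == "__main__":
    for preset in (3, 9):
        r = measure(sample, preset)
        print(f"preset {r['preset']}: ratio {r['ratio']:.4f}, "
              f"{r['mb_per_s']:.1f} MB/s")
```

Swapping `sample` for a real many-GB chunk read from disk is what would make the comparison meaningful.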
Trying to get quick-and-dirty long-range matching into LZMA isn't feasible for me personally and there may be inherent technical difficulties. Still, I left a note on the 7-Zip boards as folks suggested; feel free to add anything there: https://sourceforge.net/p/sevenzip/discussion/45797/thread/73ed3ad7/
Thanks for the reply, Randall
Randall Farmer, 21/01/2014 23:26:
Trying to get quick-and-dirty long-range matching into LZMA isn't feasible for me personally and there may be inherent technical difficulties. Still, I left a note on the 7-Zip boards as folks suggested; feel free to add anything there: https://sourceforge.net/p/sevenzip/discussion/45797/thread/73ed3ad7/
I see you got more pointers there. :) Did you manage to explore them?
Nemo
I see you got more pointers there. :) Did you manage to explore them?
The blocker is that I didn't hear much interest from dump folks in a non-7z archive format even if it boosted compression speed a lot. Of the packers Bulat replied with (zpaq, exdupe, pcompress, his own srep), exdupe and srep explicitly promise fast deduping and give numbers, so they'd be the most obvious to look at if you've got a use case. Here are the crib notes on how I'd look at them:
Long-range compressors are often set up to find kilobytes-long repeats over 100MB+ distances. rzip and lrzip are like that, for example. That's because that's what you need for, e.g., deduping copies of the same large file across a backup. But when you support a much longer window than you need, you pay with some combination of RAM use, inability to stream input/output (because you need random reads from the history if it doesn't fit in RAM), or compression ratio (because you miss shorter matches). That's why the original rzip wasn't an ideal drop-in for 7zip for Wiki full-history dumps, though it did very well on benchmarks that used small pieces of a dump.
So the things I'd look at re: other long-range compressors are whether they can stream input/output (and so fit in existing dump/load flows) and whether they do well fed an actual many-GBs chunk of dump (in output size, ratio, and RAM/CPU use). Of course, you might have flexibility on some of those axes, e.g., if you have no problem dropping input/output streaming.
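As a rough illustration of the streaming criterion, here is a sketch of what "fits in existing dump/load flows" looks like for a compressor that can stream: input is read in fixed-size chunks and output is written as it is produced, so neither side has to be held in memory or seeked. It uses Python's standard `lzma.LZMACompressor` as a stand-in; `stream_compress` and its parameters are names I made up, and a long-range packer would need an equivalent incremental interface to pass this test.

```python
import io
import lzma


def stream_compress(src, dst, preset=3, chunk_size=1 << 20):
    """Compress src to dst chunk by chunk; return (bytes_in, bytes_out)."""
    comp = lzma.LZMACompressor(format=lzma.FORMAT_XZ, preset=preset)
    bytes_in = bytes_out = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        bytes_in += len(chunk)
        out = comp.compress(chunk)  # may be empty until enough input arrives
        dst.write(out)
        bytes_out += len(out)
    tail = comp.flush()  # emit whatever the compressor is still buffering
    dst.write(tail)
    bytes_out += len(tail)
    return bytes_in, bytes_out


if __name__ == "__main__":
    # In-memory stand-ins for a pipe; real use would pass file objects.
    src = io.BytesIO(b"<page>example</page>\n" * 50000)
    dst = io.BytesIO()
    n_in, n_out = stream_compress(src, dst)
    print(f"{n_in} bytes in, {n_out} bytes out")
```

Running the same loop over a real many-GB chunk while watching RAM and CPU covers the other half of the checklist.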
Hope this helps, Randall
On Sat, Mar 8, 2014 at 6:53 AM, Federico Leva (Nemo) nemowiki@gmail.com wrote:
[...]
Reviving an old thread--this is at a really early stage, but a new (de)compressor by Google named Brotli could someday be useful for packing history dumps:
https://docs.google.com/presentation/d/1aigINmRR7fw_ml8rz0rJ3NTv08Qb3n6lZ_qvmxo8CzQ/present#slide=id.ge4739a87_10
https://code.google.com/p/font-compression-reference/source/browse/#git%2Fbr...
Google released it after W3C ran into trouble trying to standardize a super-compressed font format around LZMA. Brotli's based on flate, with low-level encoding tune-ups and more effort spent predicting likely next bytes based on context, and (notably for history dumps) its history window is 4MB, not flate's 32KB. There's a (draft) spec, and there'll be very well-vetted library code in Chrome.
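To make the window-size point concrete, here is a small sketch of why it matters for history dumps, using only Python's standard library. It does not use Brotli at all; `lzma` at preset 3 (4 MiB dictionary) stands in for "a compressor with a multi-megabyte window", and `zlib` stands in for flate, whose match distance tops out at 32 KiB. The `window_demo` helper is a name I made up. The input is one incompressible 1 MiB block repeated twice, a crude model of two revisions of the same page sitting 1 MiB apart.

```python
import lzma
import os
import zlib


def window_demo(block_size: int = 1 << 20) -> dict:
    """Compress block+block with a small-window and a large-window codec."""
    block = os.urandom(block_size)  # random bytes: no short-range redundancy
    data = block + block            # the only redundancy is 1 MiB back
    return {
        "block": block_size,
        "input": len(data),
        # DEFLATE cannot reference data more than 32 KiB back,
        # so it cannot see the repeat.
        "deflate": len(zlib.compress(data, 9)),
        # The 4 MiB dictionary at preset 3 covers the whole first copy.
        "xz_preset3": len(lzma.compress(data, preset=3)),
    }


if __name__ == "__main__":
    sizes = window_demo()
    for name in ("input", "deflate", "xz_preset3"):
        print(f"{name:>10}: {sizes[name]} bytes")
```

The deflate output should come out at roughly the full input size, while the larger-window output should be close to one block, since the second copy is encoded almost entirely as back-references.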
Two unknowns are 1) whether there will be a fast compressor (don't know the compression speed, and it just wasn't a priority for the Web-font use case) and 2) whether brotli will get a standard framing format and widely-used tools in the way that flate has gzip. But thought it was interesting enough to pass on even if it's too early to guess about all that right now.
On Sat, Mar 8, 2014 at 1:14 PM, Randall Farmer randall@wawd.com wrote:
[...]
xmldatadumps-l@lists.wikimedia.org