On 01/21/2014 01:23 AM, Randall Farmer wrote:
> Anyway, I'm saying too many fundamentally unimportant words. If the status
> quo re: compression in fact causes enough pain to give histzip a fuller
> look, or if there's some way to redirect the tech in it towards a useful
> end, it would be great to hear from interested folks; if not, it was fun
> work but there may not be much more to do or say.
Efficient compression with large match windows is also very interesting for
storing history in databases like Cassandra. When storing a wikitext dump
in Cassandra, gzip with its 32 KB sliding window yields a database size of
about 16-18% of the input text size. This could be much better if
repetitions more than 32 KB apart could be caught. With more verbose HTML
this is even more important, as more articles will be larger than 32 KB.
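
To illustrate the window effect, here is a rough sketch using Python's
standard zlib and lzma modules. The "history" is synthetic: a block of
pseudo-random bytes repeated as several 64 KB "revisions", which is an
assumption made purely for illustration and not real wikitext. The function
name and parameters are hypothetical. It is not a measurement of Cassandra or
of histzip, only a demonstration that DEFLATE cannot reference a copy that
lies more than 32 KB back, while a larger dictionary can.

```python
import lzma
import random
import zlib


def compressed_sizes(revision_size=64 * 1024, revisions=4):
    """Compare DEFLATE (32 KB window) with LZMA (multi-megabyte dictionary)
    on a synthetic history whose revisions are each larger than 32 KB."""
    rng = random.Random(0)  # fixed seed so the result is reproducible
    # Pseudo-random bytes are incompressible on their own, so any savings
    # must come from matching one revision against an earlier one.
    base = bytes(rng.getrandbits(8) for _ in range(revision_size))
    # Each "revision" is the same text plus a tiny edit marker.
    history = b"".join(base + b"edit %d" % i for i in range(revisions))
    return {
        "raw": len(history),
        # zlib uses the same DEFLATE format as gzip: matches can reach
        # at most 32 KB back, so the previous revision is out of range.
        "zlib": len(zlib.compress(history, 9)),
        # The default xz preset uses a dictionary far larger than 64 KB,
        # so earlier revisions stay within reach.
        "lzma": len(lzma.compress(history)),
    }


if __name__ == "__main__":
    for name, size in compressed_sizes().items():
        print(f"{name}: {size} bytes")
```

On this input the zlib output should stay close to the raw size, while the
lzma output should be close to the size of a single revision.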
For internal uses, tool support is not very important, so a port of
histzip / rzip could work well. For external uses like XML dumps,
integrating the compression strategy into LZMA would, however, be very
attractive. This would also benefit other users of LZMA compression like
HBase.
Gabriel