On 01/21/2014 01:23 AM, Randall Farmer wrote:
Anyway, I'm saying too many fundamentally unimportant words. If the status quo re: compression in fact causes enough pain to give histzip a fuller look, or if there's some way to redirect the tech in it towards a useful end, it would be great to hear from interested folks; if not, it was fun work but there may not be much more to do or say.
Efficient compression with large match windows is also very interesting for storing history in databases like Cassandra. When storing a wikitext dump in Cassandra, gzip with its 32k sliding window yields a database size of about 16-18% of the input text size. This could be much better if repetitions that lie more than 32k back in the input could be caught. With more verbose HTML this matters even more, as more articles will be larger than 32k.
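To make the window effect concrete, here is a minimal synthetic sketch in Python. It is not a measurement on real dumps. The function name, revision size and revision count are made up for the illustration, and random bytes stand in for article text so that the only redundancy is the repetition between "revisions". It compares zlib (DEFLATE, 32k window) with the standard lzma module, whose default preset uses a dictionary of several megabytes:

```python
import lzma
import os
import zlib


def compare_window_effect(revision_size=100_000, revisions=5):
    """Compress a synthetic 'history' with DEFLATE and with LZMA.

    Hypothetical helper for illustration only. Each 'revision' is an
    identical copy of one block of random bytes, so the only redundancy
    is a repeat at a distance of revision_size bytes, which is beyond
    DEFLATE's 32 KiB window when revision_size is larger than 32 KiB.
    """
    base = os.urandom(revision_size)   # incompressible stand-in for one revision
    history = base * revisions         # consecutive identical revisions

    deflate_size = len(zlib.compress(history, 9))       # 32 KiB window
    lzma_size = len(lzma.compress(history, preset=6))   # multi-MiB dictionary

    return len(history), deflate_size, lzma_size


if __name__ == "__main__":
    total, deflate_size, lzma_size = compare_window_effect()
    print(f"input:   {total} bytes")
    print(f"deflate: {deflate_size} bytes ({deflate_size / total:.0%})")
    print(f"lzma:    {lzma_size} bytes ({lzma_size / total:.0%})")
```

With these settings DEFLATE should get essentially no benefit from the repeats, while LZMA should store roughly one revision plus a small overhead. Real wikitext is compressible within a revision as well, so the gap on actual history will look different.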
For internal uses, tool support is not very important, so a port of histzip / rzip could work well. For external uses like XML dumps, however, integrating the compression strategy into LZMA would be very attractive. This would also benefit other users of LZMA compression, such as HBase.
Gabriel