Re: [Wikitech-l] [Xmldatadumps-l] Compressing full-history dumps faster

21 Jan 2014

On 01/21/2014 01:23 AM, Randall Farmer wrote:
...
  Anyway, I'm saying too many fundamentally unimportant words. If the
  status quo re: compression in fact causes enough pain to give histzip a
  fuller look, or if there's some way to redirect the tech in it towards a
  useful end, it would be great to hear from interested folks; if not, it
  was fun work but there may not be much more to do or say.

Efficient compression with large match windows is very interesting for
storing history in databases like Cassandra as well. When storing a
wikitext dump in Cassandra, gzip with its 32k sliding window yields a
database size of about 16-18% of the input text size. This could be much
better if repetitions more than 32k apart could be caught. With more
verbose HTML this is even more important, as more articles will be larger
than 32k, so that consecutive revisions fall outside each other's window.
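
To make the window effect concrete, here is a minimal sketch using Python's
standard zlib and lzma modules. It is only an illustration, not how the
Cassandra storage or histzip works: random bytes stand in for one revision
(so nothing compresses within a revision), the history is just that revision
repeated, and the function name and sizes are made up for the example.

```python
import lzma
import os
import zlib


def compressed_sizes(revision_size: int, revisions: int = 4) -> dict:
    """Compress a fake 'history' of identical revisions with zlib and LZMA.

    Assumption for illustration: one revision is modelled as random bytes,
    so any savings can only come from matching an earlier copy.
    """
    revision = os.urandom(revision_size)
    history = revision * revisions
    return {
        "input": len(history),
        # DEFLATE (gzip/zlib) can only reference data at most 32 KiB back.
        "zlib": len(zlib.compress(history, 9)),
        # The default xz preset uses a dictionary of several MiB.
        "lzma": len(lzma.compress(history)),
    }


if __name__ == "__main__":
    for size in (20 * 1024, 64 * 1024):
        sizes = compressed_sizes(size)
        print(f"revision size {size:6d}: {sizes}")
```

With 20 KiB revisions the previous copy is still inside the 32k window, so
both compressors collapse the repeats. With 64 KiB revisions zlib finds
nothing to match and stays near the input size, while LZMA still reduces the
history to roughly one revision.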

For internal uses, tool support is not very important, so a port of
histzip / rzip could work well. For external uses like XML dumps, however,
integrating the compression strategy into LZMA would be very attractive.
This would also benefit other users of LZMA compression like HBase.

Gabriel
