Delirium wrote:
Out of curiosity, have you tried testing bzip2? It's usually much
better than gzip with multi-megabyte text data; for example, the source
for Linux kernel 2.6.9 is ~44 MB with gzip and ~35 MB with bzip2. I
believe it also exploits similarities across files, so concatenation may
not be necessary. It does use much more RAM and execute more slowly
than gzip, however.
Yes, see http://meta.wikimedia.org/wiki/History_compression . Bzip2 had
a much better compression ratio, but it was 3.3 times slower to
decompress and 13 times slower to compress. No block size could give it
anything like the performance of gzip.
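For readers who want to reproduce this kind of comparison, here is a minimal sketch (not the benchmark cited above) using Python's zlib and bz2 modules as stand-ins for the gzip and bzip2 command-line tools, on synthetic revision text:

```python
import bz2
import time
import zlib

# Synthetic "article history": one base text lightly edited many times.
# The data and numbers are illustrative only, not the Wikipedia test set.
base = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. " * 200
history = "".join(base.replace("Lorem", f"Rev{i}") for i in range(50)).encode()

for name, compress in (("zlib", zlib.compress), ("bz2", bz2.compress)):
    start = time.perf_counter()
    packed = compress(history)
    elapsed = time.perf_counter() - start
    ratio = 100 * (1 - len(packed) / len(history))
    print(f"{name}: {ratio:.1f}% compression in {elapsed * 1000:.1f} ms")
```

On highly redundant input like this, both compressors do well; the interesting differences are in CPU time, which is what the figures above measure.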
Concatenation is still necessary. In the previous test, bzip2 gave 97%
compression for heavily edited articles, which far exceeds anything
recorded for individual revisions.
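The effect of concatenation can be sketched as follows: consecutive revisions are nearly identical, so compressing the concatenated history lets the compressor exploit cross-revision redundancy that per-revision compression cannot see. The data here is synthetic, not the Wikipedia test:

```python
import bz2

# 30 revisions of one article, each a small edit on a large shared body.
base = "The quick brown fox jumps over the lazy dog. " * 500
revisions = [base + f"Edit number {i}." for i in range(30)]

# Compress each revision on its own, then the whole history at once.
individually = sum(len(bz2.compress(r.encode())) for r in revisions)
concatenated = len(bz2.compress("".join(revisions).encode()))

print(f"compressed separately:   {individually} bytes")
print(f"compressed concatenated: {concatenated} bytes")
```

The concatenated form comes out far smaller, because the shared text is stored (in effect) once rather than thirty times.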
Preliminary testing of a diff-based method suggests that diffs can
achieve a compression ratio similar to bzip2's while being even faster
than gzip.
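As a hypothetical sketch of what a diff-based history store could look like (using Python's difflib as a stand-in for whatever diff engine production code would use): keep the first revision whole, store each later revision as a delta against its predecessor, and replay deltas to reconstruct any revision.

```python
import difflib

def to_deltas(revisions):
    """Keep revision 0 whole; store each later revision as an ndiff
    delta against its predecessor."""
    deltas = [revisions[0]]
    for prev, curr in zip(revisions, revisions[1:]):
        deltas.append(list(difflib.ndiff(
            prev.splitlines(keepends=True),
            curr.splitlines(keepends=True))))
    return deltas

def restore(deltas, index):
    """Rebuild revision `index` by replaying deltas from the base text.
    difflib.restore(delta, 2) recovers the "after" side of an ndiff."""
    text = deltas[0]
    for delta in deltas[1:index + 1]:
        text = "".join(difflib.restore(delta, 2))
    return text
```

Since each delta only records the changed lines, the store stays small for heavily edited articles, and applying a handful of line-level diffs is much cheaper than a bzip2 decompression pass.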
We've generally assumed that performance is the most important
consideration. I'm willing to admit the possibility that extra DB
hardware for poorly compressed data may turn out to be more expensive
than extra Apache hardware for better compression. But a diff-based
method may allow us to avoid that tradeoff altogether.
-- Tim Starling