Netocrat wrote:
> On Mon, 12 Sep 2005 01:56:53 -0700, Brion Vibber wrote:
> That's exactly the sort of info I was looking for. Was any attempt made
> to compress the diffs? I would be interested to know how the result
> compared for compression and overall speed to the compressed
> concatenated revisions.
No, no work has been done along these lines that I'm aware of.
> The three main reasons to find an improvement to rcs diffs were stated
> as:
>
> * moved paragraphs
> * reverted edits
> * minor changes within a line
>
> The 1st and 3rd could be handled by a customised diff format, and the
> 2nd could be handled by links in the database. Have those possibilities
> been considered, and what are the pros and cons of this approach vs the
> current compression scheme?
No, these possibilities have not been rigorously examined. Note that
those aren't really reasons; they're illustrative only. Addressing each
of them in turn does not guarantee that your compression algorithm is
effective. I was just describing my train of thought in arriving at the
idea that LZ77 might be worth a try. If you have your own idea, please
download a dump and try it out.
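To make the LZ77 idea concrete, here is a minimal sketch of the kind of
experiment meant here, using zlib's LZ77-based deflate as the compressor.
The revision texts are made up; a real test would use texts from a dump:

```python
import zlib

# Made-up page history: each revision is a small edit of the previous one.
base = " ".join("word%d" % i for i in range(400))
revisions = [
    base,
    base + " an appended sentence.",
    "a moved sentence. " + base,
]

# Compress each revision on its own.
individual = sum(len(zlib.compress(r.encode())) for r in revisions)

# Compress the revisions concatenated: LZ77 back-references can then
# exploit the redundancy between adjacent revisions.
concatenated = len(zlib.compress("".join(revisions).encode()))

# concatenated comes out much smaller than individual, because each
# later revision is mostly a back-reference to the one before it.
print(individual, concatenated)
```

The same harness could be pointed at a real dump to compare compressed
diffs against compressed concatenation for both size and speed.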
The main thing which put me off implementing diff-based compression was
the complexity, in particular the required schema change. If you need to
load some large number of diffs in order to generate a revision, those
diffs need to be loaded in a single database query if any kind of
efficiency is to be achieved.
In other words, don't do a proof of principle and then nag me to write
the real thing, as if that were the easy part.
Since that talk, we've addressed the scalability issue by implementing
external storage, allowing us to store text on the terabytes of apache
hard drive space which were previously unused. Because of this, we're
less concerned about size now, and more about performance and
manageability. We'd like to have faster backups and much simpler
administration. Effective use of the existing compression and external
storage features has been hampered by high system administration
overhead. Any new storage proposal needs to be evaluated in this context.
> The disadvantage of the current compression scheme seems to me to be
> that the wiki software must work on the full text of a set of revisions
> at a time (i.e. uncompressed).
The advantage is that when a number of adjacent revisions are required
(such as during a backup), those revisions can be loaded quickly with a
minimum of seeking.
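The trade-off can be sketched in a few lines, assuming a made-up
delimiter and group size (the real storage format differs; this only
illustrates the access pattern):

```python
import zlib

# Toy version of concatenated storage: a group of adjacent revisions
# compressed as one blob.
SEP = "\x00"
revisions = ["rev %d text" % i for i in range(20)]
blob = zlib.compress(SEP.join(revisions).encode())

def get_revision(blob, i):
    # Reading ONE revision still decompresses the whole group...
    return zlib.decompress(blob).decode().split(SEP)[i]

def get_all(blob):
    # ...but a backup gets every revision in the group from a single
    # sequential read and one decompression pass.
    return zlib.decompress(blob).decode().split(SEP)

print(get_revision(blob, 5))  # "rev 5 text"
```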
-- Tim Starling