On Tue, Jan 18, 2011 at 7:21 PM, Aryeh Gregor
<Simetrical+wikilist(a)gmail.com> wrote:
> On Mon, Jan 17, 2011 at 9:12 PM, Roan Kattouw
> <roan.kattouw(a)gmail.com> wrote:
>> Wikimedia doesn't technically use delta compression. It concatenates
>> a couple dozen adjacent revisions of the same page and compresses
>> that (with gzip?), achieving very good compression ratios because
>> there is a huge amount of duplication in, say, 20 adjacent revisions
>> of [[Barack Obama]] (small changes to a large page, probably a few
>> identical versions due to vandalism reverts, etc.).
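To make that concrete, here is a rough Python sketch of the concatenate-then-compress idea. It is illustrative only: the fake article text, revision count, and use of zlib are my assumptions, not MediaWiki's actual storage code.

```python
import zlib

# Illustrative only: fake 20 adjacent revisions of one page, each a
# small edit on top of a shared ~18 KB base (within zlib's 32 KB window).
base = "".join(f"Sentence {i} of the [[Barack Obama]] article.\n"
               for i in range(400))
revisions = [base + f"Small edit number {i}.\n" for i in range(20)]

# Compressing each revision separately cannot exploit the duplication
# *between* revisions...
individual = sum(len(zlib.compress(r.encode())) for r in revisions)

# ...but concatenating the batch and compressing once lets the
# compressor deduplicate the text shared by adjacent revisions.
concatenated = len(zlib.compress("".join(revisions).encode()))

print(f"separate: {individual} bytes, concatenated: {concatenated} bytes")
```

On toy data like this the concatenated blob comes out a small fraction of the separate total, which is the ratio win being described.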
> We used to do this, but the problem was that many articles are much
> larger than the compression window of typical compression algorithms,
> so the redundancy between adjacent revisions wasn't helping
> compression except for short articles. Tim wrote a diff-based history
> storage method (see DiffHistoryBlob in includes/HistoryBlob.php) and
> deployed it on Wikimedia, for 93% space savings:
> http://lists.wikimedia.org/pipermail/wikitech-l/2010-March/047231.html
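For anyone curious what diff-based history storage means in practice, here is a hedged Python sketch of the principle only: store one full base revision plus a compact edit script per later revision, so the redundancy between adjacent revisions is captured explicitly rather than relying on a compressor's limited window. The real DiffHistoryBlob in includes/HistoryBlob.php is PHP and uses its own binary diff format; everything below is my analogy, not its actual algorithm.

```python
from difflib import SequenceMatcher

def make_diff(old_lines, new_lines):
    """Edit script: ranges copied from the old revision, plus new text."""
    ops = []
    matcher = SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))        # reuse old_lines[i1:i2]
        elif tag != "delete":                   # replace or insert
            ops.append(("lines", new_lines[j1:j2]))
    return ops

def apply_diff(old_lines, ops):
    """Rebuild a revision from its predecessor plus the edit script."""
    out = []
    for op in ops:
        out.extend(old_lines[op[1]:op[2]] if op[0] == "copy" else op[1])
    return out

# A large article (far bigger than a 32 KB gzip window, line by line)
# with one small edit in the next revision.
old = [f"Line {i} of a large article.\n" for i in range(3000)]
new = old.copy()
new[1500] = "A small vandalism revert happened here.\n"

ops = make_diff(old, new)
assert apply_diff(old, ops) == new   # the diff round-trips exactly
```

The stored diff is a few dozen bytes instead of a second full copy of the page, regardless of how large the article is, which is why this approach keeps working where the concatenate-and-gzip scheme stopped helping.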