On Mon, Jan 17, 2011 at 9:12 PM, Roan Kattouw roan.kattouw@gmail.com wrote:
> Wikimedia doesn't technically use delta compression. It concatenates a couple dozen adjacent revisions of the same page and compresses that (with gzip?), achieving very good compression ratios because there is a huge amount of duplication in, say, 20 adjacent revisions of [[Barack Obama]] (small changes to a large page, probably a few identical versions due to vandalism reverts, etc.).
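(Aside, in case it helps to see it spelled out: the concatenate-and-compress scheme boils down to roughly the Python sketch below. The names are mine and it glosses over the real serialization -- it is not MediaWiki's HistoryBlob/ExternalStore code -- but it shows where the win comes from: a couple dozen near-identical texts compressed as one blob.)

    import zlib

    def pack_revisions(revision_texts):
        # Concatenate ~20 adjacent revisions and compress them as a single
        # blob, recording byte offsets so one revision can be sliced back
        # out after decompression.
        offsets, parts, pos = [], [], 0
        for text in revision_texts:
            data = text.encode("utf-8")
            offsets.append((pos, len(data)))
            parts.append(data)
            pos += len(data)
        return zlib.compress(b"".join(parts), 9), offsets

    def unpack_revision(blob, offsets, i):
        # Fetching any one revision means decompressing the whole blob.
        data = zlib.decompress(blob)
        start, length = offsets[i]
        return data[start:start + length].decode("utf-8")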
We used to do this, but the problem was that many articles are much larger than the compression window of typical compression algorithms, so the redundancy between adjacent revisions wasn't helping compression except for short articles. Tim wrote a diff-based history storage method (see DiffHistoryBlob in includes/HistoryBlob.php) and deployed it on Wikimedia, achieving 93% space savings:
http://lists.wikimedia.org/pipermail/wikitech-l/2010-March/047231.html
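To make the contrast concrete: the diff-based approach stores the first revision in full and each later revision as a small diff against its predecessor, so the redundancy between revisions is captured explicitly before any general-purpose compressor (with its limited window) gets involved. A rough Python sketch of that idea -- not the actual DiffHistoryBlob code, which has its own diff format and serialization:

    import difflib, json, zlib

    def make_diff_blob(revision_texts):
        # First revision in full; each later one as a line-level delta
        # against its predecessor.  The deltas stay small even when the
        # article itself is far larger than the compressor's window
        # (32 KB for gzip/DEFLATE).
        items = [revision_texts[0]]
        for prev, cur in zip(revision_texts, revision_texts[1:]):
            delta = list(difflib.ndiff(prev.splitlines(keepends=True),
                                       cur.splitlines(keepends=True)))
            items.append(delta)
        return zlib.compress(json.dumps(items).encode("utf-8"), 9)

    def expand_revision(blob, i):
        # Start from the full first revision and re-apply deltas 1..i.
        items = json.loads(zlib.decompress(blob).decode("utf-8"))
        text = items[0]
        for delta in items[1:i + 1]:
            text = "".join(difflib.restore(delta, 2))
        return text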
I don't know if the diff-based storage was ever deployed to all of external storage, though. In that thread Tim mentioned recompressing only about 40% of revisions, and said that the recompression script required care and human attention to work correctly, so maybe he never got around to recompressing the rest -- as far as I saw, he never said either way.