On Tue, Jan 18, 2011 at 7:21 PM, Aryeh Gregor <Simetrical+wikilist@gmail.com> wrote:
> On Mon, Jan 17, 2011 at 9:12 PM, Roan Kattouw <roan.kattouw@gmail.com> wrote:
>> Wikimedia doesn't technically use delta compression. It concatenates a couple dozen adjacent revisions of the same page and compresses that (with gzip?), achieving very good compression ratios because there is a huge amount of duplication in, say, 20 adjacent revisions of [[Barack Obama]] (small changes to a large page, probably a few identical versions due to vandalism reverts, etc.).
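
For anyone following along, here is that batching trick in miniature: a rough Python sketch with made-up revision data (the production code lives in includes/HistoryBlob.php), just to show why the ratio is so good when the page fits in the window:

    import random, zlib

    random.seed(0)
    # One large-ish page (20 KB of incompressible bytes) plus 20 small
    # edits, standing in for 20 adjacent revisions of the same article.
    base = bytes(random.getrandbits(8) for _ in range(20000))
    revisions = [base + b"edit %d" % i for i in range(20)]

    separate = sum(len(zlib.compress(r)) for r in revisions)  # one blob per revision
    batched = len(zlib.compress(b"".join(revisions)))         # one blob per batch
    print("separate: %d bytes, batched: %d bytes" % (separate, batched))

Since each revision after the first starts well inside zlib's 32 KB window, the batch should compress to little more than the size of one revision (on the order of 25 KB here, versus roughly 400 KB for the revisions compressed separately).
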
> We used to do this, but the problem was that many articles are much larger than the compression window of typical compression algorithms, so the redundancy between adjacent revisions wasn't helping compression except for short articles. Tim wrote a diff-based history storage method (see DiffHistoryBlob in includes/HistoryBlob.php) and deployed it on Wikimedia, for 93% space savings:
> http://lists.wikimedia.org/pipermail/wikitech-l/2010-March/047231.html
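
Both halves of that are easy to see in a toy sketch (my own Python, not DiffHistoryBlob itself, which if I remember correctly builds binary xdiff deltas). Part 1 shows zlib's 32 KB window missing a repeat that sits 100 KB away; part 2 is a minimal copy/insert delta store whose deltas scale with the size of the edit, not the size of the article:

    import random, zlib
    from difflib import SequenceMatcher

    # Part 1: the window problem. zlib's window is at most 32 KB, so two
    # identical 100 KB "revisions" stored back to back compress no better
    # than two unrelated blobs: the repeat is too far back to be matched.
    random.seed(0)
    big = bytes(random.getrandbits(8) for _ in range(100000))
    print(len(zlib.compress(big)), len(zlib.compress(big + big)))
    # prints roughly 100000 and 200000

    # Part 2: a toy diff-based store. Revision 0 is kept whole; every
    # later revision is a list of copy/insert ops against its predecessor.
    def make_delta(old, new):
        ops = []
        for tag, i1, i2, j1, j2 in SequenceMatcher(None, old, new).get_opcodes():
            if tag == "equal":
                ops.append(("copy", i1, i2))         # reuse old[i1:i2]
            elif tag in ("replace", "insert"):
                ops.append(("literal", new[j1:j2]))  # store changed lines verbatim
            # "delete": the old range is simply never copied
        return ops

    def apply_delta(old, ops):
        out = []
        for op in ops:
            out.extend(old[op[1]:op[2]] if op[0] == "copy" else op[1])
        return out

    rev0 = ["line %d of a long article" % i for i in range(2000)]
    rev1 = rev0[:]; rev1[1000] = "this line was vandalised"
    rev2 = rev1[:]; rev2[1000] = rev0[1000]  # the revert

    d1, d2 = make_delta(rev0, rev1), make_delta(rev1, rev2)
    assert apply_delta(apply_delta(rev0, d1), d2) == rev2
    print(len(d1), len(d2))  # 3 ops each, however large the article is

The revert in part 2 also shows why vandalism-heavy histories delta so well: undoing an edit costs no more to store than making it, and none of this depends on the article fitting in a compression window.
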
Why isn't this being used for the dumps?