Aryeh Gregor wrote:
> On Tue, Mar 16, 2010 at 8:23 PM, Thomas Dalton thomas.dalton@gmail.com wrote:
>> Revisions were compressed individually? I thought they were concatenated and then compressed to take advantage of revisions of the same article usually only differing by small amounts (and so being highly compressible). I'm sure brion said that sometime...
> My recollection is that this was the case, but it didn't help much, because articles are typically bigger than the block size used by gzip.
That compression scheme was called CGZ. It helped quite a lot, saving 85% or so compared to uncompressed plain text, IIRC. But the script used to do that compression (compressOld.php) was not compatible with $wgDefaultExternalStore, so it hasn't been run since 2005. Also, it was single-threaded, so it would have taken a very long time to complete.
The new compression script (recompressTracked.php) works with $wgDefaultExternalStore and various other storage type subtleties. It copies all text from a given set of source clusters to a single destination cluster, allowing the original clusters to be deleted. This is handy from a sysadmin perspective.
Also, recompressTracked.php is scaled up in various ways: it runs multiple worker processes in parallel, it's restartable, and it uses transactions to guarantee data integrity even if other processes are updating the same rows at the same time, or if the worker process is killed at any time.
The maximum dictionary (sliding window) size for gzip is 32KB. It was easy to see that the compression ratio in the CGZ scheme worsened dramatically once the article size exceeded 32KB, because subsequent revisions were no longer able to reference text in previous revisions. We have a lot more articles over 32KB in Wikipedia today, so the compression ratio would not have been as good as it was back in 2005.
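To make the 32KB effect concrete, here is a rough sketch in Python. It is not the CGZ code: it uses Python's zlib module (the same DEFLATE algorithm gzip uses) on synthetic random text, and make_revisions and concat_ratio are names made up for this illustration. It concatenates 20 nearly identical "revisions" and compresses them as one stream, once for an 8KB article and once for a 64KB article.

```python
import random
import zlib


def make_revisions(size, count, seed=0):
    """Build `count` synthetic revisions of a `size`-byte article.

    Each revision differs from the previous one by a handful of
    single-character edits, loosely imitating small wiki edits.
    (Hypothetical test data, not real article text.)
    """
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyz     "
    text = [rng.choice(alphabet) for _ in range(size)]
    revisions = []
    for _ in range(count):
        for _ in range(5):
            text[rng.randrange(size)] = rng.choice(alphabet)
        revisions.append("".join(text).encode("ascii"))
    return revisions


def concat_ratio(revisions):
    """Compressed size / raw size when all revisions are concatenated
    and compressed as a single DEFLATE stream."""
    raw = b"".join(revisions)
    return len(zlib.compress(raw, 9)) / len(raw)


# 8KB article: the previous revision is still inside the 32KB window,
# so each new revision is encoded mostly as back-references.
small = concat_ratio(make_revisions(8 * 1024, 20))

# 64KB article: the matching text in the previous revision is 64KB back,
# beyond the window, so every revision is compressed almost from scratch.
large = concat_ratio(make_revisions(64 * 1024, 20))

print(f"8KB article:  compressed/raw = {small:.3f}")
print(f"64KB article: compressed/raw = {large:.3f}")
```

The exact ratios depend on the synthetic data, but the 64KB case should come out many times worse than the 8KB case, which is the same cliff described above for real articles.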
The DiffHistoryBlob project was interesting, and achieved awesome compression ratios compared to CGZ. But it was relatively straightforward. Most of the work to make this happen was in the development and operation of trackBlobs/recompressTracked.
-- Tim Starling