Aryeh Gregor wrote:
> On Tue, Mar 16, 2010 at 8:23 PM, Thomas Dalton
> <thomas.dalton(a)gmail.com> wrote:
>> Revisions were compressed individually? I thought they were
>> concatenated and then compressed to take advantage of revisions of the
>> same article usually only differing by small amounts (and so being
>> highly compressible). I'm sure brion said that sometime...
>
> My recollection is that this was the case, but it didn't help much,
> because articles are typically bigger than the block size used by
> gzip.

That compression scheme was called CGZ. It helped quite a lot, saving
85% or so compared to uncompressed plain text, IIRC. But the script
used to do that compression (compressOld.php) was not compatible with
$wgDefaultExternalStore, so it hasn't been run since 2005. It was also
single-threaded, so a full run would have taken a very long time to complete.
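To illustrate why concatenating before compressing helps, here is a minimal sketch. It is not the CGZ code: the synthetic article, the vocabulary, and the function names (make_revisions, individual_size, concatenated_size) are all assumptions made for the demo, and it uses zlib rather than the actual storage classes.

```python
import random
import string
import zlib


def make_revisions(article_bytes, count, seed=0):
    """Return `count` revisions of a synthetic article, each changing one word.

    Hypothetical test data: real revisions differ in messier ways.
    """
    rng = random.Random(seed)
    vocab = ["".join(rng.choices(string.ascii_lowercase, k=rng.randint(3, 9)))
             for _ in range(500)]
    words = []
    size = 0
    while size < article_bytes:
        word = rng.choice(vocab)
        words.append(word)
        size += len(word) + 1
    revisions = []
    for _ in range(count):
        # One small edit per revision, then snapshot the full text.
        words[rng.randrange(len(words))] = rng.choice(vocab)
        revisions.append(" ".join(words).encode("ascii"))
    return revisions


def individual_size(revisions):
    """Total size when every revision is compressed on its own."""
    return sum(len(zlib.compress(rev, 9)) for rev in revisions)


def concatenated_size(revisions):
    """Size when all revisions are joined and compressed as one stream."""
    return len(zlib.compress(b"".join(revisions), 9))


if __name__ == "__main__":
    revs = make_revisions(article_bytes=4096, count=20)
    print("individual:  ", individual_size(revs))
    print("concatenated:", concatenated_size(revs))
```

With a small article like this one, each later revision can be encoded almost entirely as references back to the previous copy, so the concatenated stream should come out far smaller than the sum of the individually compressed revisions.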
The new compression script (recompressTracked.php) works with
$wgDefaultExternalStore and various other storage type subtleties. It
copies all text from a given set of source clusters to a single
destination cluster, allowing the original clusters to be deleted.
This is handy from a sysadmin perspective.
Also, recompressTracked.php is scaled up in various ways: it runs
multiple worker processes in parallel, it's restartable, and it uses
transactions to guarantee data integrity even if other processes are
updating the same rows at the same time, or if the worker process is
killed at any time.
The maximum dictionary size (sliding window) for gzip is 32KB. It was easy to see that
the compression ratio in the CGZ scheme worsened dramatically once the
article size exceeded 32KB, because subsequent revisions were no
longer able to reference text in previous revisions. We have a lot
more articles over 32KB in Wikipedia today, so the compression ratio
would not have been as good as it was back in 2005.
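The 32KB limit can be seen directly with a small experiment. This is only a sketch under stated assumptions: the block and gap sizes are arbitrary, the data is random bytes rather than article text, and repeat_cost is a made-up helper name. It measures how many extra compressed bytes a second copy of a block costs, depending on how far back the first copy sits.

```python
import random
import zlib


def random_bytes(rng, n):
    """n pseudo-random bytes (essentially incompressible on their own)."""
    return bytes(rng.getrandbits(8) for _ in range(n))


def repeat_cost(gap_bytes, block_bytes=4096, seed=0):
    """Extra compressed bytes needed to store a second copy of a block
    that starts `block_bytes + gap_bytes` after the first copy."""
    rng = random.Random(seed)
    block = random_bytes(rng, block_bytes)
    gap = random_bytes(rng, gap_bytes)
    once = len(zlib.compress(block + gap, 9))
    twice = len(zlib.compress(block + gap + block, 9))
    return twice - once


if __name__ == "__main__":
    # First copy is 12KB back: inside the 32KB window.
    print("near repeat:", repeat_cost(gap_bytes=8 * 1024))
    # First copy is 44KB back: outside the window.
    print("far repeat: ", repeat_cost(gap_bytes=40 * 1024))
```

When the earlier copy is within 32KB, the repeat should cost only a small fraction of the block size. Once it is further back than that, the compressor can no longer reference it and the repeat costs roughly as much as storing the block again, which is the same effect as a revision that cannot see the previous revision of a large article.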
The DiffHistoryBlob project was interesting, and achieved awesome
compression ratios compared to CGZ. But that part was relatively
straightforward. Most of the work to make this happen was in the
development and operation of trackBlobs/recompressTracked.
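For readers who want a feel for the diff-based idea, here is a rough sketch. It does not reproduce DiffHistoryBlob's actual format or diff algorithm; the line-based copy/insert delta, the JSON container, and all function names below are hypothetical stand-ins built on Python's difflib and zlib.

```python
import difflib
import json
import zlib


def make_delta(old, new):
    """Describe `new` as copy/insert operations against `old`.

    Both arguments are lists of lines (strings).
    """
    ops = []
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(["copy", i1, i2])
        elif tag in ("replace", "insert"):
            ops.append(["insert", new[j1:j2]])
        # "delete" needs no operation: those old lines are just not copied.
    return ops


def apply_delta(old, ops):
    """Rebuild the newer revision from the older one plus a delta."""
    out = []
    for op in ops:
        if op[0] == "copy":
            out.extend(old[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out


def pack_history(revisions):
    """Store the first revision in full and later ones as deltas, then compress."""
    payload = [revisions[0]]
    for old, new in zip(revisions, revisions[1:]):
        payload.append(make_delta(old, new))
    return zlib.compress(json.dumps(payload).encode("utf-8"), 9)


def unpack_history(blob):
    """Inverse of pack_history: returns the list of revisions."""
    payload = json.loads(zlib.decompress(blob).decode("utf-8"))
    revisions = [payload[0]]
    for ops in payload[1:]:
        revisions.append(apply_delta(revisions[-1], ops))
    return revisions
```

Because each delta only carries the changed lines, the packed size no longer depends on the previous revision fitting inside the 32KB window, so the approach keeps working for large articles where plain concatenation stops helping.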
-- Tim Starling