About 40% of our text storage has been recompressed into DiffHistoryBlob format, which uses a combination of binary diffs and gzip to reduce storage space.
Approximately 1.9TB of text storage, mostly revisions compressed individually with gzip, was recompressed to about 140GB, a saving of 93%.
-- Tim Starling
Tim Starling wrote:
> About 40% of our text storage has been recompressed into DiffHistoryBlob format, which uses a combination of binary diffs and gzip to reduce storage space.
> Approximately 1.9TB of text storage, mostly revisions compressed individually with gzip, was recompressed to about 140GB, a saving of 93%.
> -- Tim Starling
Many thanks to Tim for making this happen.
This has been super helpful in making the XML snapshots run faster.
Is the re-compression in an automated enough state to do the next chunks on its own? Curious to see if you have to do all the shepherding for this.
--tomasz
Tomasz Finc wrote:
> Is the re-compression in an automated enough state to do the next chunks on its own? Curious to see if you have to do all the shepherding for this.
There's still some need for human involvement.
Also, there are some potential traps even for humans. There are continuing issues from bug 20757 and bug 22624, and as we recompress more recent clusters, we will run into bugs caused by extensions that interact directly with external storage, such as CodeReview, FlaggedRevs and AbuseFilter.
Any extension that follows the example from CodeReview and implements its own private text table will be a serious problem for RCT and will cause bit rot and data loss.
I added some protections for known bugs in trackBlobs.php in trunk. If you run the trunk version of trackBlobs.php on Wikimedia at the moment, it will just exit with an error (for good reason).
-- Tim Starling
You mention on bug 22624 the possibility of normalising the entire archive table to MW 1.5+ format. What are the chances of moving them back to the revision table and using revdelete for all deletions (removing the archive table)?
See bugs 18104, 21279, 18780.
Platonides wrote:
> You mention on bug 22624 the possibility of normalising the entire archive table to MW 1.5+ format. What are the chances of moving them back to the revision table and using revdelete for all deletions (removing the archive table)?
> See bugs 18104, 21279, 18780.
Can you copy that question to the bug report please? I don't want to deal with it right now.
-- Tim Starling
On 16 March 2010 21:43, Tim Starling <tstarling@wikimedia.org> wrote:
> About 40% of our text storage has been recompressed into DiffHistoryBlob format, which uses a combination of binary diffs and gzip to reduce storage space.
> Approximately 1.9TB of text storage, mostly revisions compressed individually with gzip, was recompressed to about 140GB, a saving of 93%.
Revisions were compressed individually? I thought they were concatenated and then compressed, to take advantage of the fact that revisions of the same article usually differ only by small amounts (and so are highly compressible). I'm sure Brion said that at some point...
On Tue, Mar 16, 2010 at 8:23 PM, Thomas Dalton <thomas.dalton@gmail.com> wrote:
> Revisions were compressed individually? I thought they were concatenated and then compressed, to take advantage of the fact that revisions of the same article usually differ only by small amounts (and so are highly compressible). I'm sure Brion said that at some point...
My recollection is that this was the case, but it didn't help much, because articles are typically bigger than the block size used by gzip.
Aryeh Gregor wrote:
> On Tue, Mar 16, 2010 at 8:23 PM, Thomas Dalton <thomas.dalton@gmail.com> wrote:
>> Revisions were compressed individually? I thought they were concatenated and then compressed, to take advantage of the fact that revisions of the same article usually differ only by small amounts (and so are highly compressible). I'm sure Brion said that at some point...
> My recollection is that this was the case, but it didn't help much, because articles are typically bigger than the block size used by gzip.
That compression scheme was called CGZ. It helped quite a lot, saving 85% or so compared to uncompressed plain text, IIRC. But the script used to do that compression (compressOld.php) was not compatible with $wgDefaultExternalStore, so it hasn't been run since 2005. Also it was single-threaded so it would have taken a very long time to complete.
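For illustration, here is a rough Python sketch of the concatenate-then-compress idea behind CGZ, with made-up sample revisions and zlib standing in for gzip (this is not the actual compressOld.php code):

    import zlib

    # Successive revisions of one article differ only slightly, so compressing
    # them together lets later revisions reuse text from earlier ones.
    base = " ".join("Sentence %d about the subject of the article." % i for i in range(200))
    revisions = [
        base,
        base + " A sentence added by the second edit.",
        base.replace("Sentence 50 ", "Sentence 50 (copyedited) ") + " A sentence added by the second edit.",
    ]

    individually = sum(len(zlib.compress(r.encode())) for r in revisions)
    together = len(zlib.compress("\n".join(revisions).encode()))
    print("compressed individually:", individually, "bytes")
    print("compressed together:    ", together, "bytes")

Compressed together, the second and third revisions add almost nothing on top of the first, as long as the shared text still fits within gzip's window (see below).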
The new compression script (recompressTracked.php) works with $wgDefaultExternalStore and handles various other storage-type subtleties. It copies all text from a given set of source clusters to a single destination cluster, allowing the original clusters to be deleted. This is handy from a sysadmin perspective.
Also, recompressTracked.php is scaled up in various ways: it runs multiple worker processes in parallel, it's restartable, and it uses transactions to guarantee data integrity even if other processes are updating the same rows at the same time, or if the worker process is killed at any time.
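As a very loose sketch of that batch-copy pattern (not the real recompressTracked.php; the SQLite files, the blobs table and the batch size here are all made up for illustration):

    import sqlite3
    from multiprocessing import Pool

    SRC, DST = "src_cluster.db", "dst_cluster.db"

    def copy_batch(bounds):
        # Copy one id range; one transaction per batch, so a killed worker can
        # simply be re-run without corrupting the destination.
        start_id, end_id = bounds
        src = sqlite3.connect(SRC)
        dst = sqlite3.connect(DST, timeout=60)  # wait if another worker holds the write lock
        rows = src.execute(
            "SELECT blob_id, blob_text FROM blobs WHERE blob_id BETWEEN ? AND ?",
            (start_id, end_id)).fetchall()
        with dst:
            dst.executemany(
                "INSERT OR REPLACE INTO blobs (blob_id, blob_text) VALUES (?, ?)", rows)
        src.close()
        dst.close()
        return len(rows)

    if __name__ == "__main__":
        # Toy setup: a source "cluster" with 10,000 blobs and an empty destination.
        for path in (SRC, DST):
            with sqlite3.connect(path) as db:
                db.execute("CREATE TABLE IF NOT EXISTS blobs"
                           " (blob_id INTEGER PRIMARY KEY, blob_text TEXT)")
        with sqlite3.connect(SRC) as db:
            db.executemany("INSERT OR REPLACE INTO blobs VALUES (?, ?)",
                           [(i, "revision text %d" % i) for i in range(10000)])

        batches = [(i, i + 999) for i in range(0, 10000, 1000)]
        with Pool(4) as pool:  # several worker processes in parallel
            print("copied", sum(pool.map(copy_batch, batches)), "blobs")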
The maximum dictionary size for gzip is 32KB. It was easy to see that the compression ratio in the CGZ scheme worsened dramatically once the article size exceeded 32KB, because subsequent revisions were no longer able to reference text in previous revisions. We have a lot more articles over 32KB in Wikipedia today, so the compression ratio would not have been as good as it was back in 2005.
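A quick way to see that effect (hypothetical sizes, with Python's zlib standing in for gzip's DEFLATE):

    import os, zlib

    # DEFLATE can only reference text within its 32KB sliding window, so in a
    # concatenated blob a later revision can no longer point back at the
    # previous revision's text once the article is bigger than 32KB.
    def ratio(article_size):
        rev1 = os.urandom(article_size // 2).hex()  # a poorly-compressible "article"
        rev2 = rev1 + " plus one small edit"        # next revision: almost identical
        blob = (rev1 + rev2).encode()
        return len(zlib.compress(blob)) / len(blob)

    print("16KB article:", round(ratio(16 * 1024), 2))  # rev1 still in the window: big saving
    print("64KB article:", round(ratio(64 * 1024), 2))  # rev1 out of the window: no cross-revision saving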
The DiffHistoryBlob project was interesting, and achieved awesome compression ratios compared to CGZ. But it was relatively straightforward. Most of the work to make this happen was in the development and operation of trackBlobs/recompressTracked.
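For a rough idea of the diff-then-gzip approach, here is a toy Python sketch; it uses line-based difflib deltas and zlib where the real DiffHistoryBlob uses binary diffs and gzip, and the sample revisions are made up:

    import difflib, zlib

    # Store the first revision in full and each later revision as a diff against
    # the previous one, then compress the whole chain. (The real format can also
    # apply the deltas to reconstruct any revision; that part is omitted here.)
    def pack(revisions):
        chain = [revisions[0]]
        for prev, cur in zip(revisions, revisions[1:]):
            delta = "\n".join(difflib.unified_diff(
                prev.splitlines(), cur.splitlines(), lineterm=""))
            chain.append(delta)
        return zlib.compress("\x00".join(chain).encode())

    base = "\n".join("Paragraph %d of the article body." % i for i in range(500))
    revisions = [
        base,
        base + "\nA sentence added in the second revision.",
        base.replace("Paragraph 100 ", "Paragraph 100 (reworded) ")
            + "\nA sentence added in the second revision.",
    ]
    raw = sum(len(r.encode()) for r in revisions)
    print("plain text:", raw, "bytes; packed:", len(pack(revisions)), "bytes")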
-- Tim Starling