> you may want to drop it on a Wikimedia repo
This is done (https://gerrit.wikimedia.org/r/#/c/63139/). Besides the rzip script, there's another one that does a simple dedupe of lines of text repeated between revisions, then gzips. It's slower than rzip at any given compression level, but still faster and smaller than straight bzip2 (in a test on 4 GB of enwiki, about 50% smaller and 10x faster), and anyone with Python can run it.
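For anyone curious what the line-dedupe idea looks like, here's a minimal Python sketch. It's not the actual script from the change above; the function names and the "+"/"=" line format are made up for illustration. Each revision's lines are compared against the previous revision, repeated lines become short back-references, and the whole stream is gzipped.

# Sketch of "dedupe lines repeated between revisions, then gzip".
# Lines seen in the immediately preceding revision are replaced by a
# reference to their line number there ("=N"); new lines are emitted
# literally with a "+" prefix so the stream stays reversible.
import gzip

def dedupe_revisions(revisions):
    """Yield one deduped text chunk per revision."""
    prev_index = {}
    for text in revisions:
        out = []
        new_index = {}
        for i, line in enumerate(text.splitlines()):
            new_index.setdefault(line, i)
            if line in prev_index:
                out.append("=%d" % prev_index[line])
            else:
                out.append("+" + line)
        prev_index = new_index
        yield "\n".join(out) + "\n"

def compress_revisions(revisions, path):
    """Write all deduped revisions to a single gzip file."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for chunk in dedupe_revisions(revisions):
            f.write(chunk)

# Two nearly identical revisions compress down to mostly "=" lines.
compress_revisions(
    ["first line\nsecond line\n", "first line\nsecond line\nthird line\n"],
    "revs.gz",
)

The real script is presumably smarter about matching, but even a naive version like this shows why near-identical revisions shrink so much before gzip ever sees them.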
Again, if there's any lesson, it's that even pretty naive attempts to compress redundancy between revisions yield real gains. I'm interested to see what the summer dumps project produces.
I've also expanded the braindump about this stuff on the dbzip2 page you linked to.
On Mon, May 6, 2013 at 9:02 AM, Randall Farmer randall@wawd.com wrote:
Sure.
On Mon, May 6, 2013 at 12:06 AM, Federico Leva (Nemo) nemowiki@gmail.com wrote:
Randall Farmer, 06/05/2013 08:37:
To wrap up what I started earlier, here's a slightly tweaked copy of the last script I sent around [...] But, all that said, I'm declaring blks2.py a (kinda fun to work on!) dead end. :)
If you're done with it, you may want to drop it on a Wikimedia repo like <https://gerrit.wikimedia.org/r/gitweb?p=operations/dumps.git;a=tree;f=toys;h=0974d59e573fd5bceb76ec93878471bc11f6430c;hb=119d99131f2cf692819422ad5e516c49d935a504> or whatever, just so that it's not only a mail attachment. I also copied some short info you sent earlier to https://www.mediawiki.org/wiki/Dbzip2#rzip_and_xdelta3 for lack of better existing pages (?).
Nemo