you may want to drop it on a Wikimedia repo
This is done (https://gerrit.wikimedia.org/r/#/c/63139/). Besides the rzip
script, there's another one that does a simple dedupe of lines of text
repeated between revs, then gzips. It's slower than rzip at any given
compression level, but still faster and smaller than straight bzip2 (in a
test on 4 GB of enwiki, 50% smaller and 10x faster), and anyone with Python
can run it.
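For concreteness, here's a rough sketch of the kind of inter-revision line dedupe described above: lines already seen in the previous revision are replaced with back-references before gzipping, so the compressor sees far less repeated text. The function names and the back-reference format here are invented for illustration and are not the actual script's.

```python
import gzip

def dedupe_revisions(revisions):
    """Encode each revision against the previous one: a line that also
    appeared in the previous revision becomes a back-reference "@<index>";
    any other line is stored literally with a leading space (so literal
    lines starting with "@" stay unambiguous)."""
    out = []
    prev_index = {}  # line text -> its index in the previous revision
    for rev in revisions:
        lines = rev.splitlines()
        encoded = []
        for line in lines:
            if line in prev_index:
                encoded.append("@%d" % prev_index[line])  # back-reference
            else:
                encoded.append(" " + line)                # literal line
        out.append("\n".join(encoded))
        prev_index = {ln: i for i, ln in enumerate(lines)}
    # NUL separates revisions; assumes revision text contains no NUL bytes.
    return "\x00".join(out)

def compress(revisions):
    """Dedupe, then gzip the result."""
    return gzip.compress(dedupe_revisions(revisions).encode("utf-8"))

def restore(blob):
    """Invert compress(): gunzip, then resolve back-references."""
    text = gzip.decompress(blob).decode("utf-8")
    revisions = []
    prev_lines = []
    for chunk in text.split("\x00"):
        lines = []
        for enc in chunk.splitlines():
            if enc.startswith("@"):
                lines.append(prev_lines[int(enc[1:])])  # copy from prev rev
            else:
                lines.append(enc[1:])                   # strip literal marker
        revisions.append("\n".join(lines))
        prev_lines = lines
    return revisions
```

Because consecutive wiki revisions usually differ by only a few lines, most lines collapse to short `@n` tokens, which is what makes even this naive pass pay off before gzip.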
Again, if there's any lesson it's just that there are some gains from
making even pretty naive attempts to compress redundancy between revisions.
Interested to see what the summer dumps project produces.
I've also expanded the braindump about this stuff on the dbzip2 page you
linked to.
On Mon, May 6, 2013 at 9:02 AM, Randall Farmer <randall(a)wawd.com> wrote:
Sure.
On Mon, May 6, 2013 at 12:06 AM, Federico Leva (Nemo) <nemowiki(a)gmail.com> wrote:
Randall Farmer, 06/05/2013 08:37:
To wrap up what I started earlier, here's a slightly tweaked copy of the
last script I sent around [...] But, all that said, declaring blks2.py a
(kinda fun to work on!) dead end. :)
If you're done with it, you may want to drop it on a Wikimedia repo like
<https://gerrit.wikimedia.org/r/gitweb?p=operations/dumps.git;a=tree;f=toys;h=0974d59e573fd5bceb76ec93878471bc11f6430c;hb=119d99131f2cf692819422ad5e516c49d935a504>
or whatever, just so that it's not only a mail attachment.
I also copied some short info you sent earlier to
https://www.mediawiki.org/wiki/Dbzip2#rzip_and_xdelta3 for
lack of better existing pages (?).
Nemo