Re: [Xmldatadumps-l] Compressing full-history dumps faster

20 Jan 2014

      ...
Wow! Thanks for continuing work on this.
Thanks!
This brings up something I forgot to mention: this is a different codebase
from https://github.com/twotwotwo/dltp, what I last wrote you guys about.
dltp tries to handle a lot of different use cases with pages-articles and
incr dumps and diffs; histzip is a much shorter program just focused on
full-history compression.
1) Real life tests with the whole stack of all wikis' dumps to have a
complete representation of the effects: it seems you have enough
computational power on your own, but on a Labs instance you could read from
NFS a copy of the dumps to convert.
That would be great--the downloading is slow.
2) Testing if/how the binaries work on the cluster machines
I think I can make a compatible build somehow, but this is a good
point--off the top, does anyone know what distro/kernel one would need a
binary to run on, or where to find that info?
Well, to be useful this would need to replace one of the existing passages,
but common formats like bz2 and 7z are unlikely to be abandoned (on the
other hand people got used to the chunked dumps).
My working theory was that bzip2 is the format that will never be abandoned
because it's the one that most users already have a decompressor for.
There's a case for flipping that around and dropping bzip2 instead of 7zip,
though; more below.
Either way, I think it's reasonable to say that you'll post history dumps
in two formats, one that's widely supported (bzip2 or 7z), and a second
format (histzip) that's not ubiquitous but is a smaller download and/or can
be posted earlier each month since it's fast to create.
Finally, as you allude to, WMF has made changes now and again (even
experimented with a binary dump DB format), and the full-history-dump-using
community is probably relatively select and highly motivated (you don't
unpack 10TB on a whim), so maybe switching out one of the decompressors
just wouldn't be a huge deal.
In an ideal world you'd be able to achieve the same results without
requiring your program on decompression, "just" by smartly rearranging the
input fed to 7z/bzip2.
You can't. bzip2 loses its context every ~900KB of input, so (on its own)
it'll never get ideal performance on rev chains longer than that. And
rather than having an LZ-like algorithm that looks for repetitions, it has
a block-sorting algorithm where repetitions can even hurt compression speed
(search the bzip2 manpage for "repeats").
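The block-size limit is easy to see with a small experiment. The following is a minimal sketch using Python's standard bz2 and lzma modules, not the bzip2 or 7za command-line tools; the 1 MiB "revision" of random bytes and the helper name compressed_sizes are assumptions made for illustration. Because random bytes cannot be compressed on their own, any saving has to come from recognising that the second copy repeats the first, and that repeat sits further back than one ~900KB bzip2 block.

```python
import bz2
import lzma
import os


def compressed_sizes(data: bytes) -> dict:
    """Compressed size in bytes under bzip2 -9 and under LZMA (xz default preset)."""
    return {
        "bzip2": len(bz2.compress(data, compresslevel=9)),
        "lzma": len(lzma.compress(data)),
    }


# Hypothetical stand-in for one large revision: 1 MiB of random bytes, which
# no compressor can shrink on its own. Any saving must come from noticing
# that the second copy repeats the first.
revision = os.urandom(1 << 20)
sizes = compressed_sizes(revision + revision)

for name, size in sizes.items():
    print(f"{name}: {size / len(revision):.2f}x the size of one revision")
```

On this input bzip2 should end up storing roughly two full copies, while LZMA, whose default dictionary is several MiB, should store roughly one.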
7-zip *is* good at finding repeats over long distances, which leads to its
great compression ratios. It's just slower, especially on the default
settings. You actually get a good speed boost using the 7za option -mx=3,
which uses a 4MB history buffer (as histzip currently does, though that's
tunable). It's 3x slower than histzip|bzip in my tests and I think the files
are 20-30% larger, but it's still faster and smaller than plain bzip on
this content by a wide margin. If 7-zip files made with -mx=3 are "standard
enough" to be the format-of-record, that would provide better speeds and
ratios than bzip.
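To illustrate why the size of the history buffer matters, here is a rough sketch using Python's lzma module (liblzma) rather than 7za itself, so the numbers are only indicative of how an LZMA dictionary behaves. The helper xz_size, the 2 MiB random chunk and the two dictionary sizes compared are assumptions for the example, not settings taken from 7-Zip. A repeat that lies 2 MiB back is out of reach for a 1 MiB dictionary but within reach of a 4 MiB one.

```python
import lzma
import os


def xz_size(data: bytes, dict_size: int) -> int:
    """Compressed size using an LZMA2 filter with an explicit dictionary size."""
    filters = [{"id": lzma.FILTER_LZMA2, "preset": 3, "dict_size": dict_size}]
    return len(lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters))


MIB = 1 << 20
chunk = os.urandom(2 * MIB)  # incompressible on its own
data = chunk + chunk         # the repeat sits exactly 2 MiB back

small_window = xz_size(data, dict_size=1 * MIB)
large_window = xz_size(data, dict_size=4 * MIB)
print(f"1 MiB dictionary: {small_window / MIB:.2f} MiB")
print(f"4 MiB dictionary: {large_window / MIB:.2f} MiB")
```

The same trade-off applies to any window-based scheme: a larger window reaches older revisions at the cost of more memory and match-finding work.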
The source XML files are already arranged pretty well--long stretches of
revisions of the same article are together. histzip isn't rearranging or
parsing XML or doing anything "wiki-aware"--it's just using an algorithm
tuned for the long repeats within an n-MB range that often occur in change
histories.
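For readers who want a feel for what "finding long repeats within a window" can look like, below is a toy sketch. It is not histzip's actual algorithm or file format: the function names encode and decode, the 32-byte block size, the tuple-based operation list and the sample "revisions" are all hypothetical, and the code is written for clarity, not speed. It indexes fixed-size blocks of earlier input, and when the upcoming bytes match an indexed block within the window, it emits a copy instruction instead of the literal bytes.

```python
def encode(data: bytes, block: int = 32, window: int = 4 << 20) -> list:
    """Turn data into ("lit", bytes) and ("copy", distance, length) operations."""
    ops, table, literal = [], {}, bytearray()
    indexed = 0  # next aligned block start that has not been indexed yet
    i, n = 0, len(data)
    while i < n:
        # Index every aligned block that lies entirely before position i.
        while indexed + block <= i:
            table[data[indexed:indexed + block]] = indexed
            indexed += block
        src = table.get(data[i:i + block]) if i + block <= n else None
        if src is not None and i - src <= window:
            # Extend the match forward as far as the bytes keep agreeing.
            length = block
            while i + length < n and data[src + length] == data[i + length]:
                length += 1
            if literal:
                ops.append(("lit", bytes(literal)))
                literal.clear()
            ops.append(("copy", i - src, length))
            i += length
        else:
            literal.append(data[i])
            i += 1
    if literal:
        ops.append(("lit", bytes(literal)))
    return ops


def decode(ops: list) -> bytes:
    """Rebuild the original bytes from the operation list."""
    out = bytearray()
    for op in ops:
        if op[0] == "lit":
            out += op[1]
        else:
            _, distance, length = op
            start = len(out) - distance
            for k in range(length):  # byte by byte, so overlapping copies work
                out.append(out[start + k])
    return bytes(out)


# Two made-up revisions of one page: the second differs by a small edit.
rev1 = b"".join(b"Line %d of the article text.\n" % k for k in range(200))
rev2 = rev1.replace(b"Line 100 of", b"Line 100 (edited) of")
history = rev1 + rev2

ops = encode(history)
literal_bytes = sum(len(op[1]) for op in ops if op[0] == "lit")
print(f"{len(history)} bytes in, {literal_bytes} literal bytes kept")
```

In this sketch the second revision collapses to a couple of copy instructions plus the few bytes around the edit, and the remaining literals could then be handed to a general-purpose compressor such as bzip2.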
On Mon, Jan 20, 2014 at 4:02 PM, Federico Leva (Nemo) nemowiki@gmail.com wrote:
...
Randall Farmer, 20/01/2014 23:39:
...
Hi, everyone.
tl;dr: New tool compresses full-history XML at 100MB/s, not 4MB/s, with
the same avg compression ratio as 7zip.  [...]
Wow! Thanks for continuing work on this.
Technical datadump aside: *How could I get this more thoroughly tested,
then maybe added to the dump process, perhaps with an eye to eventually
replacing 7zip as the alternate, non-bzip2 compressor?
It depends what sort of tests you have in mind.

1) Real life tests with the whole stack of all wikis' dumps to have a
complete representation of the effects: it seems you have enough
computational power on your own, but on a Labs instance you could read from
NFS a copy of the dumps to convert.
2) Testing if/how the binaries work on the cluster machines: you should be
able to take the puppet config of the snapshot hosts and just run your
tests with the same config. (Again, this is what Labs is for.)
3) Testing the whole dumps process with the new system: I don't think
there is a duplicate infrastructure to test stuff on, so this is unlikely.
You'd have to contribute puppet changes and get them deployed by Ariel,
presumably first adding the new format without replacing any and just with
some wikis/workers.

Who do I talk to to get started? (I'd dealt with Ariel Glenn before, but
haven't seen activity from Ariel lately, and in any case maybe playing with
a new tool falls under Labs or some other heading than dumps devops.)
Ops other than Ariel would probably also be involved in some way. In
particular, the utility should AFAIK become a debian package, whether
available in the official repos or just in WMF's. I don't understand if
this is the relevant doc: https://wikitech.wikimedia.org/wiki/Git-buildpackage
Am I nuts to be even asking about this? Are there things that would
definitely need to change for integration to be possible? Basically, I'm
trying to get this from a tech demo to something with real-world utility.
Well, to be useful this would need to replace one of the existing
passages, but common formats like bz2 and 7z are unlikely to be abandoned
(on the other hand people got used to the chunked dumps). In an ideal world
you'd be able to achieve the same results without requiring your program on
decompression, "just" by smartly rearranging the input fed to 7z/bzip2.
Most of the repetition is about identical or almost identical revisions and
I've no idea in what order they appear in the XML; bzip2 is careful but
myopic.
Nemo
