Hi, everyone.
tl;dr: New tool compresses full-history XML at 100MB/s, not 4MB/s, with the same avg compression ratio as 7zip. Can anyone help me test more or experimentally deploy?
As I understand, compressing full-history dumps for English Wikipedia and other big wikis takes a lot of resources: enwiki history is about 10TB unpacked, and 7zip only packs a few MB/s/core. Even with 32 cores, that's over a day of server time. There's been talk about ways to speed that up in the past.[1]
It turns out that for history dumps in particular, you can compress many times faster if you do a first pass that just trims the long chunks of text that didn't change between revisions. A program called rzip[2] does this (and rzip's _very_ cool, but fatally for us it can't stream input or output). The general approach is sometimes called Bentley-McIlroy compression.[3]
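To make that idea concrete, here's a toy sketch of such a long-match pre-pass in Go. It's purely illustrative: the function and constant names are mine, and histzip's real implementation differs (it streams, and a production tool would use a rolling hash rather than re-hashing every window). The idea is just to remember where recent fixed-size windows of input were seen and, when one reappears within the last few MB, replace the repeated run with a short (distance, length) reference.

// Toy sketch of a long-match pre-pass (illustrative only, not histzip's
// actual code): remember where recent 32-byte windows were seen, and when
// one reappears within the last few MB, replace the repeated run with a
// short (distance, length) reference.
package main

import (
	"fmt"
	"hash/fnv"
)

const (
	window  = 32      // bytes hashed per anchor
	history = 4 << 20 // only match against roughly the last 4MB
	minCopy = 64      // ignore matches shorter than this
)

// op is either a literal run (dist == 0) or "copy n bytes from dist bytes back".
type op struct {
	dist, n int
}

func hashAt(p []byte, i int) uint64 {
	h := fnv.New64a()
	h.Write(p[i : i+window])
	return h.Sum64()
}

func longMatchOps(p []byte) []op {
	var ops []op
	seen := make(map[uint64]int) // window hash -> most recent position
	lit := 0                     // start of the pending literal run
	for i := 0; i+window <= len(p); {
		h := hashAt(p, i)
		j, ok := seen[h]
		seen[h] = i
		if ok && i-j <= history {
			// Candidate match: verify the bytes and extend it forward.
			n := 0
			for i+n < len(p) && p[j+n] == p[i+n] {
				n++
			}
			if n >= minCopy {
				if i > lit {
					ops = append(ops, op{0, i - lit})
				}
				ops = append(ops, op{i - j, n})
				i += n
				lit = i
				continue
			}
		}
		i++
	}
	if len(p) > lit {
		ops = append(ops, op{0, len(p) - lit})
	}
	return ops
}

func main() {
	rev := []byte("Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.")
	// Two near-identical "revisions" separated by a small edit.
	doc := append(append(append([]byte{}, rev...), " [minor edit] "...), rev...)
	for _, o := range longMatchOps(doc) {
		if o.dist == 0 {
			fmt.Printf("literal run: %d bytes\n", o.n)
		} else {
			fmt.Printf("copy: %d bytes from %d bytes back\n", o.n, o.dist)
		}
	}
}

On revision history, almost everything after the first copy of a page's text turns into copy references like the second op above, which is why a cheap pre-pass plus a fast general-purpose compressor gets close to the ratio of a much slower compressor.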
So I wrote something I'm calling histzip.[4] It compresses long repeated sections using a history buffer of a few MB. If you pipe history XML through histzip to bzip2, the whole process can go ~100 MB/s/core, so we're talking an hour or three to pack enwiki on a big box. While it compresses, it also self-tests by unpacking its output and comparing checksums against the original. I've done a couple test runs on last month's fullhist dumps without checksum errors or crashes. Last full run I did, the whole dump compressed to about 1% smaller than 7zip's output; the exact ratios varied file to file (I think it's relatively better at pages with many revisions) but were +/- 10% of 7zip's in general.
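For anyone curious what the self-test looks like, here's a minimal round-trip check in Go with compress/gzip standing in for histzip, so the example runs with only the standard library (histzip's own checking may be wired differently): checksum the input as it streams into the compressor, then decompress the result and verify the checksums agree.

// Round-trip self-test sketch: hash the input while compressing, then
// decompress the output and confirm the checksums match. gzip is a
// stand-in here; the pattern is the point, not the codec.
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"hash/crc64"
	"io"
	"log"
	"strings"
)

func main() {
	input := strings.Repeat("<revision>mostly the same text</revision>\n", 1000)
	table := crc64.MakeTable(crc64.ECMA)

	// Compress, teeing the original bytes into a checksum as they stream past.
	inHash := crc64.New(table)
	var packed bytes.Buffer
	zw := gzip.NewWriter(&packed)
	if _, err := io.Copy(io.MultiWriter(zw, inHash), strings.NewReader(input)); err != nil {
		log.Fatal(err)
	}
	if err := zw.Close(); err != nil {
		log.Fatal(err)
	}

	// Self-test: decompress what was just written and checksum that too.
	zr, err := gzip.NewReader(&packed)
	if err != nil {
		log.Fatal(err)
	}
	outHash := crc64.New(table)
	if _, err := io.Copy(outHash, zr); err != nil {
		log.Fatal(err)
	}

	if inHash.Sum64() != outHash.Sum64() {
		log.Fatal("round-trip checksum mismatch")
	}
	fmt.Printf("ok: %d bytes in, %d bytes compressed, checksums match\n", len(input), packed.Len())
}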
Less excitingly, histzip is also a reasonably cheap way to get daily incr dumps about 30% smaller.
Technical data-dump aside: *How could I get this more thoroughly tested, then maybe added to the dump process, perhaps with an eye to eventually replacing 7zip as the alternate, non-bzip2 compressor?* Who do I talk to to get started? (I'd dealt with Ariel Glenn before, but haven't seen activity from Ariel lately, and in any case maybe playing with a new tool falls under Labs or some other heading than dumps devops.) Am I nuts to be even asking about this? Are there things that would definitely need to change for integration to be possible? Basically, I'm trying to get this from a tech demo to something with real-world utility.
Best, Randall
[1] Some past discussion/experiments are captured at http://www.mediawiki.org/wiki/Dbzip2, and some old scripts I wrote are at https://git.wikimedia.org/commit/operations%2Fdumps/11e9b23b4bc76bf3d89e1fb3...
[2] http://rzip.samba.org/
[3] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.11.8470&rep=rep...
[4] https://github.com/twotwotwo/histzip
Randall Farmer, 20/01/2014 23:39:
Hi, everyone.
tl;dr: New tool compresses full-history XML at 100MB/s, not 4MB/s, with the same avg compression ratio as 7zip. [...]
Wow! Thanks for continuing work on this.
Technical data-dump aside: *How could I get this more thoroughly tested, then maybe added to the dump process, perhaps with an eye to eventually replacing 7zip as the alternate, non-bzip2 compressor?
It depends what sort of tests you have in mind.
1) Real life tests with the whole stack of all wikis' dumps to have a complete representation of the effects: it seems you have enough computational power on your own, but on a Labs instance you could read from NFS a copy of the dumps to convert.
2) Testing if/how the binaries work on the cluster machines: you should be able to take the puppet config of the snapshot hosts and just run your tests with the same config. (Again, this is what Labs is for.)
3) Testing the whole dumps process with the new system: I don't think there is a duplicate infrastructure to test stuff on, so this is unlikely. You'd have to contribute puppet changes and get them deployed by Ariel, presumably first adding the new format without replacing any and just with some wikis/workers.
Who do I talk to to get started? (I'd dealt with Ariel Glenn before, but haven't seen activity from Ariel lately, and in any case maybe playing with a new tool falls under Labs or some other heading than dumps devops.)
Ops other than Ariel would probably also be involved in some way. In particular, the utility should AFAIK become a Debian package, whether available in the official repos or just in WMF's. I'm not sure if this is the relevant doc: https://wikitech.wikimedia.org/wiki/Git-buildpackage
Am I nuts to be even asking about this? Are there things that would definitely need to change for integration to be possible? Basically, I'm trying to get this from a tech demo to something with real-world utility.
Well, to be useful this would need to replace one of the existing compression passes, but common formats like bz2 and 7z are unlikely to be abandoned (on the other hand people got used to the chunked dumps). In an ideal world you'd be able to achieve the same results without requiring your program on decompression, "just" by smartly rearranging the input fed to 7z/bzip2. Most of the repetition is in identical or almost-identical revisions, and I've no idea in what order they appear in the XML; bzip2 is careful but myopic.
Nemo
Wow! Thanks for continuing work on this.
Thanks!
This brings up something I forgot to mention: this is a different codebase from https://github.com/twotwotwo/dltp, which is what I last wrote you about. dltp tries to handle a lot of different use cases with pages-articles and incr dumps and diffs; histzip is a much shorter program focused just on full-history compression.
1) Real life tests with the whole stack of all wikis' dumps to have a complete representation of the effects: it seems you have enough computational power on your own, but on a Labs instance you could read from NFS a copy of the dumps to convert.
That would be great--the downloading is slow.
2) Testing if/how the binaries work on the cluster machines
I think I can make a compatible build somehow, but this is a good point--off the top of my head, does anyone know what distro/kernel a binary would need to run on, or where to find that info?
Well, to be useful this would need to replace one of the existing compression passes, but common formats like bz2 and 7z are unlikely to be abandoned (on the other hand people got used to the chunked dumps).
My working theory was that bzip2 is the format that will never be abandoned because it's the one that most users already have a decompressor for. There's a case for flipping that around and dropping bzip2 instead of 7zip, though; more below.
Either way, I think it's reasonable to plan on posting history dumps in two formats: one that's widely supported (bzip2 or 7z), and a second (histzip) that's not ubiquitous but makes for a smaller download and/or can be posted earlier each month since it's fast to create.
Finally, as you allude to, WMF has made changes now and again (even experimented with a binary dump DB format), and the full-history-dump-using community is probably relatively select and highly motivated (you don't unpack 10TB on a whim), so maybe switching out one of the decompressors just wouldn't be a huge deal.
In an ideal world you'd be able to achieve the same results without requiring your program on decompression, "just" by smartly rearranging the input fed to 7z/bzip2.
You can't. bzip2 loses its context every ~900KB of input, so (on its own) it'll never get ideal performance on rev chains longer than that. And rather than having an LZ-like algorithm that looks for repetitions, it has a block-sorting algorithm where repetitions can even hurt compression speed (search the bzip2 manpage for "repeats").
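Here's a scaled-down illustration of that window/block limit in Go, using DEFLATE's 32KB window as an analogue for bzip2's ~900KB blocks (Go's standard library can only decompress bzip2, not compress it, hence the stand-in): the same 2x-redundant data compresses well when the repeat falls inside the window, and barely at all when it falls outside.

// Small-window compressors can't exploit repeats farther back than their
// window; flate's 32KB window stands in for bzip2's ~900KB block here.
package main

import (
	"bytes"
	"compress/flate"
	"fmt"
	"log"
	"math/rand"
)

// packedSize deflates a block of random bytes repeated twice and reports
// the compressed size; the repeat sits blockLen bytes back in the stream.
func packedSize(blockLen int) int {
	block := make([]byte, blockLen)
	rand.New(rand.NewSource(1)).Read(block)
	data := append(append([]byte{}, block...), block...)

	var out bytes.Buffer
	w, err := flate.NewWriter(&out, flate.BestCompression)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := w.Write(data); err != nil {
		log.Fatal(err)
	}
	if err := w.Close(); err != nil {
		log.Fatal(err)
	}
	return out.Len()
}

func main() {
	// Repeat 16KB back: inside the 32KB window, so the second copy
	// compresses to almost nothing.
	fmt.Println("16KB block repeated:", packedSize(16<<10), "bytes compressed")
	// Repeat 64KB back: outside the window, so both copies are stored
	// near full size even though the data is 50% redundant.
	fmt.Println("64KB block repeated:", packedSize(64<<10), "bytes compressed")
}

The second number comes out roughly twice the first, which is the same failure mode bzip2 has on revision chains longer than its block, just at a smaller scale.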
7-zip *is* good at finding repeats over long distances, which leads to its great compression ratios. It's just slower, especially on the default settings. You actually get a good speed boost using the 7za option -mx=3, which uses a 4MB history buffer (as histzip currently does, though that's tunable). It's 3x slower than histzip|bzip in my tests and the files are 20-30% larger I think, but it's still faster and smaller than plain bzip on this content by a long margin. If 7-zip files made with -mx=3 are "standard enough" to be the format-of-record, that would provide better speeds and ratios than bzip.
The source XML files are already arranged pretty well--long stretches of revisions of the same article are together. histzip isn't rearranging or parsing XML or doing anything "wiki-aware"--it's just using an algorithm tuned for the long repeats within an n-MB range that often occur in change histories.
* Randall Farmer wrote:
As I understand, compressing full-history dumps for English Wikipedia and other big wikis takes a lot of resources: enwiki history is about 10TB unpacked, and 7zip only packs a few MB/s/core. Even with 32 cores, that's over a day of server time. There's been talk about ways to speed that up in the past.[1]
That does not sound like much economically. Do keep in mind the cost of porting, deploying, maintaining, obtaining, and so on, new tools. There might be hundreds of downstream users, and if every one of them has to spend a couple of minutes adapting to a new format, that can quickly outweigh any savings, as a simple example.
Technical data-dump aside: *How could I get this more thoroughly tested, then maybe added to the dump process, perhaps with an eye to eventually replacing 7zip as the alternate, non-bzip2 compressor?* Who do I talk to to get started? (I'd dealt with Ariel Glenn before, but haven't seen activity from Ariel lately, and in any case maybe playing with a new tool falls under Labs or some other heading than dumps devops.) Am I nuts to be even asking about this? Are there things that would definitely need to change for integration to be possible? Basically, I'm trying to get this from a tech demo to something with real-world utility.
I would definitely recommend talking to Igor Pavlov (7-Zip) about this; he might be interested in having this as part of 7-Zip as some kind of "fast" option, and also to the developers of the `xz` tools. There might even be ways this could fit within existing extensibility mechanisms of the formats. Igor Pavlov tends to be quite responsive through the SF.net bug tracker. In any case, they might be able to give directions on how this might become, or not, part of standard tools.
That does not sound like much economically. Do keep in mind the cost of porting, deploying, maintaining, obtaining, and so on, new tools.
Briefly, yes, CPU-hours don't cost too much, but I don't think the potential win is limited to the direct CPU-hours saved.
In more detail: for Wikimedia, a quicker-running task is probably easier to manage and maybe less likely to fail and need human attention; dump users get more up-to-date content if dump processing is quicker; and users who get histzip also get a tool they can (for example) use to quickly pack a modified XML file in a pipeline. It's a relatively small (500-line), hackable tool and could serve as a base for later work: for instance, I've tried to rig the format so that future compressors can make backwards-compatible archives they can insert into without recompressing all the TBs of input. There are pages on meta going back a few years with ideas for improving compression speed, and there have been past format changes for operational reasons (chunking full-history dumps) and other dump-related proposals in Wikimedia-land (a project this past summer about a new dump tool), so I don't think I'm entirely swatting at gnats by trying to work up another possible tool.
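For what it's worth, here's one generic way to get that add-without-recompressing property, sketched in Go. This is just an illustration of the idea, not histzip's actual on-disk format: if the archive is a sequence of independent, length-prefixed frames, a later tool can append new frames without rewriting the ones already there.

// Generic appendable container sketch (illustrative; not histzip's format):
// each frame is a uvarint length followed by a self-contained payload.
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
	"log"
)

// writeFrame prefixes one self-contained payload with its length.
func writeFrame(w io.Writer, payload []byte) error {
	var lenBuf [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(lenBuf[:], uint64(len(payload)))
	if _, err := w.Write(lenBuf[:n]); err != nil {
		return err
	}
	_, err := w.Write(payload)
	return err
}

// readFrames walks the archive frame by frame until EOF.
func readFrames(r *bytes.Reader) ([][]byte, error) {
	var frames [][]byte
	for {
		size, err := binary.ReadUvarint(r)
		if err == io.EOF {
			return frames, nil
		}
		if err != nil {
			return nil, err
		}
		buf := make([]byte, size)
		if _, err := io.ReadFull(r, buf); err != nil {
			return nil, err
		}
		frames = append(frames, buf)
	}
}

func main() {
	var archive bytes.Buffer
	// An older tool wrote this frame...
	writeFrame(&archive, []byte("revisions packed in January"))
	// ...and a newer, backwards-compatible tool appends another one
	// without rewriting what is already there.
	writeFrame(&archive, []byte("revisions appended in February"))

	frames, err := readFrames(bytes.NewReader(archive.Bytes()))
	if err != nil {
		log.Fatal(err)
	}
	for i, f := range frames {
		fmt.Printf("frame %d: %q\n", i, f)
	}
}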
I'm talking about keeping at least one of the current, widely supported formats around, which I think would limit hardship for existing users. I'm sort of curious how many full-history-dump users there are and if they have anything to say. You mentioned porting; histzip is a Go program that's easy to cross-compile for different OSes/architectures (as I have for Windows/Mac/Linux on the github page, though not various BSDs).
I would definitely recommend talking to Igor Pavlov (7-Zip) about this; he might be interested in having this as part of 7-Zip as some kind of "fast" option, and also to the developers of the `xz` tools. There might even be ways this could fit within existing extensibility mechanisms of the formats.
7-Zip is definitely a very cool and flexible program. I think it can actually run faster than it's going in the current dumps setup: -mx=3 maintains ratios better than bzip's but runs faster than bzip. That's a few times slower than histzip|bzip and slightly larger output, but it's a boost from the status quo. (There's an argument for maintaining that, not bzip, as the widely supported format, which I'd mentioned in the xmldatadumps-l branch of this thread, or for just changing the 7z settings and calling it a day.)
Interesting to hear from Nemo that Pavlov was interested in long-range zipping. histzip doesn't have source he could drop into his C program (it's in Go) and it's really aimed at a narrow niche (long repetitions at a certain distance) so I doubt I could get it integrated there.
Anyway, I'm saying too many fundamentally unimportant words. If the status quo re: compression in fact causes enough pain to give histzip a fuller look, or if there's some way to redirect the tech in it towards a useful end, it would be great to hear from interested folks; if not, it was fun work but there may not be much more to do or say.