> That does not sound like much economically. Do keep in mind the cost of
> porting, deploying, maintaining, obtaining, and so on, new tools.
Briefly, yes, CPU-hours don't cost too much, but I don't think the
potential win is limited to the direct CPU-hours saved.
In more detail: for Wikimedia, a quicker-running task is probably easier to
manage and maybe less likely to fail and need human attention; dump users
get more up-to-date content if dump processing is quicker; and users who get
histzip also get a tool they can (for example) use to quickly repack a
modified XML dump in a pipeline. It's a relatively small
(500-line), hackable tool and could serve as a base for later work: for
instance, I've tried to rig the format so future compressors make
backwards-compatible archives they can insert into without recompressing
all the TBs of input. There are pages on meta going a few years back about
ideas for improving compression speed, and there were past format changes
for operational reasons (chunking full-history dumps) and other
dump-related proposals in Wikimedia-land (a project this past summer about
a new dump tool), so I don't think I'm entirely swatting at gnats by trying
to work up another possible tool.
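To make the pipeline idea concrete, here's a rough sketch of how histzip composes with standard stream tools. The filenames are hypothetical, and I'm assuming stdin/stdout usage with a companion `unhistzip` decompressor, as in the project's examples; treat the exact invocations as illustrative rather than verified.

```shell
# Sketch (filenames and exact tool invocations are assumptions):
# histzip removes the long-range repetition between revisions, then bzip2
# squeezes the short-range redundancy that remains.
bzip2 -dc pages-meta-history.xml.bz2 | histzip | bzip2 -c > pages-meta-history.hz.bz2

# And the reverse direction, recovering the original XML stream:
bzip2 -dc pages-meta-history.hz.bz2 | unhistzip > pages-meta-history.xml
```

Because everything is a stream, a downstream user could splice their own filter into the middle of that pipeline without ever materializing the 10TB unpacked file on disk.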
I'm talking about keeping at least one of the current, widely supported
formats around, which I think would limit hardship for existing users. I'm
sort of curious how many full-history-dump users there are and if they have
anything to say. You mentioned porting; histzip is a Go program that's easy
to cross-compile for different OSes/architectures (as I have for
Windows/Mac/Linux on the github page, though not various BSDs).
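For anyone unfamiliar with Go's cross-compilation: it's just two environment variables in front of `go build`. A sketch, with the output names chosen for illustration (the BSDs are one more line each, which is why supporting them later would be cheap):

```shell
# Standard GOOS/GOARCH values; run from the histzip source directory.
GOOS=windows GOARCH=amd64 go build -o histzip.exe .
GOOS=darwin  GOARCH=amd64 go build -o histzip-mac .
GOOS=linux   GOARCH=amd64 go build -o histzip-linux .
GOOS=freebsd GOARCH=amd64 go build -o histzip-freebsd .
```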
> I would definitely recommend talking to Igor Pavlov (7-Zip) about this,
> he might be interested in having this as part of 7-Zip as some kind of
> "fast" option, and also the developers of the `xz` tools. There might
> even be ways this could fit within existing extensibility mechanisms of
> the formats.
7-Zip is definitely a very cool and flexible program. I think it can
actually run faster than it's going in the current dumps setup: -mx=3
maintains ratios better than bzip's, but runs faster than bzip. That's a
few times slower than histzip|bzip and slightly larger output, but it's a
boost from the status quo. (There's an argument for maintaining that, not
bzip, as the widely supported format, which I'd mentioned in the
xmldatadumps-l branch of this thread, or for just changing the 7z settings
and calling it a day.)
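Concretely, the settings change is one switch on the command line. The `-mx` level is 7-Zip's real ratio-vs-speed knob (default is `-mx=5`); the filenames here are assumptions:

```shell
# Faster preset discussed above: weaker per-block effort, still
# better ratios than bzip2 on this input, and quicker than bzip2.
7za a -mx=3 pages-meta-history-fast.7z pages-meta-history.xml

# Status-quo-style run at the default level, compressing harder but
# several times slower on highly repetitive full-history XML.
7za a pages-meta-history-default.7z pages-meta-history.xml
```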
Interesting to hear from Nemo that Pavlov was interested in long-range
zipping. histzip doesn't have source he could drop into his C program (it's
in Go) and it's really aimed at a narrow niche (long repetitions at a
certain distance) so I doubt I could get it integrated there.
Anyway, I'm saying too many fundamentally unimportant words. If the status
quo re: compression in fact causes enough pain to give histzip a fuller
look, or if there's some way to redirect the tech in it towards a useful
end, it would be great to hear from interested folks; if not, it was fun
work but there may not be much more to do or say.
On Mon, Jan 20, 2014 at 4:49 PM, Bjoern Hoehrmann <derhoermi(a)gmx.net> wrote:
> * Randall Farmer wrote:
> > As I understand, compressing full-history dumps for English Wikipedia and
> > other big wikis takes a lot of resources: enwiki history is about 10TB
> > unpacked, and 7zip only packs a few MB/s/core. Even with 32 cores, that's
> > over a day of server time. There's been talk about ways to speed that up
> > in the past.[1]
> That does not sound like much economically. Do keep in mind the cost of
> porting, deploying, maintaining, obtaining, and so on, new tools. There
> might be hundreds of downstream users and if every one of them has to
> spend a couple of minutes adopting to a new format, that can quickly
> outweigh any savings, as a simple example.
> > Technical datadump aside: *How could I get this more thoroughly tested,
> > then maybe added to the dump process, perhaps with an eye to eventually
> > replacing 7zip as the alternate, non-bzip2 compressor?* Who do I talk
> > to to get started? (I'd dealt with Ariel Glenn before, but haven't seen
> > activity from Ariel lately, and in any case maybe playing with a new tool
> > falls under Labs or some other heading than dumps devops.) Am I nuts to be
> > even asking about this? Are there things that would definitely need to
> > change for integration to be possible? Basically, I'm trying to get this
> > from a tech demo to something with real-world utility.
> I would definitely recommend talking to Igor Pavlov (7-Zip) about this,
> he might be interested in having this as part of 7-Zip as some kind of
> "fast" option, and also the developers of the `xz` tools. There might
> even be ways this could fit within existing extensibility mechanisms of
> the formats. Igor Pavlov tends to be quite responsive through the SF.net
> bug tracker. In any case, they might be able to give directions how this
> might become, or not, part of standard tools.
> --
> Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
> Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
> 25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
_______________________________________________
Xmldatadumps-l mailing list
Xmldatadumps-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l