Re: [Xmldatadumps-l] Compressing full-history dumps faster

21 Jan 2014

      Randall Farmer, 20/01/2014 23:39:
...
Hi, everyone.
tl;dr: New tool compresses full-history XML at 100MB/s, not 4MB/s, with
the same avg compression ratio as 7zip.  [...]
Wow! Thanks for continuing work on this.
...
Technical datadump aside: How could I get this more thoroughly tested,
then maybe added to the dump process, perhaps with an eye to eventually
replacing 7zip as the alternate, non-bzip2 compressor?
It depends what sort of tests you have in mind.
1) Real life tests with the whole stack of all wikis' dumps to have a 
complete representation of the effects: it seems you have enough 
computational power on your own, but on a Labs instance you could read 
from NFS a copy of the dumps to convert.
2) Testing if/how the binaries work on the cluster machines: you should 
be able to take the puppet config of the snapshot hosts and just run 
your tests with the same config. (Again, this is what Labs is for.)
3) Testing the whole dumps process with the new system: I don't think 
there is a duplicate infrastructure to test stuff on, so this is 
unlikely. You'd have to contribute puppet changes and get them deployed 
by Ariel, presumably first adding the new format without replacing any 
existing one, and only for some wikis/workers.
...

Who do I talk to to get started? (I'd dealt with Ariel Glenn before, but haven't
seen activity from Ariel lately, and in any case maybe playing with a
new tool falls under Labs or some other heading than dumps devops.)
Ops other than Ariel would probably also be involved in some way. In 
particular, the utility should AFAIK become a Debian package, whether 
available in the official repos or just in WMF's. I'm not sure whether 
this is the relevant doc: 
https://wikitech.wikimedia.org/wiki/Git-buildpackage
...
Am I nuts to be even asking about this? Are there things that would
definitely need to change for integration to be possible? Basically, I'm
trying to get this from a tech demo to something with real-world utility.
Well, to be useful this would need to replace one of the existing 
steps, but common formats like bz2 and 7z are unlikely to be 
abandoned (on the other hand people got used to the chunked dumps). In 
an ideal world you'd be able to achieve the same results without 
requiring your program on decompression, "just" by smartly rearranging 
the input fed to 7z/bzip2. Most of the repetition is about identical or 
almost identical revisions and I've no idea in what order they appear in 
the XML; bzip2 is careful but myopic.
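
To make the "myopic" point concrete, here is a rough sketch, not a measurement on real dumps. It assumes synthetic "pages" of random bytes with a few near-identical revisions each; all names and sizes (PAGE_SIZE, N_PAGES, N_REVS, the helper functions) are made up for the illustration. It compresses the same bytes in two orders: revisions of a page next to each other, and revisions spread far apart. bzip2 works on blocks of at most 900 kB, so it should only benefit when the similar revisions fall inside one block, while xz at its default preset uses a much larger dictionary and should cope with either order.

```python
import bz2
import lzma
import random

# Hypothetical sizes chosen only for the demonstration.
PAGE_SIZE = 100_000  # bytes per revision
N_PAGES = 12         # 12 * 100 kB = 1.2 MB, more than one 900 kB bzip2 block
N_REVS = 3           # revisions per page


def make_history(seed=0):
    """Return history[page][rev]: near-identical revisions of random 'pages'."""
    rng = random.Random(seed)
    history = []
    for _ in range(N_PAGES):
        text = bytearray(rng.getrandbits(8) for _ in range(PAGE_SIZE))
        revs = []
        for _ in range(N_REVS):
            revs.append(bytes(text))
            # A small "edit": overwrite 50 bytes at a random offset.
            pos = rng.randrange(PAGE_SIZE - 50)
            text[pos:pos + 50] = bytes(rng.getrandbits(8) for _ in range(50))
        history.append(revs)
    return history


def grouped(history):
    """All revisions of a page next to each other."""
    return b"".join(rev for revs in history for rev in revs)


def interleaved(history):
    """Revision 0 of every page, then revision 1 of every page, and so on."""
    return b"".join(revs[i] for i in range(N_REVS) for revs in history)


def sizes(history):
    """Compressed sizes in bytes for both orderings and both compressors."""
    out = {}
    for name, order in (("grouped", grouped), ("interleaved", interleaved)):
        data = order(history)
        out[name] = {
            "raw": len(data),
            "bz2": len(bz2.compress(data, 9)),  # level 9 = 900 kB blocks
            "xz": len(lzma.compress(data)),     # default preset
        }
    return out


if __name__ == "__main__":
    for name, row in sizes(make_history()).items():
        print(name, row)
```

Random bytes are a worst case for any compressor, so whatever gain shows up here comes only from the repeated revisions. Real wikitext would behave differently, and the order in which revisions actually appear in the XML would still need checking.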
Nemo
