On Wed, Jan 7, 2009 at 8:31 PM, Gregory Maxwell <gmaxwell@gmail.com> wrote:
On Wed, Jan 7, 2009 at 4:43 PM, Robert Rohde <rarohde@gmail.com> wrote:
reduction in size (11.1 GB). Because it is still a text-based format, it stacks well with traditional file compressors (bz2: 89% reduction - 1.24 GB; 7z: 91% reduction - 1.07 GB).
Ruwiki dumps currently show: pages-meta-history.xml.7z 1.3 GB
Not really all that much of a win after 7z-ing, considering the current performance numbers you mentioned. (No doubt your code could be made faster... but at the same time, 7z is not the state of the art in raw compression ratio.)
Not that your format wouldn't have many uses... but it doesn't appear to offer significant gains for bulk transport. (In the future, it would be helpful if you cited the current compressed size when comparing new compressed sizes.)
Yes, you are right about that. For bulk transport and storage it is not a big improvement.
However, to work with ruwiki, for example, one generally needs to decompress it to the full 170 GB. Enwiki's full revision history, if such a dump is ever to exist again, would probably decompress to ~2 TB. 7z and bz2 are not great formats if one wants to extract only portions of the dump, since there are few tools that allow one to do so without first reinflating the whole file. Hence, one of the advantages I see in my format is that a dump can still be <10% of the full inflated size while selected articles or selected revisions can be parsed out in a straightforward manner.
-Robert Rohde
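
To make the partial-extraction point above concrete, here is a minimal Python sketch of what pulling selected articles out of a bz2-compressed XML dump looks like with standard-library tools. It is not the format or code discussed in this thread. The function names `extract_pages` and `local` are made up for illustration, and the element names (`page`, `title`, `text`) assume the usual MediaWiki export layout. The sketch avoids writing the inflated file to disk, but it still has to decompress and parse the stream from the beginning up to every page it returns, which is the cost being described.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an XML namespace prefix, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def extract_pages(stream, wanted_titles):
    """Yield (title, [revision texts]) for pages whose title is wanted.

    `stream` is any binary file-like object that produces dump XML,
    for example the result of bz2.open(path, "rb"). The whole stream is
    read sequentially; there is no way to seek straight to one page.
    """
    wanted = set(wanted_titles)
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title = None
        texts = []
        for child in elem.iter():
            name = local(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                texts.append(child.text or "")
        if title in wanted:
            yield title, texts
        # Drop the finished page's children so its revisions can be freed.
        elem.clear()
```

To try it, open a dump with `bz2.open("dump.xml.bz2", "rb")` (a hypothetical file name) and iterate over `extract_pages(stream, ["Some title"])`. Each page is held in memory until its closing tag is seen, so a page with a very long revision history will still be expensive.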