On 15 December 2010 20:41, Anthony wikimail@inbox.org wrote:
On Wed, Dec 15, 2010 at 12:01 PM, Andrew Dunbar hippytrail@gmail.com wrote:
By the way I'm keen to find something similar for .7z
I've written something similar for .xz, which uses LZMA2, the same as .7z. It creates a virtual read-only filesystem using FUSE (the FUSE part is in Perl and uses pipes to dd and xzcat). The only real problem is that it doesn't work on a stock .xz file: it needs a specially created one made by concatenating lots of smaller .xz streams (currently I put roughly 5 to 20 of the 900K bz2 blocks into each .xz stream - between 5 and 20 because I prefer to split on </page><page> boundaries).
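Roughly, the read path boils down to something like this (untested sketch; the idea of passing one byte offset and compressed length per stream on the command line is just for illustration, and the archive path and numbers are assumed to be trusted):

#!/usr/bin/perl
# Sketch only: extract one stream from the concatenated .xz file and
# decompress it, the way the FUSE backend does with dd and xzcat.
# The (offset, length) pair is assumed to come from a prebuilt index.
use strict;
use warnings;

my ($archive, $offset, $length) = @ARGV;

# bs=1 keeps the arithmetic simple but is slow; a real version would use a
# bigger block size (or GNU dd's skip_bytes/count_bytes input flags).
my $cmd = "dd if=$archive bs=1 skip=$offset count=$length 2>/dev/null | xzcat";
open(my $xz, '-|', $cmd) or die "cannot run '$cmd': $!";
binmode $xz;
binmode STDOUT;
my $buf;
print $buf while read($xz, $buf, 65536);
close $xz;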
At the moment I'm interested in .bz2 and .7z because those are the formats Wikimedia currently publishes data in. Some files are also published as .gz, so I'd like to find a solution for those too.
I thought about the concatenation solution splitting at <page> boundaries for .bz2 until I found out there was already a solution that worked with the vanilla dump files as is.
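For reference, that vanilla-.bz2 approach has to locate block boundaries by scanning for bzip2's 48-bit block magic (0x314159265359) at arbitrary bit offsets - the "painful bit shifting" mentioned below. A rough sketch of that scan (my own, untested; it can also hit false positives where the magic happens to occur inside compressed data, so hits would need verifying):

#!/usr/bin/perl
# Untested sketch: report candidate bzip2 block starts as bit offsets by
# searching for the 48-bit block magic at any bit position.  Works on a plain
# .bz2 dump file as-is, at the cost of an 8x memory blow-up per chunk from the
# bit-string representation.
use strict;
use warnings;

my ($file) = @ARGV;
my $magic_bits = unpack('B*', "\x31\x41\x59\x26\x53\x59");   # 0x314159265359

open(my $fh, '<:raw', $file) or die "open $file: $!";
my $carry    = '';   # last 47 bits of the previous chunk
my $bit_base = 0;    # bit offset in the file where $bits below starts
while (read($fh, my $chunk, 1 << 20)) {
    my $bits = $carry . unpack('B*', $chunk);
    my $from = 0;
    while ((my $hit = index($bits, $magic_bits, $from)) >= 0) {
        print $bit_base + $hit, "\n";     # candidate block start, in bits
        $from = $hit + 1;
    }
    my $keep = length($magic_bits) - 1;
    $carry = length($bits) > $keep ? substr($bits, -$keep) : $bits;
    $bit_base += length($bits) - length($carry);
}
close $fh;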
Apparently the folks at openzim have done something similar, using LZMA2.
If anyone is interested in working with me to make a package capable of being released to the public, I'd be willing to share my code. But it sounds like I'm just reinventing a wheel already invented by openzim.
I'm interested in what everybody else is doing regarding offline Wikimedia content. I'm also mainly using Perl, though I just ran into a problem with 64-bit values when indexing huge dump files.
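One possible workaround, if the problem is that pack/unpack 'Q' needs a 64-bit Perl build, is to handle 8-byte offsets as two 32-bit halves; doubles are exact up to 2**53, which is plenty for file offsets. An untested sketch (and I'm only guessing this is the same 64-bit issue):

# Untested sketch: read and write 8-byte big-endian file offsets without
# relying on 64-bit integer support in Perl (pack 'Q').
use strict;
use warnings;

sub unpack_offset64 {
    my ($eight_bytes) = @_;
    my ($hi, $lo) = unpack('N N', $eight_bytes);
    return $hi * 4294967296 + $lo;               # exact for offsets < 2**53
}

sub pack_offset64 {
    my ($offset) = @_;
    my $hi = int($offset / 4294967296);
    my $lo = $offset - $hi * 4294967296;
    return pack('N N', $hi, $lo);
}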
It would be incredibly useful if these indices could be created as part of the dump creation process. Should I file a feature request?
With concatenated .xz files, creating the index is *much* faster, because the .xz format puts the stream size at the end of each stream. Plus with .xz all streams are broken on 4-byte boundaries, whereas with .bz2 blocks can end at any *bit* (which means you have to do painful bit shifting to create the index).
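Just to illustrate the alignment point (this is a simpler alternative to decoding each stream's index, not what my indexer actually does): the stream starts can be located by checking only 4-byte-aligned offsets for the six-byte .xz header magic. Untested sketch; a careful indexer would also verify each hit by parsing the header, since the magic could in theory appear inside compressed data:

#!/usr/bin/perl
# Untested sketch: list candidate stream start offsets in a file made of
# concatenated .xz streams.  Streams are padded to 4-byte boundaries, so only
# aligned offsets need checking -- no bit shifting as with bz2.
use strict;
use warnings;

my ($file) = @ARGV;
my $magic = "\xFD\x37\x7A\x58\x5A\x00";   # .xz stream header magic

open(my $fh, '<:raw', $file) or die "open $file: $!";
my $carry = '';   # unscanned tail of the previous chunk (start stays aligned)
my $base  = 0;    # file offset of the first byte of $carry
while (read($fh, my $chunk, 1 << 20)) {
    my $buf = $carry . $chunk;
    my $len = length $buf;
    my $i   = 0;
    for (; $i + 6 <= $len; $i += 4) {
        print $base + $i, "\n" if substr($buf, $i, 6) eq $magic;
    }
    $carry = substr($buf, $i);
    $base += $i;
}
close $fh;

Once an index exists, xz --list --verbose on the multi-stream file should give per-stream sizes to sanity-check it against.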
The file is also *much* smaller, on the order of 5-10% of the bzip2 size for a full-history dump.
Have we made the case for this format to the Wikimedia people? I think they use .bz2 because it is pretty fast while still giving very good compression ratios, and they use .7z for the full-history dumps, where the much better ratio justifies the slower compression time since those files can be gigantic.
How is .xz for compression times? Would we have to worry about patent issues for LZMA?
Andrew Dunbar (hippietrail)