On 15 December 2010 20:41, Anthony <wikimail(a)inbox.org> wrote:
On Wed, Dec 15, 2010 at 12:01 PM, Andrew Dunbar
<hippytrail(a)gmail.com> wrote:
By the way, I'm keen to find something similar for .7z.
I've written something similar for .xz, which uses LZMA2, the same as .7z.
It creates a virtual read-only filesystem using FUSE (the FUSE part is
in Perl, which uses pipes to dd and xzcat). The only real problem is that
it doesn't use a stock .xz file; it uses a specially created one which
concatenates lots of smaller .xz files (currently I concatenate
between 5 and 20 or so 900K bz2 blocks into one .xz stream; between 5
and 20 because there's a preference to split on </page><page>
boundaries).
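As a rough illustration of the concatenation idea (not the actual Perl/FUSE code, which isn't shown in this thread), here is a minimal Python sketch using the standard lzma module. The helper names build_archive and read_chunk and the tiny sample chunks are made up for the example; a real tool would keep the offsets in a separate index file.

```python
import lzma

def build_archive(chunks):
    """Compress each chunk as its own .xz stream and concatenate them.

    Returns the archive bytes plus a list of (offset, length) pairs,
    one per stream, recorded while writing.
    """
    archive = bytearray()
    index = []
    for chunk in chunks:
        stream = lzma.compress(chunk, format=lzma.FORMAT_XZ)
        index.append((len(archive), len(stream)))
        archive += stream
    return bytes(archive), index

def read_chunk(archive, index, n):
    """Decompress only the n-th stream, leaving the others untouched."""
    offset, length = index[n]
    return lzma.decompress(archive[offset:offset + length])

# Hypothetical sample data standing in for groups of dump pages.
chunks = [b"<page>one</page>", b"<page>two</page>", b"<page>three</page>"]
archive, index = build_archive(chunks)
```

The result is still a valid multi-stream .xz, so decompressing the whole archive in one go yields the original chunks back to back, while read_chunk(archive, index, 1) touches only the second stream.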
At the moment I'm interested in .bz2 and .7z because those are the
formats Wikimedia currently publishes data in. Some files are also
in .gz, though, so I would like to find a solution for those too.
I thought about the concatenation solution splitting at <page>
boundaries for .bz2 until I found out there was already a solution
that worked with the vanilla dump files as is.
Apparently the folks at openzim have done something similar, using LZMA2.
If anyone is interested in working with me to make a package capable
of being released to the public, I'd be willing to share my code. But
it sounds like I'm just reinventing a wheel already invented by
openzim.
I'm interested in what everybody else is doing regarding offline
Wikimedia content. I'm also mainly using Perl, though I just ran into a
problem with 64-bit values when indexing huge dump files.
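Presumably the 64-bit issue comes up because byte offsets into the larger dumps no longer fit in 32 bits. A small sketch of one way to store such an index, in Python rather than Perl; the record layout (64-bit offset plus 32-bit length) and the sample numbers are assumptions for illustration, not the format of any existing tool.

```python
import struct

# One index record: 64-bit unsigned offset, 32-bit unsigned length,
# little-endian with no padding (12 bytes per record).
RECORD = struct.Struct("<QI")

def pack_index(entries):
    """Serialise (offset, length) pairs into fixed-width records."""
    return b"".join(RECORD.pack(offset, length) for offset, length in entries)

def lookup(index_bytes, n):
    """Return the (offset, length) pair stored in record n."""
    return RECORD.unpack_from(index_bytes, n * RECORD.size)

# Made-up entries; the second offset is beyond the 4 GiB limit of 32 bits.
entries = [(0, 900_000), (5_000_000_000, 850_000)]
blob = pack_index(entries)
```

Fixed-width records mean entry n can be found with a single seek to n * 12, without reading the rest of the index.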
It would be incredibly useful if these indices could be created as
part of the dump creation process. Should I file a feature request?
With concatenated .xz files, creating the index is *much* faster,
because the .xz format puts the stream size at the end of each stream.
Plus, with .xz, all streams are broken on 4-byte boundaries, whereas
with .bz2, blocks can end at any *bit* (which means you have to do
painful bit shifting to create the index).
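To make the bit-shifting point concrete, here is a minimal Python sketch that finds bzip2 block starts by sliding a 48-bit window over the file one bit at a time, looking for the block header magic 0x314159265359. The function name and the random test data are assumptions for the example, and a pure-Python loop like this is far too slow for a real dump; it only shows why the offsets have to be tracked in bits.

```python
import bz2
import os

BLOCK_MAGIC = 0x314159265359   # 48-bit marker that starts every bzip2 block
MASK = (1 << 48) - 1

def block_bit_offsets(data):
    """Return the bit offset of every block-start marker in .bz2 bytes."""
    offsets = []
    window = 0
    for i, byte in enumerate(data):
        for bit in range(7, -1, -1):   # bzip2 packs bits most significant first
            window = ((window << 1) | ((byte >> bit) & 1)) & MASK
            if window == BLOCK_MAGIC:
                # The marker ends at this bit, so it began 47 bits earlier.
                offsets.append(i * 8 + (7 - bit) - 47)
    return offsets

# Incompressible sample data; compresslevel=1 uses 100K blocks, so
# 250,000 bytes should span more than one block.
raw = os.urandom(250_000)
compressed = bz2.compress(raw, compresslevel=1)
offsets = block_bit_offsets(compressed)
```

The first block always follows the 4-byte "BZh" header, so its marker sits at bit 32; later markers can land on any bit, which is why an index for .bz2 has to store bit offsets rather than byte offsets.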
The file is also *much* smaller, on the order of 5-10% of bzip2 for a
full history dump.
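On the earlier point about .xz keeping size information at the end of each stream: the 12-byte stream footer records the size of the stream's index, and the index lists the size of every block, so stream boundaries can be found by walking backwards from the end of the file without decompressing anything. A minimal Python sketch, assuming there is no padding between streams; the function names and sample data are made up for the example.

```python
import lzma
import struct

def read_vli(buf, pos):
    """Decode one XZ variable-length integer starting at buf[pos]."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7

def stream_spans(data):
    """Return (start, length) for each .xz stream, found from the footers."""
    spans = []
    end = len(data)
    while end > 0:
        footer = data[end - 12:end]
        if footer[10:12] != b"YZ":
            raise ValueError("no stream footer before offset %d" % end)
        # Backward Size field: index size is (stored value + 1) * 4 bytes.
        index_size = (struct.unpack("<I", footer[4:8])[0] + 1) * 4
        index_start = end - 12 - index_size
        if data[index_start] != 0:
            raise ValueError("missing index indicator")
        count, pos = read_vli(data, index_start + 1)
        blocks = 0
        for _ in range(count):
            unpadded, pos = read_vli(data, pos)
            _uncompressed, pos = read_vli(data, pos)
            blocks += (unpadded + 3) & ~3   # blocks are padded to 4 bytes
        start = index_start - blocks - 12    # 12-byte stream header
        spans.append((start, end - start))
        end = start
    spans.reverse()
    return spans

# Three independent streams of different sizes, concatenated.
streams = [lzma.compress(b"a" * n) for n in (10, 5000, 123456)]
spans = stream_spans(b"".join(streams))
```

Each returned span can be checked against the stream header magic (the bytes FD 37 7A 58 5A 00) at its start offset.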
Have we made the case for this format to the Wikimedia people? I think
they use .bz2 because it is pretty fast for very good compression
ratios, but they use .7z for the full history dumps, where the extremely
good compression ratios warrant the slower compression times, since
these files can be gigantic.
How is .xz for compression times? Would we have to worry about patent
issues for LZMA?
Andrew Dunbar (hippietrail)
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l