Hi Jamie,

Looks cool! Thanks for the link. It seems to serve a different purpose, though: it looks like one can keep the data compressed and access it directly in the archive. That was never my objective. The setup I described is optimized for a single pass through the data, which is perfect if you need to extract certain elements and do not need repeated or random reads.
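
In case it is useful to anyone, here is a minimal sketch of that kind of one-pass extraction (not my actual code, and the names are just placeholders): decompress to stdout with something like "7z e -so dump.xml.7z | ./titles" and let a SAX handler pull out whatever elements you need - here, just the page titles.

// titles.cpp - sketch only: print <title> elements from a MediaWiki dump
// streamed on stdin, so the archive never has to be unpacked to disk.
// Build (assuming Xerces-C is installed): g++ titles.cpp -lxerces-c -o titles
#include <xercesc/util/PlatformUtils.hpp>
#include <xercesc/util/XMLString.hpp>
#include <xercesc/sax2/SAX2XMLReader.hpp>
#include <xercesc/sax2/XMLReaderFactory.hpp>
#include <xercesc/sax2/DefaultHandler.hpp>
#include <xercesc/framework/StdInInputSource.hpp>
#include <iostream>
#include <string>

using namespace xercesc;

class TitleHandler : public DefaultHandler {
    bool inTitle;
    std::string title;
public:
    TitleHandler() : inTitle(false) {}
    void startElement(const XMLCh* const, const XMLCh* const localname,
                      const XMLCh* const, const Attributes&) {
        char* name = XMLString::transcode(localname);
        inTitle = (std::string(name) == "title");
        if (inTitle) title.clear();
        XMLString::release(&name);
    }
    void characters(const XMLCh* const chars, const XMLSize_t length) {
        if (!inTitle) return;
        for (XMLSize_t i = 0; i < length; ++i)   // naive narrowing, fine for ASCII titles only
            title += static_cast<char>(chars[i] < 128 ? chars[i] : '?');
    }
    void endElement(const XMLCh* const, const XMLCh* const localname,
                    const XMLCh* const) {
        char* name = XMLString::transcode(localname);
        if (inTitle && std::string(name) == "title")
            std::cout << title << '\n';
        inTitle = false;
        XMLString::release(&name);
    }
};

int main() {
    XMLPlatformUtils::Initialize();
    {
        SAX2XMLReader* parser = XMLReaderFactory::createXMLReader();
        TitleHandler handler;
        parser->setContentHandler(&handler);
        parser->setErrorHandler(&handler);
        StdInInputSource stdinSource;   // read the XML stream from stdin
        parser->parse(stdinSource);
        delete parser;
    }
    XMLPlatformUtils::Terminate();
    return 0;
}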

Lev

On Tue, Mar 16, 2010 at 7:13 PM, Jamie Morken <jmorken@shaw.ca> wrote:
hi,

I wonder how the ZIM file format (http://www.openzim.org/Main_Page)
would compare to the 7-zip file in terms of size and access speed?


cheers,
Jamie


----- Original Message -----
From: Lev Muchnik <levmuchnik@gmail.com>
Date: Tuesday, March 16, 2010 2:36 pm
Subject: Re: [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D
To: Jamie Morken <jmorken@shaw.ca>
Cc: xmldatadumps-l@lists.wikimedia.org

>  The LZMA SDK <http://www.7-zip.org/sdk.html> provides a C-style API.
> The only problem I find is that it requires polling - recurrent calls
> to extract pieces of data. So, I wrapped it with a C++ stream which I
> feed to the xerces-c SAX XML parser. SAX is really fun to use, and the
> speed is amazing (3 days to process all languages except English).
>
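
To make the "C++ stream" part above concrete, the shape of that wrapper is roughly the following. This is only a sketch, not my actual code: next_chunk() is a hypothetical stand-in for the LZMA SDK polling calls, and only the streambuf plumbing is the point.

// Sketch of a std::streambuf that refills itself by polling a decoder,
// so the decompressed bytes can be consumed as an ordinary std::istream
// (for example, handed to an XML parser). next_chunk() is hypothetical.
#include <streambuf>
#include <vector>
#include <cstddef>

class DecoderBuf : public std::streambuf {
public:
    // next_chunk(dst, max) should fill dst with up to max decompressed
    // bytes and return how many it produced; 0 means end of stream.
    typedef std::size_t (*ChunkFn)(char* dst, std::size_t max);

    explicit DecoderBuf(ChunkFn next_chunk, std::size_t bufSize = 1 << 16)
        : next_chunk_(next_chunk), buf_(bufSize) {
        setg(&buf_[0], &buf_[0], &buf_[0]);      // start with an empty get area
    }

protected:
    int_type underflow() {
        if (gptr() < egptr())                    // data still buffered
            return traits_type::to_int_type(*gptr());
        std::size_t n = next_chunk_(&buf_[0], buf_.size());  // poll the decoder again
        if (n == 0)
            return traits_type::eof();
        setg(&buf_[0], &buf_[0], &buf_[0] + n);
        return traits_type::to_int_type(*gptr());
    }

private:
    ChunkFn next_chunk_;
    std::vector<char> buf_;
};

// Usage: DecoderBuf buf(&my_lzma_next_chunk);  std::istream xml(&buf);
// and then feed 'xml' to whatever SAX front end you use.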
> On Tue, Mar 16, 2010 at 6:17 PM, Jamie Morken
> <jmorken@shaw.ca> wrote:
>
> >
> > Hi,
> >
> > Is this code available to process the 7zip data on the fly?  I had
> > heard a rumour before that 7zip required multiple passes to
> > decompress.
> >
> > cheers,
> > Jamie
> >
> >
> >
> > ----- Original Message -----
> > From: Lev Muchnik <levmuchnik@gmail.com>
> > Date: Tuesday, March 16, 2010 1:55 pm
> > Subject: Re: [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki
> > Checksumming pages-meta-history.xml.bz2 :D
> > To: Tomasz Finc <tfinc@wikimedia.org>
> > Cc: Wikimedia developers <wikitech-l@lists.wikimedia.org>,
> > xmldatadumps-admin-l@lists.wikimedia.org,
> > Xmldatadumps-l@lists.wikimedia.org
> >
> > > I am entirely for 7z. In fact, once released, I'll be able to
> > > test the XML
> > > integrity right away - I process the data on the fly,
> > > without  unpacking it
> > > first.
> > >
> > >
> > > On Tue, Mar 16, 2010 at 4:45 PM, Tomasz Finc
> > > <tfinc@wikimedia.org> wrote:
> > >
> > > > Kevin Webb wrote:
> > > > > I just managed to finish decompression. That took about 54
> > > > > hours on an EC2 2.5x unit CPU. The final data size is 5469GB.
> > > > >
> > > > > As the process just finished I haven't been able to check the
> > > > > integrity of the XML, however, the bzip stream itself appears
> > > > > to be good.
> > > > >
> > > > > As was mentioned previously, it would be great if you could
> > > > > compress future archives using pbzip2 to allow for parallel
> > > > > decompression. As I understand it, pbzip2 files are backward
> > > > > compatible with all existing bzip2 utilities.
> > > >
> > > > Looks like the trade-off is slightly larger files due to pbzip2's
> > > > algorithm for individual chunking. We'd have to change the
> > > > buildFilters function in http://tinyurl.com/yjun6n5 and install
> > > > the new binary. Ubuntu already has it in 8.04 LTS, making it easy.
> > > >
> > > > Any takers for the change?
> > > >
> > > > I'd also like to gauge everyone's opinion on moving away from the
> > > > large file sizes of bz2 and going exclusively to 7z. We'd save a
> > > > huge amount of space doing it, at a slightly larger cost during
> > > > compression. Decompression of 7z these days is wicked fast.
> > > >
> > > > Let me know.
> > > >
> > > > --tomasz
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > _______________________________________________
> > > > Xmldatadumps-admin-l mailing list
> > > > Xmldatadumps-admin-l@lists.wikimedia.org
> > > > https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-admin-l
> > > >
> > >
> >
>