[Xmldatadumps-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D

Lev Muchnik levmuchnik at gmail.com
Tue Mar 16 22:36:11 UTC 2010


The LZMA SDK <http://www.7-zip.org/sdk.html> provides a C-style API. The
only problem I find is that it requires polling: repeated calls to extract
pieces of data. So I wrapped it in a C++ stream, which I feed to the
xerces-c SAX XML parser. SAX is really fun to use, and the speed is amazing
(3 days to process all languages except English).
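
To illustrate the pattern described above (decompress in pieces, hand each piece straight to a SAX parser), here is a minimal sketch. It is not the C++/xerces-c code: it swaps in Python's standard lzma and xml.sax modules, and it assumes an .xz or .lzma stream, since the standard lzma module does not read .7z archives. The names PageCounter and count_pages and the 64 KB chunk size are made up for the example.

```python
import io
import lzma
import xml.sax


class PageCounter(xml.sax.ContentHandler):
    """Hypothetical handler: counts <page> and <revision> elements as they stream past."""

    def __init__(self):
        super().__init__()
        self.pages = 0
        self.revisions = 0

    def startElement(self, name, attrs):
        if name == "page":
            self.pages += 1
        elif name == "revision":
            self.revisions += 1


def count_pages(compressed, chunk_size=64 * 1024):
    """Read a compressed binary file object in chunks and feed each decoded piece to SAX.

    Assumes .xz/.lzma input; nothing is ever written to disk uncompressed.
    """
    decompressor = lzma.LZMADecompressor()
    handler = PageCounter()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)

    # The "repeated calls" loop: pull a compressed chunk, decode it, pass it on.
    while not decompressor.eof:
        chunk = compressed.read(chunk_size)
        if not chunk:
            raise EOFError("compressed stream ended before the end-of-stream marker")
        data = decompressor.decompress(chunk)
        if data:
            parser.feed(data)

    # close() raises SAXParseException if the document is not well formed,
    # which doubles as an XML integrity check.
    parser.close()
    return handler.pages, handler.revisions


if __name__ == "__main__":
    sample = (b"<mediawiki><page><revision/><revision/></page>"
              b"<page><revision/></page></mediawiki>")
    print(count_pages(io.BytesIO(lzma.compress(sample)), chunk_size=16))
```

For a real dump one would pass an open binary file instead of the in-memory sample, e.g. count_pages(open(path, "rb")), where path is whatever file is being checked.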

On Tue, Mar 16, 2010 at 6:17 PM, Jamie Morken <jmorken at shaw.ca> wrote:

>
> Hi,
>
> Is this code available to process the 7zip data on the fly?  I had heard a
> rumour before that 7zip required multiple passes to decompress.
>
> cheers,
> Jamie
>
>
>
> ----- Original Message -----
> From: Lev Muchnik <levmuchnik at gmail.com>
> Date: Tuesday, March 16, 2010 1:55 pm
> Subject: Re: [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki
> Checksumming pages-meta-history.xml.bz2 :D
> To: Tomasz Finc <tfinc at wikimedia.org>
> Cc: Wikimedia developers <wikitech-l at lists.wikimedia.org>,
> xmldatadumps-admin-l at lists.wikimedia.org,
> Xmldatadumps-l at lists.wikimedia.org
>
> > I am entirely for 7z. In fact, once released, I'll be able to test the XML
> > integrity right away - I process the data on the fly, without unpacking it
> > first.
> >
> >
> > On Tue, Mar 16, 2010 at 4:45 PM, Tomasz Finc <tfinc at wikimedia.org> wrote:
> >
> > > Kevin Webb wrote:
> > > > I just managed to finish decompression. That took about 54 hours on an
> > > > EC2 2.5x unit CPU. The final data size is 5469 GB.
> > > >
> > > > As the process just finished I haven't been able to check the
> > > > integrity of the XML; however, the bzip2 stream itself appears to be
> > > > good.
> > > >
> > > > As was mentioned previously, it would be great if you could compress
> > > > future archives using pbzip2 to allow for parallel decompression. As I
> > > > understand it, the pbzip2 files are backward compatible with all
> > > > existing bzip2 utilities.
> > >
> > > Looks like the trade-off is slightly larger files due to pbzip2's
> > > algorithm for individual chunking. We'd have to change the
> > > buildFilters function in http://tinyurl.com/yjun6n5 and install the new
> > > binary. Ubuntu already has it in 8.04 LTS, making it easy.
> > >
> > > Any takers for the change?
> > >
> > > I'd also like to gauge everyone's opinion on moving away from the large
> > > file sizes of bz2 and going exclusively 7z. We'd save a huge amount of
> > > space doing it at a slightly larger cost during compression.
> > > Decompression of 7z these days is wicked fast.
> > >
> > > Let me know.
> > >
> > > --tomasz