--- On Tue, 3/16/10, Lev Muchnik <levmuchnik@gmail.com> wrote:

From: Lev Muchnik <levmuchnik@gmail.com>
Subject: Re: [Xmldatadumps-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D
To: "Jamie Morken" <jmorken@shaw.ca>
CC: xmldatadumps-l@lists.wikimedia.org
Date: Tuesday, March 16, 2010, 23:36


The LZMA SDK provides a C-style API. The only problem I find is that it requires polling - recurrent calls to extract pieces of data. So I wrapped it with a C++ stream which I feed to the xerces-c SAX XML parser. SAX is really fun to use, and the speed is amazing (3 days to process all languages except English).
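
(For readers who want to try the same pull-then-feed pattern, here is a minimal sketch in Python rather than C++. It is an illustration under stated assumptions: Python's lzma module reads .xz / raw LZMA streams, not the .7z container, so the .xz file name and the <page> tag below are hypothetical.)

# Pull compressed chunks, decompress incrementally, and push the XML into a
# streaming SAX parser - nothing is ever fully unpacked on disk.
# NOTE: the file name and <page> element are illustrative assumptions only.
import lzma
import xml.sax

class PageCounter(xml.sax.ContentHandler):
    """Count <page> elements as they stream past."""
    def __init__(self):
        self.pages = 0
    def startElement(self, name, attrs):
        if name == "page":
            self.pages += 1

def stream_parse(path, chunk_size=1 << 20):
    parser = xml.sax.make_parser()      # expat-based, supports feed()
    handler = PageCounter()
    parser.setContentHandler(handler)
    decomp = lzma.LZMADecompressor()
    with open(path, "rb") as f:
        while True:
            raw = f.read(chunk_size)             # pull a compressed chunk
            if not raw:
                break
            parser.feed(decomp.decompress(raw))  # push decompressed XML to SAX
    parser.close()
    return handler.pages

if __name__ == "__main__":
    print(stream_parse("eswiki-pages-meta-history.xml.xz"))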

On Tue, Mar 16, 2010 at 6:17 PM, Jamie Morken <jmorken@shaw.ca> wrote:

Hi,

Is this code available to process the 7zip data on the fly?  I had heard a rumour before that 7zip required multiple passes to decompress.

No, the WikiXRay parser is also able to decompress it on the fly and store it in a local MySQL database (either the complete dump or just the metadata). It's also based on SAX, but coded in Python rather than C++.
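
(A rough illustration of this on-the-fly approach - not WikiXRay's actual code: pipe the 7za command-line tool into Python's SAX module and keep only revision metadata. The file name and tag names are assumptions, and a real importer would write the rows to MySQL rather than hold them in memory.)

# Hypothetical sketch: decompress a .7z dump on the fly with 7za and stream
# it through SAX, collecting (title, timestamp) metadata. Requires 7za on PATH.
import subprocess
import xml.sax

class RevisionMeta(xml.sax.ContentHandler):
    """Collect (page title, revision timestamp) pairs as they stream by."""
    def __init__(self):
        self.current = None
        self.buffer = []
        self.title = None
        self.rows = []          # a WikiXRay-style importer would INSERT into MySQL here
    def startElement(self, name, attrs):
        if name in ("title", "timestamp"):
            self.current = name
            self.buffer = []
    def characters(self, content):
        if self.current:
            self.buffer.append(content)
    def endElement(self, name):
        if name == "title":
            self.title = "".join(self.buffer)
        elif name == "timestamp":
            self.rows.append((self.title, "".join(self.buffer)))
        self.current = None

proc = subprocess.Popen(["7za", "e", "-so", "eswiki-pages-meta-history.xml.7z"],
                        stdout=subprocess.PIPE)
handler = RevisionMeta()
xml.sax.parse(proc.stdout, handler)   # SAX reads the pipe; nothing hits disk
proc.wait()
print(len(handler.rows), "revisions seen")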

Best,
F.

cheers,
Jamie



----- Original Message -----
From: Lev Muchnik <levmuchnik@gmail.com>
Date: Tuesday, March 16, 2010 1:55 pm
Subject: Re: [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D
To: Tomasz Finc <tfinc@wikimedia.org>
Cc: Wikimedia developers <wikitech-l@lists.wikimedia.org>, xmldatadumps-admin-l@lists.wikimedia.org, Xmldatadumps-l@lists.wikimedia.org

> I am entirely for 7z. In fact, once released, I'll be able to test the XML
> integrity right away - I process the data on the fly, without unpacking it
> first.
>
>
> On Tue, Mar 16, 2010 at 4:45 PM, Tomasz Finc
> <tfinc@wikimedia.org> wrote:
>
> > Kevin Webb wrote:
> > > I just managed to finish decompression. That took about 54 hours on an
> > > EC2 2.5x unit CPU. The final data size is 5469GB.
> > >
> > > As the process just finished I haven't been able to check the
> > > integrity of the XML; however, the bzip stream itself appears to be
> > > good.
> > >
> > > As was mentioned previously, it would be great if you could compress
> > > future archives using pbzip2 to allow for parallel decompression. As I
> > > understand it, the pbzip2 files are backward compatible with all
> > > existing bzip2 utilities.
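
(A hedged sketch of the compatibility point just quoted: pbzip2 writes a series of independent bzip2 streams, and ordinary bzip2 readers - here Python 3's bz2 module, which handles multi-stream files - read them back transparently. The file name is hypothetical.)

# Concatenated single-member streams, as pbzip2 would produce per chunk.
import bz2

with open("multi.bz2", "wb") as out:
    for chunk in (b"first chunk\n", b"second chunk\n"):
        out.write(bz2.compress(chunk))

# A plain bzip2 reader sees one continuous file.
with bz2.open("multi.bz2", "rb") as f:
    print(f.read())    # b'first chunk\nsecond chunk\n'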
> >
> > Looks like the trade-off is slightly larger files due to pbzip2's
> > algorithm for individual chunking. We'd have to change the buildFilters
> > function in http://tinyurl.com/yjun6n5 and install the new binary.
> > Ubuntu already has it in 8.04 LTS, making it easy.
> >
> > Any takers for the change?
> >
> > I'd also like to gauge everyone's opinion on moving away from the large
> > file sizes of bz2 and going exclusively 7z. We'd save a huge amount of
> > space doing it, at a slightly larger cost during compression.
> > Decompression of 7z these days is wicked fast.
> >
> > Let me know.
> >
> > --tomasz
> >
> >
> > _______________________________________________
> > Xmldatadumps-admin-l mailing list
> > Xmldatadumps-admin-l@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-admin-l
> >
>


-----Inline attachment follows-----

_______________________________________________
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l