--- On Tue, 3/16/10, Lev Muchnik <levmuchnik(a)gmail.com> wrote:
From: Lev Muchnik <levmuchnik(a)gmail.com>
Asunto: Re: [Xmldatadumps-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki
Checksumming pages-meta-history.xml.bz2 :D
To: "Jamie Morken" <jmorken(a)shaw.ca>
CC: xmldatadumps-l(a)lists.wikimedia.org
Date: Tuesday, March 16, 2010 23:36
The LZMA SDK provides a C-style API. The only problem I find is that it requires polling -
recurrent calls to extract pieces of data. So, I wrapped it with a C++ stream which I
feed to a xerces-c SAX XML parser. SAX is really fun to use, and the speed is amazing (3 days to
process all languages except English).
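
For illustration (not Lev's actual code, which uses the LZMA SDK and xerces-c in C++), a minimal Python sketch of the same on-the-fly idea is shown below: the 7z decompressor's stdout is piped straight into a SAX parser, so the dump is never unpacked to disk. The filename, the element names, and the use of the 7z command-line tool are assumptions for the example.

    # Sketch: stream a 7z-compressed dump into a SAX parser without unpacking it.
    # Assumes the 7z command-line tool is installed and the dump contains
    # <page>/<title> elements (as in MediaWiki XML dumps).
    import subprocess
    import xml.sax

    class PageCounter(xml.sax.ContentHandler):
        """Counts <page> elements and remembers the last <title> seen."""
        def __init__(self):
            super().__init__()
            self.pages = 0
            self.in_title = False
            self.last_title = ""

        def startElement(self, name, attrs):
            if name == "page":
                self.pages += 1
            elif name == "title":
                self.in_title = True
                self.last_title = ""

        def characters(self, content):
            if self.in_title:
                self.last_title += content

        def endElement(self, name):
            if name == "title":
                self.in_title = False

    def process_dump(path="pages-meta-history.xml.7z"):
        # '7z e -so' extracts to stdout, so the decompressed XML is consumed
        # as a stream and never written to disk.
        proc = subprocess.Popen(["7z", "e", "-so", path], stdout=subprocess.PIPE)
        handler = PageCounter()
        xml.sax.parse(proc.stdout, handler)
        proc.wait()
        print(handler.pages, "pages, last title:", handler.last_title)

    if __name__ == "__main__":
        process_dump()
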
On Tue, Mar 16, 2010 at 6:17 PM, Jamie Morken <jmorken(a)shaw.ca> wrote:
Hi,
Is this code available to process the 7zip data on the fly? I had heard a rumour before
that 7zip required multiple passes to decompress.
No, the WikiXRay parser is also able to decompress it on the fly and store it in a local MySQL
database (either the complete dump or just the metadata). It's also based on SAX, but coded in
Python rather than C++.
Best,
F.
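
As a rough illustration of that approach (not WikiXRay's actual code), a SAX handler can push per-revision metadata into MySQL as it streams through the dump. The table name, columns, connection settings, and input filename below are all hypothetical.

    # Sketch: stream revision metadata into MySQL while SAX-parsing a dump.
    # Not WikiXRay code; the 'revision_meta' table and credentials are made up.
    import xml.sax
    import mysql.connector

    class RevisionMetaHandler(xml.sax.ContentHandler):
        def __init__(self, cursor):
            super().__init__()
            self.cursor = cursor
            self.path = []      # stack of currently open element names
            self.buffer = ""
            self.rev = {}

        def startElement(self, name, attrs):
            self.path.append(name)
            self.buffer = ""
            if name == "revision":
                self.rev = {}

        def characters(self, content):
            self.buffer += content

        def endElement(self, name):
            # Only take <id>/<timestamp> that are direct children of <revision>.
            if name in ("id", "timestamp") and len(self.path) >= 2 and self.path[-2] == "revision":
                self.rev[name] = self.buffer.strip()
            elif name == "revision":
                self.cursor.execute(
                    "INSERT INTO revision_meta (rev_id, rev_timestamp) VALUES (%s, %s)",
                    (self.rev.get("id"), self.rev.get("timestamp")),
                )
            self.path.pop()

    conn = mysql.connector.connect(user="wiki", password="secret", database="dumps")
    cur = conn.cursor()
    with open("eswiki-pages-meta-history.xml", "rb") as f:  # or a decompressor's stdout
        xml.sax.parse(f, RevisionMetaHandler(cur))
    conn.commit()
    conn.close()
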
cheers,
Jamie
----- Original Message -----
From: Lev Muchnik <levmuchnik(a)gmail.com>
Date: Tuesday, March 16, 2010 1:55 pm
Subject: Re: [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming
pages-meta-history.xml.bz2 :D
To: Tomasz Finc <tfinc(a)wikimedia.org>
Cc: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>,
xmldatadumps-admin-l(a)lists.wikimedia.org, Xmldatadumps-l(a)lists.wikimedia.org
I am entirely for 7z. In fact, once released, I'll be able to test the XML integrity
right away - I process the data on the fly, without unpacking it first.
On Tue, Mar 16, 2010 at 4:45 PM, Tomasz Finc <tfinc(a)wikimedia.org> wrote:
> Kevin Webb wrote:
> > I just managed to finish decompression. That took about 54 hours on an
> > EC2 2.5x unit CPU. The final data size is 5469GB.
> >
> > As the process just finished I haven't been able to check the
> > integrity of the XML; however, the bzip stream itself appears to be good.
> >
> > As was mentioned previously, it would be great if you could compress
> > future archives using pbzip2 to allow for parallel decompression. As I
> > understand it, the pbzip2 files are backward compatible with all
> > existing bzip2 utilities.
>
> Looks like the trade-off is slightly larger files due to pbzip2's
> algorithm for individual chunking. We'd have to change the
> buildFilters function in http://tinyurl.com/yjun6n5 and install the
> new binary. Ubuntu already has it in 8.04 LTS making it easy.
> Any takers for the change?
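
Regarding the compatibility point quoted above: pbzip2 writes a series of concatenated bzip2 streams, and standard tooling reads them transparently (Python's bz2 module, for instance, handles multi-stream files). A minimal sketch, with the filename assumed for illustration:

    # Sketch: read a pbzip2-produced dump with only the standard library.
    # bz2.open handles multi-stream .bz2 files, so pbzip2 output needs no
    # special handling. The filename is just an example.
    import bz2

    pages = 0
    with bz2.open("enwiki-pages-meta-history.xml.bz2", "rt", encoding="utf-8") as f:
        for line in f:
            if "<page>" in line:
                pages += 1
    print(pages, "pages")
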
> I'd also like to gauge everyone's opinion on moving away from the large
> file sizes of bz2 and going exclusively 7z. We'd save a huge amount of
> space doing it at a slightly larger cost during compression.
> Decompression of 7z these days is wicked fast.
>
> let me know
>
> --tomasz
-----Inline attachment follows-----
_______________________________________________
Xmldatadumps-l mailing list
Xmldatadumps-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l