The LZMA SDK (http://www.7-zip.org/sdk.html) provides a C-style API. The only problem I find is that it requires polling - recurrent calls to extract pieces of data. So I wrapped it with a C++ stream, which I feed to the xerces-c SAX XML parser. SAX is really fun to use, and the speed is amazing (3 days to process all languages except English).
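In case it helps anyone who wants to do the same kind of on-the-fly processing, below is a rough sketch of the pattern - a minimal example, not the exact code I use. It defines a Xerces-C BinInputStream/InputSource pair so the SAX parser drives the polling loop and pulls decompressed bytes on demand. The decompressChunk() function is a hypothetical placeholder for the recurrent LZMA SDK calls; the stub here just reads already-decompressed XML from stdin so the sketch compiles and runs.

#include <cstdio>
#include <xercesc/util/PlatformUtils.hpp>
#include <xercesc/util/BinInputStream.hpp>
#include <xercesc/sax/InputSource.hpp>
#include <xercesc/sax2/SAX2XMLReader.hpp>
#include <xercesc/sax2/XMLReaderFactory.hpp>
#include <xercesc/sax2/DefaultHandler.hpp>

using namespace xercesc;

// Hypothetical placeholder for the recurrent decompression calls: fill 'dest'
// with up to 'maxLen' decompressed bytes and return the count (0 at end of
// stream). This stub reads pre-decompressed XML from stdin.
static XMLSize_t decompressChunk(XMLByte* dest, XMLSize_t maxLen)
{
    return static_cast<XMLSize_t>(std::fread(dest, 1, maxLen, stdin));
}

// Stream that Xerces polls for bytes; the decompressor is only ever asked for
// the next chunk, so the whole dump never has to sit in memory or on disk.
class LzmaBinInputStream : public BinInputStream
{
public:
    LzmaBinInputStream() : pos_(0) {}
    XMLFilePos curPos() const override { return pos_; }
    const XMLCh* getContentType() const override { return 0; }
    XMLSize_t readBytes(XMLByte* toFill, XMLSize_t maxToRead) override
    {
        XMLSize_t n = decompressChunk(toFill, maxToRead);
        pos_ += n;
        return n;
    }
private:
    XMLFilePos pos_;
};

class LzmaInputSource : public InputSource
{
public:
    BinInputStream* makeStream() const override
    {
        return new LzmaBinInputStream(); // Xerces takes ownership
    }
};

int main()
{
    XMLPlatformUtils::Initialize();
    {
        SAX2XMLReader* parser = XMLReaderFactory::createXMLReader();
        DefaultHandler handler;            // replace with a real ContentHandler
        parser->setContentHandler(&handler);
        parser->setErrorHandler(&handler);
        LzmaInputSource source;
        parser->parse(source);             // SAX callbacks fire as bytes arrive
        delete parser;
    }
    XMLPlatformUtils::Terminate();
    return 0;
}

Whether the buffering lives in a std::streambuf or in a BinInputStream like this is an implementation detail; the point is that the parser pulls, and the decoder only has to produce the next chunk on each call.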
On Tue, Mar 16, 2010 at 6:17 PM, Jamie Morken <jmorken@shaw.ca> wrote:
Hi,

Is this code available to process the 7zip data on the fly? I had heard a rumour before that 7zip required multiple passes to decompress.

cheers,
Jamie
----- Original Message -----
From: Lev Muchnik <levmuchnik@gmail.com>
Date: Tuesday, March 16, 2010 1:55 pm
Subject: Re: [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D
To: Tomasz Finc <tfinc@wikimedia.org>
Cc: Wikimedia developers <wikitech-l@lists.wikimedia.org>, xmldatadumps-admin-l@lists.wikimedia.org, Xmldatadumps-l@lists.wikimedia.org
> I am entirely for 7z. In fact, once released, I'll be able to test the XML
> integrity right away - I process the data on the fly, without unpacking it
> first.
>
> On Tue, Mar 16, 2010 at 4:45 PM, Tomasz Finc <tfinc@wikimedia.org> wrote:
>
> > Kevin Webb wrote:
> > > I just managed to finish decompression. That took about 54 hours on an
> > > EC2 2.5x unit CPU. The final data size is 5469GB.
> > >
> > > As the process just finished I haven't been able to check the
> > > integrity of the XML, however, the bzip stream itself appears to be
> > > good.
> > >
> > > As was mentioned previously, it would be great if you could compress
> > > future archives using pbzip2 to allow for parallel decompression. As I
> > > understand it, the pbzip2 files are backward compatible with all
> > > existing bzip2 utilities.
> >
> > Looks like the trade-off is slightly larger files due to pbzip2's
> > algorithm for individual chunking. We'd have to change the
> > buildFilters function in http://tinyurl.com/yjun6n5 and install the new
> > binary. Ubuntu already has it in 8.04 LTS, making it easy.
> >
> > Any takers for the change?
> >
> > I'd also like to gauge everyone's opinion on moving away from the large
> > file sizes of bz2 and going exclusively 7z. We'd save a huge amount of
> > space doing it at a slightly larger cost during compression.
> > Decompression of 7z these days is wicked fast.
> >
> > let know
> >
> > --tomasz
> >
> > _______________________________________________
> > Xmldatadumps-admin-l mailing list
> > Xmldatadumps-admin-l@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-admin-l
>