--- On Tue, 3/16/10, Lev Muchnik <levmuchnik(a)gmail.com> wrote:
From: Lev Muchnik <levmuchnik(a)gmail.com>
Asunto: Re: [Xmldatadumps-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki
Checksumming pages-meta-history.xml.bz2 :D
To: "Jamie Morken" <jmorken(a)shaw.ca>
CC: xmldatadumps-l(a)lists.wikimedia.org
Date: Tuesday, March 16, 2010 23:36
The LZMA SDK provides a C-style API. The only problem I find is that it requires polling -
recurrent calls to extract pieces of data. So, I wrapped it with a C++ stream which I
feed to a xerces-c SAX XML parser. SAX is really fun to use, and the speed is amazing (3 days to
process all languages except English).
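
For illustration (not Lev's actual code, which uses the LZMA SDK and xerces-c in C++), a minimal Python sketch of the same on-the-fly idea is shown below: the 7z decompressor's stdout is piped straight into a SAX parser, so the dump is never unpacked to disk. The filename, the element names, and the use of the 7z command-line tool are assumptions for the example.

    # Sketch: stream a 7z-compressed dump into a SAX parser without unpacking it.
    # Assumes the 7z command-line tool is installed and the dump contains
    # <page>/<title> elements (as in MediaWiki XML dumps).
    import subprocess
    import xml.sax

    class PageCounter(xml.sax.ContentHandler):
        """Counts <page> elements and remembers the last <title> seen."""
        def __init__(self):
            super().__init__()
            self.pages = 0
            self.in_title = False
            self.last_title = ""

        def startElement(self, name, attrs):
            if name == "page":
                self.pages += 1
            elif name == "title":
                self.in_title = True
                self.last_title = ""

        def characters(self, content):
            if self.in_title:
                self.last_title += content

        def endElement(self, name):
            if name == "title":
                self.in_title = False

    def process_dump(path="pages-meta-history.xml.7z"):
        # '7z e -so' extracts to stdout, so the decompressed XML is consumed
        # as a stream and never written to disk.
        proc = subprocess.Popen(["7z", "e", "-so", path], stdout=subprocess.PIPE)
        handler = PageCounter()
        xml.sax.parse(proc.stdout, handler)
        proc.wait()
        print(handler.pages, "pages, last title:", handler.last_title)

    if __name__ == "__main__":
        process_dump()
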
On Tue, Mar 16, 2010 at 6:17 PM, Jamie Morken <jmorken(a)shaw.ca> wrote:
Hi,
Is this code available to process the 7zip data on the fly? I had heard a rumour before
that 7zip required multiple passes to decompress.
No, the WikiXRay parser is also able to decompress it on the fly and store it in a local MySQL
database (either the complete dump or just the metadata). It's also based on SAX, but coded in
Python rather than C++.
Best,
F.
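
As a rough illustration of that approach (not WikiXRay's actual code), a SAX handler can push per-revision metadata into MySQL as it streams through the dump. The table name, columns, connection settings, and input filename below are all hypothetical.

    # Sketch: stream revision metadata into MySQL while SAX-parsing a dump.
    # Not WikiXRay code; the 'revision_meta' table and credentials are made up.
    import xml.sax
    import mysql.connector

    class RevisionMetaHandler(xml.sax.ContentHandler):
        def __init__(self, cursor):
            super().__init__()
            self.cursor = cursor
            self.path = []      # stack of currently open element names
            self.buffer = ""
            self.rev = {}

        def startElement(self, name, attrs):
            self.path.append(name)
            self.buffer = ""
            if name == "revision":
                self.rev = {}

        def characters(self, content):
            self.buffer += content

        def endElement(self, name):
            # Only take <id>/<timestamp> that are direct children of <revision>.
            if name in ("id", "timestamp") and len(self.path) >= 2 and self.path[-2] == "revision":
                self.rev[name] = self.buffer.strip()
            elif name == "revision":
                self.cursor.execute(
                    "INSERT INTO revision_meta (rev_id, rev_timestamp) VALUES (%s, %s)",
                    (self.rev.get("id"), self.rev.get("timestamp")),
                )
            self.path.pop()

    conn = mysql.connector.connect(user="wiki", password="secret", database="dumps")
    cur = conn.cursor()
    with open("eswiki-pages-meta-history.xml", "rb") as f:  # or a decompressor's stdout
        xml.sax.parse(f, RevisionMetaHandler(cur))
    conn.commit()
    conn.close()
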
cheers,
Jamie
----- Original Message -----
From: Lev Muchnik <levmuchnik(a)gmail.com>
Date: Tuesday, March 16, 2010 1:55 pm
Subject: Re: [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming
pages-meta-history.xml.bz2 :D
To: Tomasz Finc <tfinc(a)wikimedia.org>
Cc: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>,
xmldatadumps-admin-l(a)lists.wikimedia.org, Xmldatadumps-l(a)lists.wikimedia.org
I am entirely for 7z. In fact, once released, I'll be able to test the XML integrity
right away - I process the data on the fly, without unpacking it first.
On Tue, Mar 16, 2010 at 4:45 PM, Tomasz Finc <tfinc(a)wikimedia.org> wrote:
> Kevin Webb wrote:
> > I just managed to finish decompression. That took about 54 hours on an
> > EC2 2.5x unit CPU. The final data size is 5469GB.
> >
> > As the process just finished I haven't been able to check the
> > integrity of the XML; however, the bzip stream itself appears to be good.
> >
> > As was mentioned previously, it would be great if you could compress
> > future archives using pbzip2 to allow for parallel decompression. As I
> > understand it, the pbzip2 files are backward compatible with all
> > existing bzip2 utilities.
>
> Looks like the trade-off is slightly larger files due to pbzip2's
> algorithm for individual chunking. We'd have to change the
> buildFilters function in http://tinyurl.com/yjun6n5 and install the
> new binary. Ubuntu already has it in 8.04 LTS making it easy.
> Any takers for the change?
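
Regarding the compatibility point quoted above: pbzip2 writes a series of concatenated bzip2 streams, and standard tooling reads them transparently (Python's bz2 module, for instance, handles multi-stream files). A minimal sketch, with the filename assumed for illustration:

    # Sketch: read a pbzip2-produced dump with only the standard library.
    # bz2.open handles multi-stream .bz2 files, so pbzip2 output needs no
    # special handling. The filename is just an example.
    import bz2

    pages = 0
    with bz2.open("enwiki-pages-meta-history.xml.bz2", "rt", encoding="utf-8") as f:
        for line in f:
            if "<page>" in line:
                pages += 1
    print(pages, "pages")
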
> I'd also like to gauge everyone's opinion on moving away from the large
> file sizes of bz2 and going exclusively 7z. We'd save a huge amount of
> space doing it at a slightly larger cost during compression.
> Decompression of 7z these days is wicked fast.
>
> let me know
>
> --tomasz
-----Inline attachment follows-----
_______________________________________________
Xmldatadumps-l mailing list
Xmldatadumps-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l