[Xmldatadumps-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D

Lev Muchnik levmuchnik at gmail.com
Tue Mar 16 23:28:56 UTC 2010


Hi Jamie,

Looks cool! Thanks for the link. It seems to serve a different purpose,
though. It looks like one can keep the data compressed and access it directly
in the archive. That was never my objective. The setup I described is optimized
for a single pass through the data. It is perfect if you need to extract certain
elements and do not need repeated or random reads.
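
For illustration, here is a minimal Python sketch of this kind of one-pass
setup. It is not the C++/xerces-c code described in the quoted message below,
only an analogue under stated assumptions: Python's standard lzma module reads
.xz and legacy .lzma streams (not .7z containers), and the names TitleCollector
and stream_titles, as well as the choice of <title> as the element to extract,
are hypothetical. The decompressor is called repeatedly on small chunks and
each decompressed piece is fed straight to an incremental SAX parser, so the
full XML is never held in memory or written to disk.

```python
import io
import lzma
import xml.sax


class TitleCollector(xml.sax.ContentHandler):
    """Collects the text of every <title> element as it streams past."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buf = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._buf = []

    def characters(self, content):
        # Text may arrive in several pieces, so accumulate it.
        if self._in_title:
            self._buf.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buf))
            self._in_title = False


def stream_titles(compressed, chunk_size=64 * 1024):
    """Decompress a binary file object chunk by chunk and feed a SAX parser."""
    decompressor = lzma.LZMADecompressor()  # auto-detects .xz / .lzma input
    parser = xml.sax.make_parser()          # expat-based incremental parser
    handler = TitleCollector()
    parser.setContentHandler(handler)
    while True:
        chunk = compressed.read(chunk_size)
        if not chunk:
            break
        data = decompressor.decompress(chunk)
        if data:
            parser.feed(data)
    parser.close()
    return handler.titles


if __name__ == "__main__":
    # Assumed toy input; a real dump would be opened with open(path, "rb").
    sample = b"<mediawiki><page><title>A</title></page></mediawiki>"
    print(stream_titles(io.BytesIO(lzma.compress(sample))))
```

The trade-off is the one noted above: this works well for a single sequential
pass, but offers no random access into the compressed data.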

Lev

On Tue, Mar 16, 2010 at 7:13 PM, Jamie Morken <jmorken at shaw.ca> wrote:

> hi,
>
> I wonder how the zim file format: http://www.openzim.org/Main_Page
> would compare to the 7-zip file in regards to size and access speed?
>
>
> cheers,
> Jamie
>
>
> ----- Original Message -----
> From: Lev Muchnik <levmuchnik at gmail.com>
> Date: Tuesday, March 16, 2010 2:36 pm
> Subject: Re: [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki
> Checksumming pages-meta-history.xml.bz2 :D
> To: Jamie Morken <jmorken at shaw.ca>
> Cc: xmldatadumps-l at lists.wikimedia.org
>
> > The LZMA SDK <http://www.7-zip.org/sdk.html> provides a C-style API. The
> > only problem I find is that it requires polling: recurrent calls to
> > extract pieces of data. So I wrapped it with a C++ stream, which I feed
> > to the xerces-c SAX XML parser. SAX is really fun to use, and the speed
> > is amazing (3 days to process all languages except English).
> >
> > On Tue, Mar 16, 2010 at 6:17 PM, Jamie Morken
> > <jmorken at shaw.ca> wrote:
> >
> > >
> > > Hi,
> > >
> > > Is this code available to process the 7zip data on the fly? I had heard a
> > > rumour before that 7zip required multiple passes to decompress.
> > >
> > > cheers,
> > > Jamie
> > >
> > >
> > >
> > > ----- Original Message -----
> > > From: Lev Muchnik <levmuchnik at gmail.com>
> > > Date: Tuesday, March 16, 2010 1:55 pm
> > > Subject: Re: [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki
> > > Checksumming pages-meta-history.xml.bz2 :D
> > > To: Tomasz Finc <tfinc at wikimedia.org>
> > > Cc: Wikimedia developers <wikitech-l at lists.wikimedia.org>,
> > > xmldatadumps-admin-l at lists.wikimedia.org,
> > > Xmldatadumps-l at lists.wikimedia.org
> > >
> > > > I am entirely for 7z. In fact, once it is released, I'll be able to test
> > > > the XML integrity right away - I process the data on the fly, without
> > > > unpacking it first.
> > > >
> > > >
> > > > On Tue, Mar 16, 2010 at 4:45 PM, Tomasz Finc
> > > > <tfinc at wikimedia.org> wrote:
> > > >
> > > > > Kevin Webb wrote:
> > > > > > I just managed to finish decompression. That took about 54 hours on an
> > > > > > EC2 2.5x unit CPU. The final data size is 5469 GB.
> > > > > >
> > > > > > As the process just finished, I haven't been able to check the
> > > > > > integrity of the XML; however, the bzip stream itself appears to be
> > > > > > good.
> > > > > >
> > > > > > As was mentioned previously, it would be great if you could compress
> > > > > > future archives using pbzip2 to allow for parallel decompression. As I
> > > > > > understand it, the pbzip2 files are backward compatible with all
> > > > > > existing bzip2 utilities.
> > > > >
> > > > > Looks like the trade-off is slightly larger files due to pbzip2's
> > > > > algorithm for individual chunking. We'd have to change the
> > > > > buildFilters function in http://tinyurl.com/yjun6n5 and install the new
> > > > > binary. Ubuntu already has it in 8.04 LTS, making it easy.
> > > > >
> > > > > Any takers for the change?
> > > > >
> > > > > I'd also like to gauge everyone's opinion on moving away from the large
> > > > > file sizes of bz2 and going exclusively 7z. We'd save a huge amount of
> > > > > space doing it at a slightly larger cost during compression.
> > > > > Decompression of 7z these days is wicked fast.
> > > > >
> > > > > Let me know.
> > > > >
> > > > > --tomasz
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > _______________________________________________
> > > > > Xmldatadumps-admin-l mailing list
> > > > > Xmldatadumps-admin-l at lists.wikimedia.org
> > > > > https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-
> > > > admin-l
> > > > >
> > > >
> > >
> >
>