Hi Jamie,
Looks cool! Thanks for the link. It seems to serve a different purpose, though. It looks like one can keep the data compressed and access it directly in the archive. That was never my objective. The setup I described is optimized for a single pass through the data. It is perfect if you need to extract certain elements and do not need repeated or random reads.
Lev
On Tue, Mar 16, 2010 at 7:13 PM, Jamie Morken jmorken@shaw.ca wrote:
hi,
I wonder how the zim file format: http://www.openzim.org/Main_Page would compare to the 7-zip file in regards to size and access speed?
cheers, Jamie
----- Original Message ----- From: Lev Muchnik levmuchnik@gmail.com Date: Tuesday, March 16, 2010 2:36 pm Subject: Re: [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D To: Jamie Morken jmorken@shaw.ca Cc: xmldatadumps-l@lists.wikimedia.org
The LZMA SDK (http://www.7-zip.org/sdk.html) provides a C-style API. The only problem I find is that it requires polling: repeated calls to extract pieces of data. So I wrapped it in a C++ stream, which I feed to the xerces-c SAX XML parser. SAX is really fun to use, and the speed is amazing (3 days to process all languages except English).
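For readers who want to see the shape of this setup, here is a minimal sketch in Python rather than the C++/xerces-c code described above, which is not shown in this thread. It assumes Python's standard lzma module as a stand-in for the LZMA SDK; that module reads .xz and raw LZMA streams, not .7z archive containers. The handler TitleCounter and the function parse_compressed are hypothetical names made up for illustration.

```python
import lzma
import xml.sax


class TitleCounter(xml.sax.ContentHandler):
    """Hypothetical handler: counts <title> elements as they stream past."""

    def __init__(self):
        super().__init__()
        self.titles = 0

    def startElement(self, name, attrs):
        if name == "title":
            self.titles += 1


def parse_compressed(fileobj, handler, chunk_size=64 * 1024):
    """One pass: pull compressed chunks, decompress, push bytes into SAX."""
    decompressor = lzma.LZMADecompressor()
    parser = xml.sax.make_parser()  # expat-based incremental parser
    parser.setContentHandler(handler)
    while not decompressor.eof:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break  # input ended before the compressed stream did
        data = decompressor.decompress(chunk)
        if data:
            parser.feed(data)  # SAX callbacks fire as data arrives
    parser.close()  # raises if the XML document is incomplete
    return handler
```

Nothing is ever unpacked to disk, and memory use stays bounded by the chunk size, which is what makes this a fit for a single sequential pass and a poor fit for random access.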
On Tue, Mar 16, 2010 at 6:17 PM, Jamie Morken jmorken@shaw.ca wrote:
Hi,
Is this code available to process the 7zip data on the fly? I had heard a rumour before that 7zip required multiple passes to decompress.
cheers, Jamie
----- Original Message ----- From: Lev Muchnik levmuchnik@gmail.com Date: Tuesday, March 16, 2010 1:55 pm Subject: Re: [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D To: Tomasz Finc tfinc@wikimedia.org Cc: Wikimedia developers wikitech-l@lists.wikimedia.org, xmldatadumps-admin-l@lists.wikimedia.org, Xmldatadumps-l@lists.wikimedia.org
I am entirely for 7z. In fact, once released, I'll be able to test the XML integrity right away - I process the data on the fly, without unpacking it first.
On Tue, Mar 16, 2010 at 4:45 PM, Tomasz Finc tfinc@wikimedia.org wrote:
Kevin Webb wrote:
I just managed to finish decompression. That took about 54 hours on an EC2 2.5x unit CPU. The final data size is 5469GB.
As the process just finished I haven't been able to check the integrity of the XML, however, the bzip stream itself appears to be good.
As was mentioned previously, it would be great if you could compress future archives using pbzip2 to allow for parallel decompression. As I understand it, the pbzip2 files are backward compatible with all existing bzip2 utilities.
Looks like the trade-off is slightly larger files due to pbzip2's algorithm for individual chunking. We'd have to change the buildFilters function in http://tinyurl.com/yjun6n5 and install the new binary. Ubuntu already has it in 8.04 LTS, making it easy.
Any takers for the change?
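To illustrate the chunking idea discussed above, here is a small sketch using Python's standard bz2 module. It is not pbzip2's actual implementation; it only assumes the general approach of compressing each chunk as an independent bzip2 stream and concatenating the results. The function names compress_in_chunks and split_streams are hypothetical.

```python
import bz2


def compress_in_chunks(data, chunk_size):
    """Compress each chunk as its own bzip2 stream, then concatenate."""
    return b"".join(
        bz2.compress(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    )


def split_streams(blob):
    """Decompress one stream at a time, returning one piece per stream."""
    parts = []
    while blob:
        decompressor = bz2.BZ2Decompressor()
        parts.append(decompressor.decompress(blob))
        blob = decompressor.unused_data  # bytes after this stream's end
    return parts


data = b"<page><title>Example</title></page>\n" * 10000
chunked = compress_in_chunks(data, chunk_size=len(data) // 10)

# A standard multi-stream decoder reads the concatenation as one file.
assert bz2.decompress(chunked) == data
```

Because every stream is self-contained, each one carries its own header and trailer, which is where the extra file size comes from. The same independence is what lets a parallel tool hand different streams to different cores.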
I'd also like to gauge everyone's opinion on moving away from the large file sizes of bz2 and going exclusively 7z. We'd save a huge amount of space doing it at a slightly larger cost during compression. Decompression of 7z these days is wicked fast.
Let me know.
--tomasz
Xmldatadumps-admin-l mailing list Xmldatadumps-admin-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-admin-l