Hi,
I think we should keep at least one recent bz2 enwiki pages-meta-history file, because there are already some programs that use the bz2 format directly, and I don't know of any program that uses the 7z format natively.
Here are some offline wiki readers that use the bz2 format:
bzreader: http://code.google.com/p/bzreader/
mzReader: http://homepage.ntlworld.com/bharat.vadera/MzReader/
wikitaxi: http://www.wikitaxi.org/
(Note that none of these programs are currently set up for viewing the pages-meta-history revision data or discussion pages.)
If no pages-meta-history file is available in bz2 format (currently 280GB for enwiki), then the 7z file will have to be converted to bz2, unless it's possible to interface directly with the 7z file efficiently. Since the 7z file decompresses to 5469GB, as Kevin showed, I think most people would find it hard to decompress, whereas the 280GB bz2 file is still a reasonable size and can be used without decompressing. So keeping at least a single recent bz2 file would be the way to go. dewiki keeps about 6 of its pages-meta-history bz2 files (around 75GB each, roughly 450GB of storage): http://download.wikimedia.org/dewiki/ so I think enwiki should be able to keep at least one, especially after all this time without any of these files for enwiki.
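("Used without decompressing" here just means the reader decompresses bz2 blocks on demand instead of expanding the whole file; even a simple sequential scan never needs the full XML on disk. A minimal sketch in Python, where the filename is just a placeholder for whatever the current dump is called:)

    import bz2

    # Stream the dump straight out of the .bz2, decompressing on the fly,
    # so the ~5469GB of raw XML never has to exist on disk.
    # The filename is a placeholder for the actual dump name.
    with bz2.open("enwiki-pages-meta-history.xml.bz2", mode="rt", encoding="utf-8") as dump:
        for line in dump:
            if "<title>" in line:
                print(line.strip())  # or feed the page into an indexer/reader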
Also, I wonder if it is possible to convert from 7z to bz2 without having to make the 5469GB file first? If so, having only 7z files would be fine, as the bz2 file could be created on a "normal" PC (i.e. one without a 6TB+ hard drive). This would be a good solution, but I'm not sure if it can be done. If it could, we might as well get rid of all the large wikis' bz2 pages-meta-history files!
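(One possibility, untested: a straight pipe. 7z's -so switch writes the extracted stream to stdout, and bzip2 can recompress it on the fly, so the intermediate XML never touches the disk. Roughly, with placeholder filenames:)

    import subprocess

    # Placeholder filenames; substitute the real dump names.
    SEVENZ_IN = "enwiki-pages-meta-history.xml.7z"
    BZ2_OUT = "enwiki-pages-meta-history.xml.bz2"

    # "7z x -so" extracts to stdout; bzip2 recompresses the stream as it
    # arrives, so only the ~280GB .bz2 output is ever written to disk.
    extract = subprocess.Popen(["7z", "x", "-so", SEVENZ_IN], stdout=subprocess.PIPE)
    with open(BZ2_OUT, "wb") as out:
        compress = subprocess.Popen(["bzip2", "-c"], stdin=extract.stdout, stdout=out)
        extract.stdout.close()  # let 7z see a broken pipe if bzip2 exits early
        compress.wait()
        extract.wait()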
cheers,
Jamie
----- Original Message -----
From: Tomasz Finc <tfinc@wikimedia.org>
Date: Tuesday, March 16, 2010 12:45 pm
Subject: Re: [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D
To: Kevin Webb <kpwebb@gmail.com>
Cc: Wikimedia developers <wikitech-l@lists.wikimedia.org>, xmldatadumps-admin-l@lists.wikimedia.org, Xmldatadumps-l@lists.wikimedia.org
Kevin Webb wrote:
I just managed to finish decompression. That took about 54 hours on an EC2 2.5x unit CPU. The final data size is 5469GB.
As the process just finished I haven't been able to check the integrity of the XML; however, the bzip stream itself appears to be good.
As was mentioned previously, it would be great if you could compress future archives using pbzip2 to allow for parallel decompression. As I understand it, pbzip2 files are backward compatible with all existing bzip2 utilities.
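(A quick way to sanity-check that compatibility claim, assuming pbzip2 is installed: compress a sample with pbzip2 and then test and decompress it with stock bzip2. The path below is just a placeholder.)

    import subprocess

    SAMPLE = "sample.xml"  # placeholder; any test file will do

    # Compress with pbzip2 (multi-stream output), then verify that plain
    # bzip2 can test and decompress the result.
    subprocess.run(["pbzip2", "-f", "-k", SAMPLE], check=True)                # -> sample.xml.bz2
    subprocess.run(["bzip2", "-t", SAMPLE + ".bz2"], check=True)              # integrity test with stock bzip2
    subprocess.run(["bzip2", "-d", "-k", "-f", SAMPLE + ".bz2"], check=True)  # decompress with stock bzip2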
Looks like the trade-off is slightly larger files due to pbzip2's algorithm for individual chunking. We'd have to change the buildFilters function in http://tinyurl.com/yjun6n5 and install the new binary. Ubuntu already has it in 8.04 LTS, making it easy. Any takers for the change?
I'd also like to gauge everyone's opinion on moving away from the large file sizes of bz2 and going exclusively 7z. We'd save a huge amount of space doing it, at a slightly higher cost during compression. Decompression of 7z these days is wicked fast.
Let me know.
--tomasz