[Xmldatadumps-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D
Jamie Morken
jmorken at shaw.ca
Wed Mar 17 02:11:44 UTC 2010
Hi,
I think we should keep at least one version of a recent bz2 enwiki
pages-meta-history file because there are already some programs that use
the bz2 format directly, and I don't know of any program that uses the
7z format natively.
Here are some offline wiki readers that use the bz2 format:
bzreader: http://code.google.com/p/bzreader/
mzReader: http://homepage.ntlworld.com/bharat.vadera/MzReader/
wikitaxi: http://www.wikitaxi.org/
(note that none of these programs are currently set up for viewing the
pages-meta-history revision data or discussion pages)
If no pages-meta-history file is available in bz2 format (currently
280GB for enwiki), then the 7z file will have to be converted to bz2,
unless it is possible to read the 7z file directly and efficiently,
and I am not sure that it is. Since the 7z file decompresses to
5469GB, as Kevin showed, I think it would be hard for most people to
decompress it, but the 280GB bz2 file is still a reasonable size and
can be used without decompressing. So I think keeping at least a single
recent bz2 file would be the way to go. The dewiki keeps about 6 of
their pages-meta-history bz2 files (around 75GB each, so about 450GB of
storage) at http://download.wikimedia.org/dewiki/ so I think enwiki
should be able to keep at least one, especially after all this time of
not having any of these files for enwiki.
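
To illustrate what "used without decompressing" can look like, here is
a minimal sketch that reads a bz2 dump as a stream, so the full XML
never has to exist on disk. It is not taken from any of the readers
listed above. The function name count_pages is hypothetical, and the
check for lines starting with <page> is a simplifying assumption about
how the dump is laid out.

```python
import bz2
import io


def count_pages(stream):
    """Count lines that open a <page> element in a bz2-compressed dump.

    `stream` may be a file name or a binary file object. The data is
    decompressed incrementally, one line at a time, so memory use stays
    small regardless of how large the decompressed XML would be.
    """
    pages = 0
    with bz2.open(stream, mode="rt", encoding="utf-8") as dump:
        for line in dump:
            # Assumption: each <page> tag starts its own line.
            if line.lstrip().startswith("<page>"):
                pages += 1
    return pages


if __name__ == "__main__":
    # Tiny in-memory stand-in for a real dump file.
    sample = bz2.compress(
        b"<mediawiki>\n  <page>\n  </page>\n  <page>\n  </page>\n</mediawiki>\n"
    )
    print(count_pages(io.BytesIO(sample)))
```

A real reader would also need an index to jump to a given article, which
this sketch does not attempt.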
Also I wonder if it is possible to convert from 7z to bz2 without having
to make the 5469GB file first? If this can be done then having only 7z
files would be fine, as the bz2 file could be created with a "normal"
PC (i.e. one without a 6TB+ hard drive). This would be a good solution,
but I am not sure if it can be done. If it could though, we might as
well get rid of all the large wikis' bz2 pages-meta-history files!
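
One way this might work is to pipe the decompressed data straight into
a bz2 compressor, one chunk at a time, so the 5469GB file is never
written. Below is a rough sketch of the recompression half only. The
function name recompress_to_bz2 and the 1MB chunk size are my own
choices, and it assumes the 7z tool can write its decompressed output
to a pipe (I believe the -so switch does this, but I have not tried it
on a file this size).

```python
import bz2
import io

CHUNK_SIZE = 1024 * 1024  # 1MB per read; an arbitrary choice


def recompress_to_bz2(src, dst, chunk_size=CHUNK_SIZE):
    """Copy uncompressed bytes from `src` to `dst` as a bz2 stream.

    `src` and `dst` are binary file objects. `src` could be the stdout
    pipe of a 7z process (an assumption, see above). Only one chunk is
    held in memory at a time. Returns the number of uncompressed bytes.
    """
    compressor = bz2.BZ2Compressor(9)
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        dst.write(compressor.compress(chunk))
    # Emit whatever the compressor is still buffering.
    dst.write(compressor.flush())
    return total


if __name__ == "__main__":
    # In-memory stand-in for the decompressed 7z output.
    raw = b"<page>example</page>\n" * 1000
    out = io.BytesIO()
    print(recompress_to_bz2(io.BytesIO(raw), out), len(out.getvalue()))
```

Disk use would then be just the 7z input plus the bz2 output, though the
conversion would still cost the full decompression and compression time.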
cheers,
Jamie
----- Original Message -----
From: Tomasz Finc <tfinc at wikimedia.org>
Date: Tuesday, March 16, 2010 12:45 pm
Subject: Re: [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D
To: Kevin Webb <kpwebb at gmail.com>
Cc: Wikimedia developers <wikitech-l at lists.wikimedia.org>, xmldatadumps-admin-l at lists.wikimedia.org, Xmldatadumps-l at lists.wikimedia.org
> Kevin Webb wrote:
> > I just managed to finish decompression. That took about 54 hours on an
> > EC2 2.5x unit CPU. The final data size is 5469GB.
> >
> > As the process just finished I haven't been able to check the
> > integrity of the XML, however, the bzip stream itself appears to be
> > good.
> >
> > As was mentioned previously, it would be great if you could compress
> > future archives using pbzip2 to allow for parallel decompression. As I
> > understand it, the pbzip2 files are backward compatible with all
> > existing bzip2 utilities.
>
> Looks like the trade-off is slightly larger files due to pbzip2's
> algorithm for individual chunking. We'd have to change the
> buildFilters function in http://tinyurl.com/yjun6n5 and install the new
> binary. Ubuntu already has it in 8.04 LTS, making it easy.
>
> Any takers for the change?
>
> I'd also like to gauge everyone's opinion on moving away from the large
> file sizes of bz2 and going exclusively 7z. We'd save a huge amount of
> space doing it at a slightly larger cost during compression.
> Decompression of 7z these days is wicked fast.
>
> Let me know.
>
> --tomasz