Hi there, I have experience with this topic.
Here is a simple read function:

    use IO::Uncompress::Bunzip2 qw($Bunzip2Error);
    use IO::File;

    # Return the whole contents of $filename, decompressing it if it is a .bz2 file.
    sub ReadFile {
        my $filename = shift;
        my $html = "";
        my $fh;
        if ($filename =~ /\.bz2$/) {
            $fh = IO::Uncompress::Bunzip2->new($filename)
                or die "Couldn't open bzipped input file: $Bunzip2Error\n";
        } else {
            # IO::File reports failures in $!, not $@
            $fh = IO::File->new($filename, '<')
                or die "Couldn't open input file: $!\n";
        }
        while (<$fh>) { $html .= $_; }
        return $html;
    }

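ReadFile pulls the entire file into memory, which may not be practical for a full dump. A minimal sketch of a streaming variant follows, assuming the script only needs to see one line at a time. The name EachLine and the callback interface are hypothetical, not part of the code above, and MultiStream is switched on only as a precaution in case the file consists of several concatenated bzip2 streams.

```perl
use strict;
use warnings;
use IO::Uncompress::Bunzip2 qw($Bunzip2Error);

# Hypothetical helper: call $callback once per line of $filename,
# decompressing on the fly when the name ends in .bz2.
sub EachLine {
    my ($filename, $callback) = @_;
    my $fh;
    if ($filename =~ /\.bz2$/) {
        # MultiStream => 1 keeps reading past the end of the first
        # bzip2 stream if several streams are concatenated.
        $fh = IO::Uncompress::Bunzip2->new($filename, MultiStream => 1)
            or die "Couldn't open bzipped input file: $Bunzip2Error\n";
    } else {
        open($fh, '<', $filename)
            or die "Couldn't open input file: $!\n";
    }
    while (my $line = <$fh>) {
        $callback->($line);
    }
    close($fh);
    return;
}

# Example use (file name is a placeholder): count lines containing <page>.
# my $pages = 0;
# EachLine('dump.xml.bz2', sub { $pages++ if $_[0] =~ /<page>/ });
```

Only one line is held in memory at a time, so the size of the dump does not matter to the script.
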
I have examples of how to process the huge bz2 file in parts, without downloading the whole thing, here:
http://bazaar.launchpad.net/~jamesmikedupont/+junk/openstreetmap-wikipedia/v...

Basically, you can download part of the file over HTTP by setting a Range header:
http://bazaar.launchpad.net/~jamesmikedupont/+junk/openstreetmap-wikipedia/v...

    $req->init_header('Range' => sprintf("bytes=%s-%s", $startpos, $endpos - 1));

Then use bzip2recover to extract the data from that block.

Let me know if you have any questions.
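
The partial download can be sketched roughly as follows. This is not the code from the linked repository: it swaps in the core HTTP::Tiny module in place of the request object used above, and the helper names RangeHeader and FetchRange, as well as the output file argument, are made up for illustration. As in the sprintf above, $endpos is treated as exclusive.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Build the Range header value for bytes $startpos .. $endpos - 1
# ($endpos itself is not included, as in the sprintf above).
sub RangeHeader {
    my ($startpos, $endpos) = @_;
    return sprintf("bytes=%s-%s", $startpos, $endpos - 1);
}

# Hypothetical helper: fetch one byte range of $url and save it to $outfile.
# Returns the number of bytes written.
sub FetchRange {
    my ($url, $startpos, $endpos, $outfile) = @_;
    my $res = HTTP::Tiny->new->get($url,
        { headers => { Range => RangeHeader($startpos, $endpos) } });
    # 206 Partial Content means the server honoured the range;
    # a plain 200 would mean it sent the whole file instead.
    die "Range request failed: $res->{status} $res->{reason}\n"
        unless $res->{status} == 206;
    open(my $out, '>:raw', $outfile)
        or die "Couldn't write $outfile: $!\n";
    print {$out} $res->{content};
    close($out) or die "Couldn't close $outfile: $!\n";
    return length $res->{content};
}
```

Running bzip2recover on the saved file writes each complete compressed block it finds to its own rec*.bz2 file, which can then be decompressed normally. Blocks cut off at the start or end of the range are not recovered, so consecutive ranges presumably need to overlap a little.
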
On Sat, Dec 10, 2011 at 8:52 PM, Jeremy Baron jeremy@tuxmachine.com wrote:
On Sat, Dec 10, 2011 at 14:18, Stefan Kühn kuehn-s@gmx.net wrote:
I work with Perl and need the uncompressed XML file to read the dump. I have no idea how to read a compressed file with Perl.
Is it sufficient to receive the XML on stdin or do you need to be able to seek?
It is trivial to give you XML on stdin, e.g.:

    $ < path/to/bz2 bzip2 -d | perl script.pl

-Jeremy
Toolserver-l mailing list (Toolserver-l@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/toolserver-l
Posting guidelines for this list: https://wiki.toolserver.org/view/Mailing_list_etiquette