Hi there,
I have experience with this topic.
Here is a simple read function:
use strict;
use warnings;
use IO::Uncompress::Bunzip2 qw($Bunzip2Error);
use IO::File;

# Read a whole file into a string, decompressing it on the fly
# when the file name ends in .bz2.
sub ReadFile
{
    my $filename = shift;
    my $html = "";
    my $fh;
    if ($filename =~ /\.bz2$/)
    {
        $fh = IO::Uncompress::Bunzip2->new($filename)
            or die "Couldn't open bzipped input file: $Bunzip2Error\n";
    }
    else
    {
        # IO::File reports failures in $!, not $@
        $fh = IO::File->new($filename, "r")
            or die "Couldn't open input file: $!\n";
    }
    while (my $line = <$fh>)
    {
        $html .= $line;
    }
    $fh->close;
    return $html;
}
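ReadFile keeps the whole file in memory, which may not be practical for a full dump. Something like the following should work line by line instead. It is only a sketch: the name ProcessFile and the callback argument are my own invention for illustration, not part of any module.

```perl
use strict;
use warnings;
use IO::Uncompress::Bunzip2 qw($Bunzip2Error);
use IO::File;

# Hypothetical helper: hand each line to $callback instead of
# collecting the whole file in one string.
sub ProcessFile
{
    my ($filename, $callback) = @_;
    my $fh;
    if ($filename =~ /\.bz2$/)
    {
        $fh = IO::Uncompress::Bunzip2->new($filename)
            or die "Couldn't open bzipped input file: $Bunzip2Error\n";
    }
    else
    {
        $fh = IO::File->new($filename, "r")
            or die "Couldn't open input file: $!\n";
    }
    my $count = 0;
    while (my $line = <$fh>)
    {
        $callback->($line);
        $count++;
    }
    $fh->close;
    return $count;    # number of lines handed to the callback
}
```

A caller could then count pages with a closure such as sub { $pages++ if $_[0] =~ /<page>/ }, so memory use stays at roughly one line at a time.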
I have examples of how to process the huge bz2 file in parts, without
downloading the whole thing, here:
http://bazaar.launchpad.net/~jamesmikedupont/+junk/openstreetmap-wikipedia/…
Basically, you can download part of the file over HTTP by sending a Range header:
http://bazaar.launchpad.net/~jamesmikedupont/+junk/openstreetmap-wikipedia/…
$req->init_header('Range' => sprintf("bytes=%s-%s",
    $startpos,
    $endpos - 1
));
Then use bzip2recover to extract the data from that block.
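To make the partial download concrete, here is a rough, self-contained sketch. Note that it uses HTTP::Tiny (shipped with recent Perls) rather than the request object from my script, and that RangeHeader and FetchChunk are hypothetical names picked for this example.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Build the Range header value for the half-open byte interval
# [$startpos, $endpos). HTTP ranges are inclusive, hence the "- 1".
sub RangeHeader
{
    my ($startpos, $endpos) = @_;
    die "empty range\n" unless $endpos > $startpos;
    return sprintf("bytes=%d-%d", $startpos, $endpos - 1);
}

# Fetch one chunk of a remote file and save it to $outfile.
sub FetchChunk
{
    my ($url, $startpos, $endpos, $outfile) = @_;
    my $response = HTTP::Tiny->new->get($url,
        { headers => { Range => RangeHeader($startpos, $endpos) } });
    # 206 Partial Content means the server honoured the range;
    # a plain 200 would mean it sent the whole file instead.
    die "Range request failed: $response->{status} $response->{reason}\n"
        unless $response->{status} == 206;
    open(my $out, ">:raw", $outfile)
        or die "Couldn't write $outfile: $!\n";
    print {$out} $response->{content};
    close($out) or die "Couldn't close $outfile: $!\n";
    return length $response->{content};
}
```

Running bzip2recover on the saved chunk then writes every complete block it finds to its own file (rec00001chunk.bz2, rec00002chunk.bz2, and so on for an input named chunk.bz2), and each of those can be read with bunzip2 or IO::Uncompress::Bunzip2. Blocks cut off at either end of the chunk are lost, so overlapping the byte ranges a little is probably wise.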
Let me know if you have any questions.
On Sat, Dec 10, 2011 at 8:52 PM, Jeremy Baron <jeremy(a)tuxmachine.com> wrote:
> On Sat, Dec 10, 2011 at 14:18, Stefan Kühn <kuehn-s(a)gmx.net> wrote:
>> I work with perl and need the
>> uncompressed file in XML to read the dump. I have no idea how to read
>> with perl a compressed file.
>
> Is it sufficient to receive the XML on stdin or do you need to be able to seek?
>
> It is trivial to give you XML on stdin e.g.
> $ < path/to/bz2 bzip2 -d | perl script.pl
>
> -Jeremy
_______________________________________________
Toolserver-l mailing list (Toolserver-l(a)lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/toolserver-l
Posting guidelines for this list:
https://wiki.toolserver.org/view/Mailing_list_etiquette
--
James Michael DuPont
Member of Free Libre Open Source Software Kosova
http://flossk.org