Re: [Toolserver-l] Dumps handling / storage / updating etc...

11 Dec 2011

Hi there,
I have experience with this topic.

Here is a simple read function:
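use strict;
use warnings;
use IO::Uncompress::Bunzip2 qw($Bunzip2Error);
use IO::File;

# Read a plain or bzip2-compressed file and return its contents as one string.
sub ReadFile
{
    my $filename = shift;
    my $html = "";
    my $fh;
    if ($filename =~ /\.bz2$/)
    {
        # MultiStream => 1 also reads files made of several concatenated streams
        $fh = IO::Uncompress::Bunzip2->new($filename, MultiStream => 1)
            or die "Couldn't open bzipped input file: $Bunzip2Error\n";
    }
    else
    {
        # $! (not $@) holds the reason a failed open reports
        $fh = IO::File->new($filename, 'r')
            or die "Couldn't open input file: $!\n";
    }

    while (<$fh>)
    {
        $html .= $_;
    }
    return $html;
}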
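
Since the function above collects the whole file in one string, a dump of several gigabytes may not fit in memory. A minimal sketch of a line-at-a-time variant follows; EachLine and its callback argument are names made up for this illustration, and the <page> count is only an example of per-line work.

```perl
use strict;
use warnings;
use IO::Uncompress::Bunzip2 qw($Bunzip2Error);
use IO::File;

# Hypothetical helper: call $callback once per line instead of
# accumulating the whole file in memory.
sub EachLine
{
    my ($filename, $callback) = @_;
    my $fh;
    if ($filename =~ /\.bz2$/)
    {
        $fh = IO::Uncompress::Bunzip2->new($filename, MultiStream => 1)
            or die "Couldn't open bzipped input file: $Bunzip2Error\n";
    }
    else
    {
        $fh = IO::File->new($filename, 'r')
            or die "Couldn't open input file: $!\n";
    }
    while (defined(my $line = $fh->getline))
    {
        $callback->($line);
    }
    $fh->close;
    return;
}

# Example: count lines containing <page> in the file named on the command line.
if (@ARGV)
{
    my $pages = 0;
    EachLine($ARGV[0], sub { $pages++ if $_[0] =~ /<page>/ });
    print "$pages pages\n";
}
```

Run it as "perl count_pages.pl dump.xml.bz2" (the script name is arbitrary); only one line is held in memory at a time.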

I have examples of how to process the huge bz2 file in parts, without
downloading the whole thing, here:
http://bazaar.launchpad.net/~jamesmikedupont/+junk/openstreetmap-wikipedia/…
Basically, you can download part of the file over HTTP with a Range header:
http://bazaar.launchpad.net/~jamesmikedupont/+junk/openstreetmap-wikipedia/…
    $req->init_header('Range' => sprintf("bytes=%s-%s",
                                         $startpos,
                                         $endpos - 1));

Then use bzip2recover to extract the data from that block.
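
To make the two steps concrete, here is a rough sketch, not the code from the repository linked above. It assumes the core HTTP::Tiny module instead of the request object used in the snippet, that bzip2recover is on the PATH, and that the server supports Range requests; RangeHeader and FetchRange are names invented for this example.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Build the Range header value for the half-open byte interval
# [$startpos, $endpos); HTTP ranges are inclusive, hence the - 1.
sub RangeHeader
{
    my ($startpos, $endpos) = @_;
    return sprintf("bytes=%d-%d", $startpos, $endpos - 1);
}

# Hypothetical helper: fetch one byte range of $url into $outfile.
sub FetchRange
{
    my ($url, $startpos, $endpos, $outfile) = @_;
    my $res = HTTP::Tiny->new->get($url,
        { headers => { Range => RangeHeader($startpos, $endpos) } });
    # 206 Partial Content means the server honoured the Range header.
    die "Range request failed: $res->{status} $res->{reason}\n"
        unless $res->{status} == 206;
    open(my $out, '>:raw', $outfile) or die "Couldn't write $outfile: $!\n";
    print {$out} $res->{content};
    close($out) or die "Couldn't close $outfile: $!\n";
    return length $res->{content};
}

# Usage: perl fetch_part.pl URL START END OUTFILE
if (@ARGV == 4)
{
    my ($url, $start, $end, $outfile) = @ARGV;
    FetchRange($url, $start, $end, $outfile);
    # bzip2recover scans for bzip2 block boundaries and writes each
    # block it finds to a separate rec*.bz2 file.
    system('bzip2recover', $outfile) == 0
        or die "bzip2recover failed: $?\n";
}
```

Because a byte range rarely starts or ends on a block boundary, the partial blocks at either end of the downloaded piece will most likely be unusable; overlapping consecutive ranges a little is one way around that.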

Let me know if you have any questions.

On Sat, Dec 10, 2011 at 8:52 PM, Jeremy Baron <jeremy(a)tuxmachine.com> wrote:
...
  On Sat, Dec 10, 2011 at 14:18, Stefan Kühn <kuehn-s(a)gmx.net> wrote:
    I work with perl and need the uncompressed file in XML to read the
    dump. I have no idea how to read with perl a compressed file.

  Is it sufficient to receive the XML on stdin or do you need to be able to seek?

  It is trivial to give you XML on stdin e.g.
  $ < path/to/bz2 bzip2 -d | perl script.pl

  -Jeremy

 _______________________________________________
 Toolserver-l mailing list (Toolserver-l(a)lists.wikimedia.org)
 https://lists.wikimedia.org/mailman/listinfo/toolserver-l
 Posting guidelines for this list: https://wiki.toolserver.org/view/Mailing_list_etiquette

-- 
James Michael DuPont
Member of Free Libre Open Source Software Kosova http://flossk.org
