On 12/11/2011 10:45 AM, Stefan Kühn wrote:
Hmm, reading from stdin is possible, but I think this would need a lot of RAM on the server. I don't think this is an option for the future. Every language grows every day, and the dumps will also grow.
No, Stefan, it's not a matter of RAM, but of CPU. When your program reads from a pipe, the decompression program (bunzip2 or gunzip) consumes a few extra processor cycles every time your program reads the next kilobyte or megabyte of input. Most often, these CPU cycles are cheaper than storing the uncompressed XML file on disk.
Sometimes, reading compressed data and decompressing it is also faster than reading the larger uncompressed data from disk.
If you read the entire compressed file into RAM and decompress it there before starting to use it, then a lot of RAM will indeed be needed. But there is no reason to do this for an XML file, which can always be processed as a stream or sequence. (Remember that UNIX pipes were invented at a time when streaming data from one tape drive to another was common, and a PDP-11 had 32 Kbyte of RAM.)
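To make the streaming point concrete, here is a minimal sketch (the dump filename is only an example, not from my actual script) that counts <page> elements in a bzip2-compressed XML dump while keeping only the current line in RAM:

    my $dump = "enwiki-20111128-pages-articles.xml.bz2";   # example filename
    open(DUMP, "bunzip2 <$dump |") or die "cannot start bunzip2: $!";
    my $pages = 0;
    while (<DUMP>) {
        # Only this one line is held in memory; bunzip2 runs as a
        # separate process and decompresses just ahead of what we read.
        $pages++ if /<page>/;
    }
    close(DUMP);
    print "$pages pages\n";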
Here's how I read the *.sql.gz files in Perl:
    my $page = "enwiki-20111128-page.sql.gz";
    if ($page =~ /\.gz$/) {
        # compressed: read through a gunzip pipe
        open(PAGE, "gunzip <$page |");
    } else {
        # uncompressed: read the file directly
        open(PAGE, "<$page");
    }
    while (<PAGE>) {
        chomp;
        ...
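The test on the file name is what lets the same loop handle both the compressed dump and an already unpacked .sql copy; nothing inside the loop has to change. In practice it is also worth checking the return value of open (open(PAGE, ...) or die "$page: $!") so a missing or misnamed dump fails loudly instead of silently producing an empty loop.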