On 12/11/2011 10:45 AM, Stefan Kühn wrote:
Hmm, reading from stdin is possible, but I think this would need a lot of RAM on the server. I don't think this is an option for the future. Every language grows every day, and the dumps will also grow.
No, Stefan, it's not a matter of RAM, but of CPU. When your program reads from a pipe, the decompression program (bunzip2 or gunzip) consumes a few extra processor cycles every time your program reads the next kilobyte or megabyte of input. Most often, these CPU cycles are cheaper than storing the uncompressed XML file on disk.
Sometimes, reading compressed data and decompressing it is also faster than reading the larger uncompressed data from disk.
If you read the entire compressed file into RAM and decompress it there before starting to use it, then a lot of RAM will indeed be needed. But there is no reason to do this for an XML file, which can always be processed as a stream or sequence. (Remember that UNIX pipes were invented at a time when streaming data from one tape drive to another was common, and a PDP-11 had 32 Kbyte of RAM.)
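To make the streaming point concrete, here is a minimal sketch (the dump filename is only an example, not from my actual script) that counts <page> elements in a bzip2-compressed XML dump while keeping only the current line in RAM:

    my $dump = "enwiki-20111128-pages-articles.xml.bz2";   # example filename
    open(DUMP, "bunzip2 <$dump |") or die "cannot start bunzip2: $!";
    my $pages = 0;
    while (<DUMP>) {
        # Only this one line is held in memory; bunzip2 runs as a
        # separate process and decompresses just ahead of what we read.
        $pages++ if /<page>/;
    }
    close(DUMP);
    print "$pages pages\n";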
Here's how I read the *.sql.gz files in Perl:
    my $page = "enwiki-20111128-page.sql.gz";
    if ($page =~ /\.gz$/) {
        # compressed: read through a gunzip pipe
        open(PAGE, "gunzip <$page |");
    } else {
        # uncompressed: read the file directly
        open(PAGE, "<$page");
    }
    while (<PAGE>) {
        chomp;
        ...
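The test on the file name is what lets the same loop handle both the compressed dump and an already unpacked .sql copy; nothing inside the loop has to change. In practice it is also worth checking the return value of open (open(PAGE, ...) or die "$page: $!") so a missing or misnamed dump fails loudly instead of silently producing an empty loop.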