On 11/12/11 10:45, Stefan Kühn wrote:
On 10.12.2011 20:52, Jeremy Baron wrote:
Is it sufficient to receive the XML on stdin or
do you need to be able to seek?
It is trivial to give you XML on stdin, e.g.:
$ < path/to/bz2 bzip2 -d | perl script.pl
Hmm, reading from stdin is possible, but I think it will need a lot of
RAM on the server. I don't think this is an option for the future. Every
language grows every day, and the dumps will grow too. The next problem
is parallel use of a compressed file. If several users read the same
compressed file the way you suggest, then bzip2 will crash the server IMHO.
I think it is no problem to store the uncompressed XML files for easy
use. We should make rules about where they have to be kept and for how
long, or we need a list where every user can say "I need only the two
newest dumps of enwiki, dewiki,...". If a dump is no longer needed, we
can delete the file.
Stefan (sk)
You seem to think that piping the output from bzip2 will hold the XML
dump uncompressed in memory until your script processes it. That's wrong.
bzip2 will begin decompressing and writing to the pipe; when the pipe
fills, bzip2 blocks. As your perl script reads from the pipe, space is
freed and the decompression can progress. Memory use is bounded by the
pipe buffer, not by the size of the dump.
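
To illustrate the reader side of such a pipe, here is a minimal sketch in
Python rather than Perl. The function name count_pages and the assumption
that each <page> start tag sits on its own line are mine, not something
from the dumps' documentation. The point is only that the consumer
handles one line at a time, so it never needs the whole uncompressed dump
in memory.

```python
import io
from typing import Iterable


def count_pages(lines: Iterable[str]) -> int:
    """Count lines containing a <page> start tag.

    `lines` can be any iterable of text lines, e.g. sys.stdin. Only the
    current line is held in memory at any time.
    """
    count = 0
    for line in lines:
        # Assumption: each <page> start tag appears on its own line.
        if "<page>" in line:
            count += 1
    return count


if __name__ == "__main__":
    # Tiny in-memory demo; in a real pipeline pass sys.stdin instead.
    sample = io.StringIO(
        "<mediawiki>\n  <page>\n  </page>\n  <page>\n  </page>\n</mediawiki>\n"
    )
    print(count_pages(sample))  # prints 2
```

In a pipeline like the one quoted above, the script would call
count_pages(sys.stdin) (after import sys) and be run as the last stage in
place of perl script.pl. While the script is busy and the pipe is full,
bzip2 simply waits.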