On 11/12/11 10:45, Stefan Kühn wrote:
On 10.12.2011 20:52, Jeremy Baron wrote:
Is it sufficient to receive the XML on stdin or do you need to be able to seek?
It is trivial to give you XML on stdin, e.g. $ < path/to/bz2 bzip2 -d | perl script.pl
Hmm, stdin is possible, but I think this will need a lot of RAM on the server, so I do not think it is an option for the future. Every language grows every day, and the dumps will grow too. The next problem is parallel use of a compressed file: if more users read the compressed file the way you suggest, then bzip2 will crash the server, IMHO.
I think it is no problem to store the uncompressed XML files for easy use. We should make rules about where they have to be kept and for how long, or we need a list where every user can say "I need only the two newest dumps of enwiki, dewiki, ...". If a dump is no longer needed, we can delete the file.
Stefan (sk)
You seem to think that piping the output of bzip2 will hold the XML dump uncompressed in memory until your script processes it. That's wrong. bzip2 will begin decompressing and writing to the pipe; when the pipe fills, it will block. As your Perl script reads from the pipe, space is freed and the decompression can progress.
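To illustrate the reader's side of that pipe, here is a minimal sketch of a consumer that never holds more than one page in memory. It is written in Python rather than Perl purely for illustration, and it is not your script: the function name count_pages, the assumption that the dump is a sequence of <page> elements under one root, and the tiny sample document are all mine.

```python
import io
import xml.etree.ElementTree as ET


def count_pages(stream):
    """Count <page> elements in an XML stream, parsing incrementally."""
    pages = 0
    # iterparse pulls data from the stream in chunks instead of loading it all.
    context = ET.iterparse(stream, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        # Tags may carry a namespace prefix such as "{uri}page"; strip it.
        if event == "end" and elem.tag.rsplit("}", 1)[-1] == "page":
            pages += 1
            root.clear()  # discard finished pages so memory use stays flat
    return pages


# Hypothetical two-page document standing in for a real dump.
sample = (
    b"<mediawiki>"
    b"<page><title>A</title></page>"
    b"<page><title>B</title></page>"
    b"</mediawiki>"
)
print(count_pages(io.BytesIO(sample)))
```

On a real dump you would pass sys.stdin.buffer instead of the in-memory sample and run it at the end of the same kind of pipeline as above. The parser only asks for the next chunk when it has finished with the previous one, so bzip2 stays blocked on the full pipe in the meantime.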