Hello and happy new year to all fellow xml-crunchers out there. Anyone who uses the bz2 page dumps, whether for Hadoop or for some other reason, might be interested in the following:
(For en wikipedia only, for right now.)
I've written a little program that writes a second copy of the pages-articles bz2 file as multiple concatenated bz2 streams, 100 pages per stream, along with a separate bz2-compressed index file listing offset / page id / page title, where each offset points to the start of the bz2 stream containing that page.
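To give an idea of what that looks like in practice, here's a rough Python sketch of reading the index; the filename and the assumption of colon-separated fields are mine, so check against the actual file:

import bz2

# Rough sketch: build a list of (offset, page id, title) tuples from the
# index file.  Assumes colon-separated fields; titles can themselves
# contain colons, so split at most twice.
index = []
with bz2.open("enwiki-pages-articles-multistream-index.txt.bz2",
              mode="rt", encoding="utf-8") as f:
    for line in f:
        offset, page_id, title = line.rstrip("\n").split(":", 2)
        index.append((int(offset), int(page_id), title))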
The nice thing about multiple streams is that they essentially behave like separate bz2 files: you can seek to the given offset in the file, feed the bytes from that point straight into the bz2 decompressor of your choice, and work with the output. No need to monkey around with bit-aligned crap, nor to fudge together a bz2 header to fool the library into thinking it's looking at a complete file, nor to toss away the CRC at the end.
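Something along these lines should do it with Python's standard bz2 module (again just a sketch; the function name and filename are only illustrative):

import bz2

# Pull one 100-page stream out of the multistream file and decompress
# just that stream.  offset and next_offset are two consecutive offsets
# taken from the index.
def read_stream(path, offset, next_offset):
    with open(path, "rb") as f:
        f.seek(offset)
        raw = f.read(next_offset - offset)
    # Each stream is a complete bz2 stream on its own, so a fresh
    # decompressor takes it as-is.
    return bz2.BZ2Decompressor().decompress(raw).decode("utf-8")

# e.g.:
# xml_chunk = read_stream("enwiki-pages-articles-multistream.xml.bz2",
#                         offset, next_offset)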
Grouping 100 pages per stream makes the output somewhat bigger than the regular pages-articles file, but not excessively so.
I'm hoping this format will be useful to folks working with offline readers, Hadoop, or other analysis tools. Let me know.
A first run of these, for the December en wp pages-articles file, is available at http://dumps.wikimedia.org/other/multistream/ . The second run (assuming this works; I just deployed the changes to the Python scripts) will be generated as part of the regular dumps and will show up with them in the usual location.
If the job runs OK, I expect to enable it for the other projects soon afterwards.
Ariel