Hello and happy new year to all fellow xml-crunchers out there. Anyone
who uses the bz2 file format of the page dumps, either for hadoop or for
some other reason, might be interested in the following:
(For en wikipedia only, for right now.)
I've written a little program that will write a second copy of the
pages-articles bz2 file, as concatenated multiple bz2 streams with 100
pages per stream, with a separate index bz2 file which contains a list
of offsets/page ids/page titles where each offset is to the start of the
particular bz2 stream in the file.
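
To make the layout concrete, here is a rough sketch of the idea in python.
It is not the actual dump script; the function name write_multistream, the
(page_id, title, xml_text) tuples and the in-memory index list are all
made up for illustration, and how the index is laid out on disk is not
shown here.

```python
import bz2

def write_multistream(pages, out, pages_per_stream=100):
    """Write pages as concatenated bz2 streams and return index entries.

    pages: iterable of (page_id, title, xml_text) tuples (hypothetical shape).
    out:   a binary file object opened for writing.
    Returns a list of (offset, page_id, title), where offset is the byte
    position in `out` at which the stream holding that page begins.
    """
    index = []
    batch = []

    def flush():
        if not batch:
            return
        offset = out.tell()  # start of the stream we are about to write
        data = "".join(text for _, _, text in batch).encode("utf-8")
        # bz2.compress produces one complete, self-contained bz2 stream
        out.write(bz2.compress(data))
        index.extend((offset, pid, title) for pid, title, _ in batch)
        batch.clear()

    for page in pages:
        batch.append(page)
        if len(batch) == pages_per_stream:
            flush()
    flush()  # last, possibly short, batch
    return index
```

Every page in the same batch gets the same offset, so a lookup by page id
or title gives you the stream to decompress, and you then pick the page
out of the (at most 100) pages in it.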
The nice thing about multiple streams is that essentially these behave
like separate bz2 files, so you can just seek to that point in the file,
pass the data starting from that byte directly into the bz2 decompressor
of your choice, and work with it. No need to monkey around with
bit-aligned crap, nor with fudging together a bz2 header to fool the
library into thinking it's looking at a full file, nor with tossing away
the crc at the end.
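
For example, with python's standard bz2 module the seek-and-decompress
step could look something like the sketch below. The name read_stream and
the chunk size are my own choices for illustration, and the offset is
assumed to be one taken from the index file.

```python
import bz2

def read_stream(path, offset, chunk_size=64 * 1024):
    """Decompress the single bz2 stream that starts at `offset` in `path`."""
    decomp = bz2.BZ2Decompressor()
    parts = []
    with open(path, "rb") as f:
        f.seek(offset)
        # BZ2Decompressor handles exactly one stream; eof turns True once
        # that stream's end marker has been read, so we stop there and
        # ignore whatever streams follow it in the file.
        while not decomp.eof:
            chunk = f.read(chunk_size)
            if not chunk:
                break  # file ended before the stream did
            parts.append(decomp.decompress(chunk))
    return b"".join(parts)
```

What comes back is the raw xml for the pages in that one stream, which is
a fragment and not a complete document, so wrap it or parse it page by
page as needed.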
The 100-pages-per-stream layout makes the output somewhat bigger than the
regular pages-articles file but not excessively so.
I'm hoping this format will be useful to folks working with offline
readers, hadoop or other analysis tools. Let me know.
A first run of these, for the December en wp pages-articles file, is
and the second run (assuming this works, I just deployed the changes to
the python scripts) will be generated in the usual way as part of the
regular dumps and found with them at the normal location.
If the job runs ok I expect to enable it on the other projects soon.