Some time ago I announced on this list a trial of bz2 multistream
files for en wikipedia. The generated index files turned out to have a
problem with the offsets, and due to various other tasks the trial
fell by the wayside.
That bug is now fixed, the September en wikipedia bz2 multistream index
file has been regenerated, and a little toy offline reader is now
available as a proof of concept for how one might work with these
files. A brief reminder about what the format does: it allows rough
random access to the XML page content, so a reader can get at a page
without decompressing the whole dump. For the code, see:
https://gerrit.wikimedia.org/r/gitweb?p=operations/dumps.git;a=tree;f=toys/…
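
For anyone who wants the idea without reading the toy reader, here is
a minimal Python sketch of the random access step. It is not the code
from the repository. It assumes the dump is a concatenation of
independent bz2 streams and that each index line looks like
"offset:pageid:title", where offset is the byte position at which the
stream holding that page starts. The function names are made up for
illustration.

```python
import bz2
import io


def parse_index_line(line):
    """Split one index line into (offset, page_id, title).

    Assumed layout: "offset:pageid:title". Titles may contain colons,
    so split on the first two colons only.
    """
    offset, page_id, title = line.rstrip("\n").split(":", 2)
    return int(offset), int(page_id), title


def find_offset(index_lines, wanted_title):
    """Return the stream offset for wanted_title, or None if not listed."""
    for line in index_lines:
        offset, _page_id, title = parse_index_line(line)
        if title == wanted_title:
            return offset
    return None


def read_stream(f, offset, chunk_size=64 * 1024):
    """Decompress the single bz2 stream starting at byte `offset` of `f`.

    BZ2Decompressor stops at the end of the first stream it sees, so
    whatever follows in the file is left alone.
    """
    f.seek(offset)
    decomp = bz2.BZ2Decompressor()
    pieces = []
    while not decomp.eof:
        chunk = f.read(chunk_size)
        if not chunk:
            break  # file ended before the stream did
        pieces.append(decomp.decompress(chunk))
    return b"".join(pieces)


if __name__ == "__main__":
    # Build a tiny two-stream file in memory to stand in for a real dump.
    first = bz2.compress(b"<page><title>A</title></page>\n")
    second = bz2.compress(b"<page><title>B</title></page>\n")
    dump = io.BytesIO(first + second)
    index = ["0:1:A\n", "%d:2:B\n" % len(first)]

    offset = find_offset(index, "B")
    print(read_stream(dump, offset).decode("utf-8"))
```

The access is "rough" in the sense that the reader gets back the whole
decompressed stream and still has to scan it for the page it wants.
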
If I don't hear how broken things are over the next few days, I expect
to enable generation of this format for all wiki projects shortly.
Ariel