Some time ago I announced a trial of bz2 multistream files for en wikipedia on this list. The generated index files turned out to have a problem with some of the offsets, and because of various other tasks the work fell by the wayside.
That bug is now fixed, the September en wikipedia bz2 multistream index file has been regenerated, and a little toy offline reader is now available as a proof of concept for how one might work with these files. A brief reminder about what the format does: it allows rough random access to the XML page content, because the index file gives offsets into the multistream file, so a reader can seek to the stream that holds a page instead of decompressing the whole dump. For the code, see: https://gerrit.wikimedia.org/r/gitweb?p=operations/dumps.git;a=tree;f=toys/b...
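As a rough illustration of that random access (this is not the toy reader linked above, just a minimal Python sketch), the snippet below seeks to a byte offset and decompresses exactly one bz2 stream from that point. It assumes each index line has the form offset:pageid:title and that each offset points at the start of a bz2 stream; the function names read_stream and parse_index_line are made up for this example.

```python
import bz2


def parse_index_line(line):
    """Split one index line into (offset, page_id, title).

    Assumes the line looks like "offset:pageid:title". Titles may
    themselves contain colons, so split at most twice.
    """
    offset, page_id, title = line.rstrip("\n").split(":", 2)
    return int(offset), int(page_id), title


def read_stream(path, offset, chunk_size=64 * 1024):
    """Decompress the single bz2 stream that starts at byte `offset`.

    Reading stops as soon as the decompressor reports the end of that
    stream, so the rest of the file is never decompressed.
    """
    decompressor = bz2.BZ2Decompressor()
    pieces = []
    with open(path, "rb") as f:
        f.seek(offset)
        while not decompressor.eof:
            chunk = f.read(chunk_size)
            if not chunk:
                break  # file ended before the stream did (truncated file)
            pieces.append(decompressor.decompress(chunk))
    return b"".join(pieces)
```

The result of read_stream is a fragment of XML holding whatever pages are in that stream, so a reader would still need to scan the fragment for the page id or title it wants. That is the "rough" part of the random access.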
If I don't hear how broken things are over the next few days, I expect to enable generation of this format for all wiki projects shortly.
Ariel