I'd like to move ahead with producing multistream files for *all* bz2 compressed output by March 1. So if you have strenuous objections, now is the time to weigh in; see https://phabricator.wikimedia.org/T239866 !

Even files which are not produced by compressing and concatenating other files will have the mediawiki/siteinfo header as one bz2 stream, the mediawiki close tag as another bz2 stream, and the body containing all page and revision content as a third stream. This will allow us to generate the pages-articles and pages-meta-current files from their parts 1-6 files in a matter of minutes, cutting many hours from the dump runs overall.
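For anyone curious why a tool might choke on these files: a bz2 multistream file is just several complete bz2 streams concatenated back to back, and a decompressor that stops at the first end-of-stream marker will silently return only the first stream. A minimal sketch of handling this correctly, using Python's stdlib bz2 module (this is illustrative only, not how the dump tooling itself is implemented):

```python
import bz2

def decompress_all_streams(data: bytes) -> bytes:
    """Decompress every bz2 stream in a concatenated (multistream) blob.

    A single BZ2Decompressor stops at the end of the first stream and
    leaves the remaining bytes in .unused_data; a tool that never checks
    unused_data will silently truncate a multistream dump.
    """
    out = []
    while data:
        d = bz2.BZ2Decompressor()
        out.append(d.decompress(data))
        data = d.unused_data  # bytes belonging to the next stream, if any
    return b"".join(out)

# Two streams concatenated, as in the multistream dump layout:
blob = bz2.compress(b"<header/>") + bz2.compress(b"<body/>")
print(decompress_all_streams(blob))
```

Note that `bz2.open()` / `BZ2File` in Python 3.3+ already loop over streams like this, so Python-based tools are typically fine; older wrappers around a single decompressor call are the ones at risk.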

Please check your tools using the files linked in the previous emails and make sure that they work.

Thanks!

Ariel

On Thu, Dec 5, 2019 at 12:01 AM Ariel Glenn WMF <ariel@wikimedia.org> wrote:
If you use one of the utilities listed here: https://phabricator.wikimedia.org/T239866
I'd like you to download one of the 'multistream' dumps and see whether your utility decompresses it fully or not (you can compare the md5sum of the decompressed content to that of the regular file's decompressed content and see if they are the same). Then note the results and the version of the utility on that task.

Alternatively, if you use some other utility to work with the bz2 files, please test using that, and add that on the task too.

Here are two files for download and comparison of decompressed content:

https://dumps.wikimedia.org/cewiki/20191201/cewiki-20191201-pages-articles.xml.bz2
and
https://dumps.wikimedia.org/cewiki/20191201/cewiki-20191201-pages-articles-multistream.xml.bz2

Both are around 50 megabytes.

Thank you in advance to whoever participates!

Ariel