I'd like to move ahead with producing multistream files for *all*
bz2-compressed output by March 1. So if you have strenuous objections, now
is the time to weigh in; see https://phabricator.wikimedia.org/T239866
Even files that are not produced by compressing and concatenating other
files will have the mediawiki/siteinfo header as one bz2 stream, the
closing mediawiki tag as another bz2 stream, and the body containing all
page and revision content as a third stream. This will allow us to generate
pages-articles and pages-meta-current files from their parts 1-6 files in a
matter of minutes, cutting many hours from the dump runs overall.
Please check your tools using the files linked in the previous emails and
make sure that they work.
On Thu, Dec 5, 2019 at 12:01 AM Ariel Glenn WMF <ariel(a)wikimedia.org> wrote:
If you use one of the utilities listed here:
I'd like you to download one of the 'multistream' dumps and see if your
utility decompresses it fully or not (you can compare the md5sum of the
decompressed content to the regular file's decompressed content and see if
they are the same). Then note the results and the version of the utility on
the task.
Alternatively, if you use some other utility to work with the bz2 files,
please test using that, and add that on the task too.
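The comparison described above can be sketched in Python. The filenames below are hypothetical stand-ins built locally so the sketch is self-contained; substitute your downloaded regular and multistream files. If a tool decompresses the multistream file fully, the md5sums of the decompressed content match.

```python
import bz2
import hashlib
import os
import tempfile

def md5_of_decompressed(path, chunk_size=1 << 20):
    """md5 of the fully decompressed content; bz2.open reads all streams."""
    digest = hashlib.md5()
    with bz2.open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Build small stand-in files (replace these with the real downloads).
content = b"<mediawiki>...page and revision content...</mediawiki>\n"
tmpdir = tempfile.mkdtemp()
regular = os.path.join(tmpdir, "pages-articles.xml.bz2")
multi = os.path.join(tmpdir, "pages-articles-multistream.xml.bz2")
with open(regular, "wb") as f:
    f.write(bz2.compress(content))          # one stream
with open(multi, "wb") as f:                # same content, two streams
    f.write(bz2.compress(content[:20]) + bz2.compress(content[20:]))

# Matching checksums mean the tool handled every stream.
assert md5_of_decompressed(regular) == md5_of_decompressed(multi)
```

A tool that stops after the first stream would produce a shorter output and a different checksum, which is exactly the failure mode worth reporting on the task.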
Here are two files for download and comparison of decompressed content:
Both are around 50 megabytes.
Thank you in advance to whoever participates!