I don't know if this issue came up already - in case it did and has been
dismissed, I beg your pardon. In case it didn't...
I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used
to compress the XML dumps instead of bzip2. Why? Because its sibling
(pbunzip2) has a bug that bunzip2 doesn't. :-)
Strange? Read on.
A few hours ago I filed a bug report for pbzip2 (see
https://bugs.launchpad.net/pbzip2/+bug/922804), together with some test
results obtained a few hours before that.
The results indicate that bzip2 and pbzip2 are mutually compatible:
each one can create archives the other one can read. But when it comes
to decompressing, only pbzip2-compressed archives work well with
pbunzip2.
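For anyone who wants to reproduce that check, a rough Python sketch
along these lines should do. It assumes bzip2 and pbzip2 are installed
and on the PATH; sample.xml is a made-up name for whatever test file
you use.

#!/usr/bin/env python3
# Round-trip each compressor against the other decompressor and compare
# the result with the original file.
import filecmp
import shutil
import subprocess

SAMPLE = "sample.xml"   # any reasonably large test file

def roundtrip(compressor, decompressor):
    shutil.copy(SAMPLE, "copy.xml")
    # compress in place -> copy.xml.bz2
    subprocess.run([compressor, "-z", "-f", "copy.xml"], check=True)
    # decompress with the other tool -> copy.xml
    subprocess.run([decompressor, "-d", "-f", "copy.xml.bz2"], check=True)
    ok = filecmp.cmp(SAMPLE, "copy.xml", shallow=False)
    print(compressor, "->", decompressor, ":", "OK" if ok else "MISMATCH")

roundtrip("bzip2", "pbzip2")
roundtrip("pbzip2", "bzip2")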
I propose compressing the archives with pbzip2 for the following
reasons:
1) If your archiving machines are SMP systems, this could lead to
better usage of system resources (i.e. faster compression).
2) Compression with pbzip2 is harmless for regular users of bunzip2,
so everything should keep working for these people as usual (see the
little sketch at the end of this mail).
3) pbzip2-compressed archives can be decompressed with pbunzip2 with a
speedup that scales nearly linearly with the number of CPUs in the
system.
So to sum up: it's a no-lose, two-win situation if you migrate to
pbzip2. And that just because pbunzip2 is slightly buggy. Isn't that
ironic?
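Regarding point 2: pbzip2 output is simply a series of concatenated
bz2 streams, and anything that copes with multi-stream .bz2 input -
bunzip2 does, and so does e.g. Python 3's bz2 module - reads it
transparently. A minimal sketch, with dump.xml.bz2 as a placeholder
name:

import bz2

# bz2.open() (like bunzip2) reads across stream boundaries, so a
# pbzip2-compressed dump behaves exactly like a bzip2-compressed one.
with bz2.open("dump.xml.bz2", "rt", encoding="utf-8") as f:
    for i, line in enumerate(f):
        print(line.rstrip())
        if i >= 4:          # peek at the first few lines only
            break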
The MediaWiki Core team has opened a discussion about getting more involved
in and maybe redoing the dumps infrastructure. A good starting point is to
understand how folks use the dumps already or want to use them but can't,
and some questions about that are listed here:
I've added some notes, but please go weigh in. Don't be shy about what
you do or what you need; this is the time to get it all on the table.
> I got stuck with an open source project which calls for enwiki-latest-pages-articles.xml.bz2,
> while I only have enwiki-latest-pages-articles-multistream.xml.bz2. My network connection is too
> bad to download another large file, so I wondered what the difference between these
> two files is. I have read the descriptions from https://dumps.wikimedia.org/ , however I am
> confused about the concept 'in multiple bz2 streams, 100 pages per stream'. Could
> anyone explain it to me? Thanks!
This file contains multiple bz2 streams - this means it is actually a
concatenation of multiple bz2 compressed files. The companion index
file lists the offsets of the individual streams within the big
multistream file. Just make sure you have both files for the same dump
version/date.
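If it helps, here is a rough Python sketch of how the index can be
used to pull a single stream out of the multistream dump. It assumes
the index file is enwiki-latest-pages-articles-multistream-index.txt.bz2
and that each index line has the form offset:page_id:title; please
check against your own copy, as those details are my assumption rather
than something stated above.

import bz2

DUMP = "enwiki-latest-pages-articles-multistream.xml.bz2"
INDEX = "enwiki-latest-pages-articles-multistream-index.txt.bz2"

def find_offset(title):
    # Each index line is "offset:page_id:page_title"; titles may contain
    # colons, hence maxsplit=2.
    with bz2.open(INDEX, "rt", encoding="utf-8") as idx:
        for line in idx:
            offset, page_id, page_title = line.rstrip("\n").split(":", 2)
            if page_title == title:
                return int(offset)
    raise KeyError(title)

def read_stream(offset):
    # Seek to the start of one bz2 stream and decompress just that stream;
    # the decompressor stops at the end of the stream, so you get back only
    # the ~100 pages it contains, not the whole dump.
    with open(DUMP, "rb") as f:
        f.seek(offset)
        decomp = bz2.BZ2Decompressor()
        chunks = []
        while not decomp.eof:
            data = f.read(64 * 1024)
            if not data:
                break
            chunks.append(decomp.decompress(data))
        return b"".join(chunks).decode("utf-8")

print(read_stream(find_offset("Python (programming language)"))[:500])

The point of the multistream layout is exactly this: you only ever
decompress the one small stream you need instead of the entire
multi-gigabyte file.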
Wikivoyage data dumps used to be generated every 2 weeks, but since July
2014 the frequency has dropped dramatically. Currently, the latest
dump is from 42 days ago.
This is disappointing, as some users spend a lot of time improving
content in preparation for their next trip, hoping to use it with
Wikivoyage offline browsers like Kiwix, only to find out that the data
is very outdated.
The Wikivoyage community loves data dumps, and reuses them in a lot of
applications, for instance offline guides, GPS navigation apps, interactive
maps, and data validation.
Could the data dumps please be generated about every 2 weeks like before?
Thanks a lot for your consideration!