I don't know if this issue came up already - in case it did and has been
dismissed, I beg your pardon. In case it didn't...
I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used
to compress the XML dumps instead of bzip2. Why? Because its sibling
(pbunzip2) has a bug that bunzip2 doesn't. :-)
Strange? Read on.
A few hours ago, I filed a bug report for pbzip2 (see
https://bugs.launchpad.net/pbzip2/+bug/922804), together with some test
results obtained a few hours before that.
The results indicate that:
bzip2 and pbzip2 are mutually compatible for compression: each one can
create archives the other can read. But when it comes to uncompressing,
only pbzip2-compressed archives work reliably with pbunzip2.
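The likely reason for the asymmetry: pbzip2 writes its output as many
concatenated bzip2 streams (one per compression block), whereas bzip2
writes a single stream. A minimal Java sketch of the difference, assuming
Apache Commons Compress is on the classpath (the file name is just an
example):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;

public class Bz2StreamCheck {
    // Decompressed size of a .bz2 file. With decompressConcatenated=true
    // the reader keeps going across the stream boundaries that pbzip2
    // writes, so single-stream (bzip2) and multi-stream (pbzip2) files
    // both decompress in full.
    static long size(String path, boolean concatenated) throws IOException {
        try (InputStream in = new BufferedInputStream(new FileInputStream(path));
             InputStream bz2 = new BZip2CompressorInputStream(in, concatenated)) {
            byte[] buf = new byte[8192];
            long total = 0;
            for (int n; (n = bz2.read(buf)) != -1; ) total += n;
            return total;
        }
    }

    public static void main(String[] args) throws IOException {
        String path = "enwiki-pages-articles.xml.bz2";  // example name only
        // If the two numbers differ, the file is multi-stream (pbzip2-made).
        System.out.println("first stream: " + size(path, false));
        System.out.println("all streams:  " + size(path, true));
    }
}

A reader that stops after the first stream would silently truncate a
pbzip2 archive; the decompressConcatenated flag is what makes it keep going.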
I propose compressing the archives with pbzip2 for the following reasons:
1) If your archiving machines are SMP systems, this could lead to
better use of system resources (i.e. faster compression).
2) Compression with pbzip2 is harmless for regular users of bunzip2,
so everything should run as usual for those people.
3) pbzip2-compressed archives can be uncompressed with pbunzip2 with a
speedup that scales nearly linearly with the number of CPUs in the
system.
So to sum up: it's a no-lose, two-win situation if you migrate to
pbzip2. And that just because pbunzip2 is slightly buggy. Isn't that
strange?
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com       Geschäftsführer: Richard Jelinek
Human Language Technology Experts    Sitz der Gesellschaft: Fürth
69216618 Mind Units                  Registergericht: AG Fürth, HRB-9201
I am currently planning to process the latest French dump. I would like to
ask if somebody has already found or used a good OpenNLP French sentence
detection model. If yes, please let me know where to find one.
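In case it helps anyone answering: the detection step itself is short once
a model exists. A minimal sketch with the standard OpenNLP API, where
fr-sent.bin is a placeholder for whatever model file turns up:

import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

public class FrenchSentences {
    public static void main(String[] args) throws Exception {
        // "fr-sent.bin" is a placeholder: point it at whichever French
        // sentence model you find.
        try (InputStream modelIn = new FileInputStream("fr-sent.bin")) {
            SentenceModel model = new SentenceModel(modelIn);
            SentenceDetectorME detector = new SentenceDetectorME(model);
            // Abbreviations like "M." are the hard case a good model
            // should handle without splitting.
            String text = "M. Dupont est arrivé. Il repart demain.";
            for (String sentence : detector.sentDetect(text)) {
                System.out.println(sentence);
            }
        }
    }
}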
Thanks in advance,
Dear List Members,
Does anyone know if the Wikimedia Foundation plans to support Metalink
or SPDY for its dump files and/or image files? See the RFC references
in the quoted message below.
WP-MIRROR downloads dump and image files to build a mirror of a set of
wikipedias. WP-MIRROR 0.5 is feature complete. I am now looking for
ways to optimize performance (i.e. reduce mirror build time). Were
the WMF to support the above two protocols, downloads would be faster
and require less time spent on validation.
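For context on the validation cost: today a client has to hash each
downloaded file in full and compare against the published checksums,
whereas Metalink carries the hashes in the download description. A
baseline sketch of that whole-file check in Java (the file name is just
an example):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;

public class DumpChecksum {
    public static void main(String[] args) throws Exception {
        // Example file name; any downloaded dump file works.
        String path = "frwiki-latest-pages-articles.xml.bz2";
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            byte[] buf = new byte[1 << 16];
            for (int n; (n = in.read(buf)) != -1; ) {
                md5.update(buf, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md5.digest()) {
            hex.append(String.format("%02x", b));
        }
        // Same "<hash>  <file>" format as the published md5sums files.
        System.out.println(hex + "  " + path);
    }
}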
On 12/29/12, Sumana Harihareswara <sumanah(a)wikimedia.org> wrote:
> Hello! I'm sorry, but I don't know the answer to these questions;
> perhaps you could email the dumps mailing list
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l ? My
> Sumana Harihareswara
> Engineering Community Manager
> Wikimedia Foundation
> On Sun, Dec 16, 2012 at 6:14 AM, wp mirror <wpmirrordev(a)gmail.com> wrote:
>> Dear Sumana,
>> 1) Metalink. Does the Wikimedia Foundation have any plans to support
>> metalink for either its dump files or its image files?
>> <http://tools.ietf.org/html/rfc5854>, "The Metalink Download Description Format"
>> <http://tools.ietf.org/html/rfc6249>, "Metalink/HTTP: Mirrors and Hashes"
>> 2) SPDY. Does the Wikimedia Foundation have any plans to support SPDY?
>> Documentation: <http://www.chromium.org/spdy>
>> 3) WP-MIRROR. We last communicated 2012-01-06 in regards to WP-MIRROR.
>> Status: WP-MIRROR 0.5 is `feature complete', and works
>> `out-of-the-box' for the GNU/Linux distributions: Debian 7.0 (wheezy)
>> and Ubuntu 12.10 (quantal).
>> Future: Attention is turning towards performance enhancement and
>> porting to other distributions.
>> Homepage: <http://www.nongnu.org/wp-mirror/>
>> Please give it a try. Feedback is most welcome.
>> Sincerely Yours,
I had an email exchange with one of the folks at our mirror sites about
the low volume of traffic they are getting. Clearly we need to
publicize this list better, bearing in mind that files on our mirrors
may be a day behind the live site. I wouldn't think that a day's delay
is very important in the grand scheme of things though.
So I'm looking for suggestions on how best to make the list of mirrors
visible to dumps users/downloaders. This includes changes to  and
 among other things. Bear in mind that 'best' also implies 'easy to
do' or 'here is a patch' :-D
(download page for all dumps, showing each dump in order of completion)
(download page for a given dump)
Snapshot1, which was running several dumps for 'big' wikis, fell over
due to swapdeath today. While we investigate the issue, those jobs will
be stalled. I'll send an update as soon as we have more info.