I don't know whether this issue has come up already - in case it did and was
dismissed, I beg your pardon. In case it didn't...
I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used
to compress the XML dumps instead of bzip2. Why? Because its sibling
(pbunzip2) has a bug that bunzip2 doesn't. :-)
Strange? Read on.
A few hours ago, I filed a bug report for pbzip2 (see
https://bugs.launchpad.net/pbzip2/+bug/922804), along with some test
results obtained a few hours before that.
The results indicate that bzip2 and pbzip2 are mutually compatible:
each one can read archives the other creates. But when it comes to
decompression, only pbzip2-compressed archives work properly with
pbunzip2.
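The compatibility claim can be illustrated with Python's bz2 module: pbzip2 compresses its input in independent blocks and emits one bzip2 stream per block, concatenated, and standards-conforming decompressors (bunzip2 included) read concatenated streams back to back. A minimal sketch that mimics that layout without using pbzip2 itself:

```python
import bz2

# pbzip2 compresses the input in independent blocks and concatenates the
# resulting bzip2 streams. Mimic that with two separately compressed parts.
part1 = bz2.compress(b"Hello, ")
part2 = bz2.compress(b"dump users!")
multi_stream = part1 + part2  # shaped like a pbzip2 archive

# A multi-stream-aware decompressor reads all concatenated streams in turn.
print(bz2.decompress(multi_stream))  # b'Hello, dump users!'
```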
I propose compressing the archives with pbzip2 for the following reasons:
1) If your archiving machines are SMP systems, this could lead to
better use of system resources (i.e. faster compression).
2) Compression with pbzip2 is harmless for regular users of bunzip2,
so everything should run for these people as usual.
3) pbzip2-compressed archives can be decompressed with pbunzip2 with a
speedup that scales nearly linearly with the number of CPUs in the
machine.
So to sum up: it's a no-lose, two-win situation if you migrate to
pbzip2. And all that just because pbunzip2 is slightly buggy. Isn't
that ironic?
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com
Human Language Technology Experts
69216618 Mind Units
Managing Director: Richard Jelinek
Registered office: Fürth
Register court: AG Fürth, HRB-9201
There is a problem with the extracted page abstracts for Yahoo on the
big wikis moved to the new infrastructure. During generation everything
seemed to be fine, but it ended up as a 159 KB file.
Another question: why is this step not parallelized?
Hello XML dump users,
The workers that handle the rest of the wikis will finish what they are
working on (whichever runs are in progress now) and then will terminate.
When all are stopped I will be starting up these processes on hosts in
the other data center.
I'll be running 6 workers at a time, which should put us in the time
frame of 8-9 days between runs per wiki. We'll see how that goes.
I was wondering about the order/sorting of revisions inside the pages-meta-history dumps, especially with respect to namespaces. Does the order of revisions in the dumps take namespaces into account (e.g. are revisions from the Template namespace located towards the end of the dump?), or is the order bound to some other parameter which potentially influences the location of revisions from certain namespaces?
I'm currently processing the (March 2013) dewiki dump.
Doctoral Researcher | IT Administration
Ubiquitous Knowledge Processing (UKP Lab)
FB 20 Computer Science Department
Technische Universität Darmstadt
Hochschulstr. 10, D-64289 Darmstadt, Germany
phone: [+49] (0)6151 16-6227, fax: -5455, room: S2/02/B111
Web Research at TU Darmstadt (WeRC)
I've imported the data dumps into MySQL, and after running some queries I've
noticed that the column "rev_len" in the revision table is empty and
"page_len" in the page table is 0 for every row. Can anybody tell me
anything about this? I was really hoping to use this information.
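Until those columns are populated, one workaround is to compute the lengths from the dump text yourself: rev_len and page_len store the byte length of the revision text. A hedged sketch (the XML fragment below is a simplified stand-in; real dumps use the MediaWiki export schema with an XML namespace):

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for a pages-articles dump fragment (the real export
# format wraps these tags in the MediaWiki XML namespace).
SAMPLE = """<mediawiki>
  <page>
    <title>Example</title>
    <revision><text>Some wikitext here.</text></revision>
  </page>
</mediawiki>"""

root = ET.fromstring(SAMPLE)
for page in root.iter("page"):
    title = page.findtext("title")
    text = page.findtext("./revision/text") or ""
    # rev_len / page_len hold the length of the text in bytes (UTF-8).
    print(title, len(text.encode("utf-8")))  # Example 19
```

For a full dump you would stream with ET.iterparse rather than load the whole file, but the length computation is the same.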
Mail: 1925, 4th Street South
4-101 Hanson Hall
Minneapolis, MN 55455-0462
Email: vinya002(a)umn.edu
Office: 3-157 Hanson Hall
Phone: +1 6126257837
Hi, is there a file somewhere with a list of all namespace names and
numbers for all the Wikimedia wikis? The list should at least have the
canonical names, but preferably also any aliases. This is useful for
finding out which namespaces interwiki links and language links go to.
The canonical names are in the siteinfo section of the XML dumps of each
wiki, but not the aliases - and it is not practical to download complete
dumps for all projects just to get namespace names.
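One alternative to the dumps is the MediaWiki web API, which exposes both namespace names and aliases per wiki via action=query with meta=siteinfo. A sketch (the helper names are mine; the API parameters and response keys are the standard format=json ones, and you would loop over the site matrix to cover every project):

```python
import json
import urllib.request

def parse_namespaces(siteinfo):
    """Map namespace id -> list of names (local name first, then aliases)."""
    query = siteinfo["query"]
    names = {int(k): [v["*"]] for k, v in query["namespaces"].items()}
    for alias in query.get("namespacealiases", []):
        names[alias["id"]].append(alias["*"])
    return names

def fetch_namespaces(host):
    # One small request per wiki instead of a full dump download.
    url = (f"https://{host}/w/api.php?action=query&meta=siteinfo"
           "&siprop=namespaces%7Cnamespacealiases&format=json")
    with urllib.request.urlopen(url) as resp:
        return parse_namespaces(json.load(resp))

# Example (requires network):
#   fetch_namespaces("en.wikipedia.org")[10] -> ['Template', ...aliases]
```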
The following wikis are now running out of our datacenter in Ashburn:
eswiki, ptwiki, plwiki, ruwiki, jawiki, dewiki, frwiki, nlwiki, itwiki
Two worker processes handle these but each wiki is dumped in 4 parallel
jobs. This means that we'll be back to shorter and more frequent run
times for them all. A reminder that we will not be recombining the full
history dumps into one file, as that would undercut the speed gains we
make by parallelizing.
As with the enwiki move, you won't see up-to-the-minute updates for
these wikis on the HTML page, since the data must be synced over before
updates can show up.
In a few days we'll start working on the move of the rest of the wikis.