I don't know whether this issue has come up already; in case it did and was
dismissed, I beg your pardon. In case it didn't...
I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used
to compress the XML dumps instead of bzip2. Why? Because its sibling
(pbunzip2) has a bug that bunzip2 doesn't have. :-)
Strange? Read on.
A few hours ago, I filed a bug report for pbzip2 (see
https://bugs.launchpad.net/pbzip2/+bug/922804), together with some test
results obtained a few hours before that.
The results indicate that bzip2 and pbzip2 are mutually compatible:
each one can create archives that the other one can read. When it comes
to decompressing with pbunzip2, however, only pbzip2-compressed archives
work.
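The cross-compatibility check described above can be sketched as a quick shell test. This is a minimal illustration assuming only stock bzip2 is installed; the pbzip2 side is shown commented out for machines that have it, and the file names are made up for the example:

```shell
# Create a small test file and compress it with plain bzip2.
printf 'dump data\n' > sample.txt
bzip2 -k sample.txt                 # keeps sample.txt, writes sample.txt.bz2

# With pbzip2 installed, the reverse direction could be tried as well:
#   pbzip2 -d -k -c sample.txt.bz2 > from_pbunzip2.txt

# Decompress with stock bunzip2 and verify the round trip.
bunzip2 -c sample.txt.bz2 > roundtrip.txt
cmp sample.txt roundtrip.txt && echo OK
```

The same round trip run the other way (compress with pbzip2, decompress with bunzip2) is what makes the migration harmless for regular bunzip2 users.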
I propose compressing the archives with pbzip2 for the following reasons:
1) If your archiving machines are SMP systems, this could lead to
better use of system resources (i.e. faster compression).
2) Compression with pbzip2 is harmless for regular users of bunzip2,
so everything should run for these people as usual.
3) pbzip2-compressed archives can be decompressed with pbunzip2 with a
speedup that scales nearly linearly with the number of CPUs in the
system.
So to sum up: it's a no-lose, two-win situation if you migrate to
pbzip2. And that just because pbunzip2 is slightly buggy. Isn't that
ironic?
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com
Human Language Technology Experts
69216618 Mind Units
Managing Director: Richard Jelinek
Registered office: Fürth
Registry court: AG Fürth, HRB-9201
The listing of dumps shows several recent aborted dumps. This is worrying
as I am heavily committed to a program of new-entry creation to eliminate
redlinks that depends on enwikt's dump. There seems to have been little
progress in clearing up the problem. When might a smooth, predictable flow
of dumps resume?
I have begun to make use of the Incremental XML Data Dumps, and have a few
questions and observations.
For brevity, I shall coin two terms:
xdump - XML Data Dump
xincr - Incremental XML Data Dump
The checksum files for the `xincr's are not formatted correctly, causing
`md5sum' to throw an error (`md5sum --check' expects lines of the form
`<hash>  <filename>'):
(shell)$ cat simplewiki-20140703-md5sums.txt
(shell)$ md5sum --check simplewiki-20140703-md5sums.txt
md5sum: simplewiki-20140703-md5sums.txt: no properly formatted MD5 checksum
lines found
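For reference, a checksum file that `md5sum --check' accepts has one line per file: the hash, two spaces, then the file name. A minimal illustration with a made-up file name:

```shell
# Create a small file and generate a correctly formatted checksum file.
printf 'hello\n' > data.txt
md5sum data.txt > md5sums.txt       # lines look like: <hash>  data.txt
cat md5sums.txt

# Verify: each checked file is reported as "<name>: OK".
md5sum --check md5sums.txt          # prints: data.txt: OK
```

If the `xincr' checksum files were written in this layout, the stock `md5sum --check' invocation would work unmodified.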
Since no incremental SQL files are provided, I cannot use `mwxml2sql' and
must instead use `importDump.php'. However, I have encountered a few issues
when using `importDump.php' on `xincr's.
2.1) Speed: Importation proceeds at less than 0.1 pages/sec. This means
that, for the largest wikis (commonswiki, enwiki, wikidatawiki) importation
cannot be completed before the `xincr' for the next day is posted.
2.2) Pauses: Normally, when running `top', I can see at least one CPU at
near 100% for `php' and `mysql'. However, sometimes importation pauses for
several minutes, with no apparent CPU or disk activity. I assume that there
is a time-out somewhere that allows importation to proceed again. Any
comments on this phenomenon would be most welcome.
2.3) Fails: Sometimes importation fails. I see this often with the `xincr's
from `betawikiversity'. I have not yet isolated specific records that cause
failure. But it raises the question: Is `importDump.php' still supported?
Can you please advise as to the best method for importing `xincr's?
Is there another importation tool that you would recommend (one that is
both supported and fast)?
I recently came across several articles from the English Wikipedia for which the (page)ids in the dumps apparently have changed over time. E.g., in a dump from April 2011 as well as in another one from January 2012, the article "Marseille" had the id 71486, while the article with the same name currently (according to the MediaWiki API as well as in the May 2014 dump) has the id 40888948. Does anybody have an idea how this might have happened and whether this is a frequent phenomenon?
Doctoral Researcher | IT Administration
Ubiquitous Knowledge Processing (UKP Lab)
FB 20 Computer Science Department
Technische Universität Darmstadt Hochschulstr. 10, D-64289 Darmstadt, Germany
phone: [+49] (0)6151 16-6227, fax: -5455, room: S2/02/B111
Web Research at TU Darmstadt (WeRC)
Folks will have noticed that the dumps index.html page generated by the
monitor has changed. A bunch of content has been added, at the request
of the legal team, and the css has changed, stealing from the static
html page above it. If the new font sizes or whatever are too hard on
the eyes or you want to tweak the layout a bit to make it more readable,
see the file in this gerrit change:
and submit a patchset to puppet. Changes merged in the repository will
take effect by the next puppet run, i.e. in about half an hour.