Hi,
I don't know if this issue has come up already - in case it did and
was dismissed, I beg your pardon. In case it didn't...
I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used
to compress the xml dumps instead of bzip2. Why? Because its sibling
(pbunzip2) has a bug that bunzip2 doesn't have. :-)
Strange? Read on.
A few hours ago, I filed a bug report for pbzip2 (see
https://bugs.launchpad.net/pbzip2/+bug/922804), together with some test
results gathered a few hours before that.
The results indicate that bzip2 and pbzip2 are mutually compatible for
compression: each one can create archives the other one can read. When
it comes to decompressing, however, only pbzip2-compressed archives are
good for pbunzip2.
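For anyone who wants to reproduce the asymmetry, a minimal test along
these lines should do it (the file name is a placeholder; the second
step is the direction where pbunzip2 misbehaves):
(shell)$ bzip2 testfile               # compress with plain bzip2
(shell)$ pbzip2 -d -k testfile.bz2    # decompress with pbzip2/pbunzip2 - the buggy direction
(shell)$ rm testfile.bz2
(shell)$ pbzip2 testfile              # now compress with pbzip2 instead
(shell)$ bunzip2 testfile.bz2         # plain bunzip2 reads this as usual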
I propose compressing the archives with pbzip2 for the following
reasons:
1) If your archiving machines are SMP systems, this could lead to
better usage of system resources (i.e. faster compression) - see the
example after this list.
2) Compression with pbzip2 is harmless for regular users of bunzip2,
so everything should run as usual for these people.
3) pbzip2-compressed archives can be uncompressed with pbunzip2 with a
speedup that scales nearly linearly with the number of CPUs in the
host.
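In practice this needs nothing more than something like the following
on the archiving machines (the processor count, block size and file
name are only examples):
(shell)$ pbzip2 -p8 -b9 somewiki-pages-articles.xml       # compress on 8 CPUs, 900k blocks
(shell)$ pbzip2 -d -p8 somewiki-pages-articles.xml.bz2    # parallel decompression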
So to sum up: it's a no-lose, two-win situation if you migrate to
pbzip2. And all that just because pbunzip2 is slightly buggy. Isn't
that interesting? :-)
cheers,
--
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com
Human Language Technology Experts
69216618 Mind Units
Geschäftsführer: Richard Jelinek
Sitz der Gesellschaft: Fürth
Registergericht: AG Fürth, HRB-9201
The listing of dumps shows several recent aborted dumps. This is
worrying, as I am heavily committed to a program of new-entry creation
to eliminate redlinks, which depends on enwikt's dump. There seems to
have been little progress in clearing up the problem. When might a
smooth, predictable flow of dumps resume?
Dear Ariel,
I have begun to make use of the Incremental XML Data Dumps, and have a few
questions.
0) Acronyms
For brevity, I shall coin two terms:
xdump - XML Data Dump
xincr - Incremental XML Data Dump
1) Checksums
The checksum files for the `xincr's are not formatted correctly, causing
`md5sum' to throw an error. The correct format is:
<checksum><two spaces><filename><newline>
(shell)$ cat simplewiki-20140703-md5sums.txt
d03f3a91ef0273eb814f39a1d13788cb
c51f2bd5ef6bd42ce65cf4a7fca72400
(shell)$ md5sum --check simplewiki-20140703-md5sums.txt
md5sum: simplewiki-20140703-md5sums.txt: no properly formatted MD5 checksum
lines found
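Until the files are fixed server-side, a workaround is to splice the
file names back in. Assuming the checksums appear in the same order as
the corresponding dump files, and given a list of those files
(`files.txt' is hypothetical here):
(shell)$ paste simplewiki-20140703-md5sums.txt files.txt | awk '{print $1"  "$2}' > fixed-md5sums.txt
(shell)$ md5sum --check fixed-md5sums.txt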
2) maintenance/importDump.php
Since no incremental SQL files are provided, I cannot use `mwxml2sql' and
must instead use `importDump.php'. However, I have encountered a few issues
when using `importDump.php' on `xincr's.
2.1) Speed: Importation proceeds at less than 0.1 pages/sec. This means
that, for the largest wikis (commonswiki, enwiki, wikidatawiki),
importation cannot be completed before the `xincr' for the next day is
posted. (The invocation I use is sketched after 2.3.)
2.2) Pauses: Normally, when running `top', I can see at least one CPU at
near 100% for `php' and `mysql'. However, sometimes importation pauses for
several minutes, with no apparent CPU or disk activity. I assume that there
is a time-out somewhere that allows importation to proceed again. Any
comments on this phenomenon would be most welcome.
2.3) Fails: Sometimes importation fails. I see this often with the `xincr's
from `betawikiversity'. I have not yet isolated specific records that cause
failure. But it raises the question: Is `importDump.php' still supported?
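For reference (as mentioned under 2.1), this is roughly how I invoke the
import. I have been experimenting with `--no-updates' to skip the
link-table updates during import and rebuild them afterwards, which
helps somewhat with speed (the dump file name is only an example):
(shell)$ bzcat simplewiki-20140703-pages-meta-hist-incr.xml.bz2 | \
         php maintenance/importDump.php --no-updates --report=1000
(shell)$ php maintenance/rebuildrecentchanges.php
(shell)$ php maintenance/rebuildall.php    # rebuilds the link tables skipped above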
3) Tools
Can you please advise as to the best method for importing `xincr's?
Is there another importation tool that you would recommend (one that is
both supported and fast)?
Sincerely Yours,
Kent
Hi,
I recently came across several articles from the English Wikipedia for which the (page)ids in the dumps apparently have changed over time. E.g., in a dump from April 2011 as well as in another one from January 2012, the article "Marseille" had the id 71486, while the article with the same name currently (according to the MediaWiki API as well as in the May 2014 dump) has the id 40888948. Does anybody have an idea how this might have happened and whether this is a frequent phenomenon?
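For reference, the current id can be checked directly against the API:
(shell)$ curl 'https://en.wikipedia.org/w/api.php?action=query&titles=Marseille&format=json'
which reports pageid 40888948 for that title, matching the May 2014 dump.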
Thanks,
Johannes
---
Johannes Daxenberger
Doctoral Researcher | IT Administration
Ubiquitous Knowledge Processing (UKP Lab)
FB 20 Computer Science Department
Technische Universität Darmstadt
Hochschulstr. 10, D-64289 Darmstadt, Germany
email: daxenberger(at)ukp.informatik.tu-darmstadt.de
phone: [+49] (0)6151 16-6227, fax: -5455, room: S2/02/B111
www.ukp.tu-darmstadt.de
Web Research at TU Darmstadt (WeRC)
www.werc.tu-darmstadt.de
Folks will have noticed that the dumps index.html page generated by the
monitor has changed. A bunch of content has been added, at the request
of the legal team, and the css has changed, stealing from the static
html page above it. If the new font sizes or whatever are too hard on
the eyes or you want to tweak the layout a bit to make it more readable,
see the file in this gerrit change:
https://gerrit.wikimedia.org/r/#/c/143645/
and submit a patchset to puppet. Changes merged in the repository will
take effect by the next puppet run, i.e. in about half an hour.
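(If you have not submitted one before, the usual flow is roughly the
following; the commit message is just an example:)
(shell)$ git clone https://gerrit.wikimedia.org/r/operations/puppet && cd puppet
(shell)$ # edit the file referenced in the change above, then:
(shell)$ git commit -a -m 'dumps: tweak index.html layout'
(shell)$ git review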
Ariel