Dear Ariel,
I have begun to make use of the Incremental XML Data Dumps, and have a few
questions.
0) Acronyms
For brevity, I shall coin two terms:
xdump - XML Data Dump
xincr - Incremental XML Data Dump
1) Checksums
The checksum files for the `xincr's are not formatted correctly: each line
carries only the checksum, with no file name, which causes `md5sum' to
report an error. The correct format is:
<checksum><two spaces><filename><newline>
(shell)$ cat simplewiki-20140703-md5sums.txt
d03f3a91ef0273eb814f39a1d13788cb
c51f2bd5ef6bd42ce65cf4a7fca72400
(shell)$ md5sum --check simplewiki-20140703-md5sums.txt
md5sum: simplewiki-20140703-md5sums.txt: no properly formatted MD5 checksum lines found
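For illustration, here is a minimal sketch of how a checksum file in the
expected format could be generated and then verified. The file names
`example-pages.xml' and `example-md5sums.txt' are hypothetical placeholders,
not actual dump file names, and I am assuming GNU `md5sum'.

```shell
#!/bin/sh
# Sketch only: build a checksum file in the layout `md5sum --check' accepts.
# All file names here are hypothetical placeholders.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical stand-in for a downloaded dump file.
printf 'example\n' > example-pages.xml

# md5sum itself writes "<checksum><two spaces><filename>", one entry per line.
md5sum example-pages.xml > example-md5sums.txt

# A file in that layout can be verified without a formatting error.
md5sum --check example-md5sums.txt
```

With a file in this layout, the final command should report
`example-pages.xml: OK' rather than a formatting error.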
2) maintenance/importDump.php
Since no incremental SQL files are provided, I cannot use `mwxml2sql' and
must instead use `importDump.php'. However, I have encountered a few issues
when using `importDump.php' on `xincr's.
2.1) Speed: Importation proceeds at less than 0.1 pages/sec. This means
that, for the largest wikis (commonswiki, enwiki, wikidatawiki) importation
cannot be completed before the `xincr' for the next day is posted.
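To make the arithmetic explicit, here is a rough sketch of the daily ceiling
implied by that rate. The 10 seconds per page is simply the reciprocal of
0.1 pages/sec, so the true figure is lower; the variable names are my own.

```shell
#!/bin/sh
# At 0.1 pages/sec, each page takes 10 seconds (observed rate is even slower).
seconds_per_page=10
seconds_per_day=$(( 24 * 60 * 60 ))

# Upper bound on pages that can be imported before the next day's `xincr'.
max_pages_per_day=$(( seconds_per_day / seconds_per_page ))
echo "$max_pages_per_day"
```

That is, any `xincr' containing more than 8640 pages cannot be imported
within a day at this rate.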
2.2) Pauses: Normally, when running `top', I can see at least one CPU at
near 100% for `php' and `mysql'. However, sometimes importation pauses for
several minutes, with no apparent CPU or disk activity. I assume that there
is a time-out somewhere that allows importation to proceed again. Any
comments on this phenomenon would be most welcome.
2.3) Fails: Sometimes importation fails. I see this often with the `xincr's
from `betawikiversity'. I have not yet isolated specific records that cause
failure. But it raises the question: Is `importDump.php' still supported?
3) Tools
Can you please advise as to the best method for importing `xincr's?
Is there another importation tool that you would recommend (one that is
both supported and fast)?
Sincerely yours,
Kent