Dear Ariel,
I have begun to make use of the Incremental XML Data Dumps, and have a few questions.
0) Acronyms
For brevity, I shall coin two terms:
xdump - XML Data Dump
xincr - Incremental XML Data Dump
1) Checksums
The checksum files for the `xincr's are not formatted correctly, causing `md5sum' to throw an error. The correct format is:
<checksum><two spaces><filename><newline>
(shell)$ cat simplewiki-20140703-md5sums.txt
d03f3a91ef0273eb814f39a1d13788cb
c51f2bd5ef6bd42ce65cf4a7fca72400
(shell)$ md5sum --check simplewiki-20140703-md5sums.txt
md5sum: simplewiki-20140703-md5sums.txt: no properly formatted MD5 checksum lines found
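To make the format problem concrete, here is a minimal Python sketch that flags lines `md5sum --check' would not recognise. It only tests for the `<checksum><two spaces><filename>' form given above (plus the space-asterisk variant used for binary mode). The names `is_md5sum_line' and `malformed_lines' are my own, and the regular expression is an approximation of the expected layout, not the parser that `md5sum' itself uses.

```python
import re

# Assumed pattern for one checksum line: 32 hex digits, a space, then either
# a second space (text mode) or '*' (binary mode), then a non-empty filename.
MD5SUM_LINE = re.compile(r"^[0-9a-fA-F]{32} [ *].+$")


def is_md5sum_line(line: str) -> bool:
    """Return True if `line` looks like '<checksum><two spaces><filename>'."""
    return MD5SUM_LINE.match(line.rstrip("\n")) is not None


def malformed_lines(text: str) -> list[str]:
    """Return the non-empty lines of `text` that do not match the pattern."""
    return [ln for ln in text.splitlines() if ln.strip() and not is_md5sum_line(ln)]


if __name__ == "__main__":
    # The two lines from the transcript above: checksums with no filename.
    sample = (
        "d03f3a91ef0273eb814f39a1d13788cb\n"
        "c51f2bd5ef6bd42ce65cf4a7fca72400\n"
    )
    for bad in malformed_lines(sample):
        print("malformed:", bad)
```

Both lines of the posted file are reported as malformed because the filename field is missing.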
2) maintenance/importDump.php
Because no incremental SQL files are provided, I cannot use `mwxml2sql' and must instead use `importDump.php'. However, I have encountered a few issues when using `importDump.php' on `xincr's.
2.1) Speed: Importation proceeds at less than 0.1 pages/sec. This means that, for the largest wikis (commonswiki, enwiki, wikidatawiki) importation cannot be completed before the `xincr' for the next day is posted.
2.2) Pauses: Normally, when running `top', I can see at least one CPU at near 100% for `php' and `mysql'. However, sometimes importation pauses for several minutes, with no apparent CPU or disk activity. I assume that there is a time-out somewhere that allows importation to proceed again. Any comments on this phenomenon would be most welcome.
2.3) Fails: Sometimes importation fails. I see this often with the `xincr's from `betawikiversity'. I have not yet isolated specific records that cause failure. But it raises the question: Is `importDump.php' still supported?
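To put the rate in 2.1 into numbers: at 0.1 pages/sec, a full day of importing covers at most 8,640 pages, so any `xincr' with more pages than that cannot finish before the next one is posted. The sketch below does this arithmetic in Python. The function names are my own, and the page count in the usage example is a made-up figure for illustration, not a measurement of any actual `xincr'.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def max_pages_per_day(pages_per_sec: float) -> float:
    """Pages that can be imported in 24 hours at a constant rate."""
    return pages_per_sec * SECONDS_PER_DAY


def import_days(page_count: int, pages_per_sec: float) -> float:
    """Days needed to import `page_count` pages at a constant rate."""
    return page_count / max_pages_per_day(pages_per_sec)


if __name__ == "__main__":
    rate = 0.1  # pages/sec, the upper bound reported in 2.1
    print("pages per day:", max_pages_per_day(rate))

    hypothetical_pages = 20_000  # made-up page count, not a real xincr
    print("days needed:", import_days(hypothetical_pages, rate))
```

Any result above 1.0 from `import_days' means the import falls behind the daily posting schedule.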
3) Tools
Can you please advise as to the best method for importing `xincr's? Is there another importation tool that you would recommend (one that is both supported and fast)?
Sincerely yours,
Kent
wp mirror, 04/07/2014 23:33:
> 2.1) Speed: Importation proceeds at less than 0.1 pages/sec. This means that, for the largest wikis (commonswiki, enwiki, wikidatawiki) importation cannot be completed before the `xincr' for the next day is posted.
Did you try https://meta.wikimedia.org/wiki/Data_dumps/ImportDump.php ?
https://www.mediawiki.org/wiki/Manual:Performance_tuning is also important. For instance, in the last import I tried, most of the time could be saved by setting up a proper l10n cache (on CDB).
Nemo
xmldatadumps-l@lists.wikimedia.org