--- On Sat, 29/8/09, Bilal Abdul Kader bilalak@gmail.com wrote:
From: Bilal Abdul Kader bilalak@gmail.com
Subject: [Xmldatadumps-l] Very slow import of XML dumps
To: xmldatadumps-l@lists.wikimedia.org
Date: Saturday, 29 August 2009, 8:40

Greetings, I am trying to import the French wiki (full-history XML) on an Ubuntu machine with a modern quad-core CPU and 16 GB of RAM. The import command is the following:
java -Xmn256M -Xms396M -Xmx512M \
  -XX:+DisableExplicitGC -verbose:gc \
  -XX:NewSize=32m -XX:MaxNewSize=64m -XX:SurvivorRatio=6 \
  -XX:+UseParallelGC -XX:GCTimeRatio=9 -XX:AdaptiveSizeDecrementScaleFactor=1 \
  -jar mwdumper.jar --format=sql:1.5 frwiki-20090810-pages-meta-history.xml.bz2 \
  | mysql -u wiki -p frwikiLatest
I have disabled autocommit in MySQL, as well as the foreign key checks and unique checks. I have also set the buffer pool size, log buffer size, and related buffer sizes to large values, as recommended for good MySQL bulk-load performance.
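Roughly, the setup looks like this; the session switches are prepended to the mwdumper stream so they apply to the same MySQL connection, and the my.cnf values below are only illustrative for a 16 GB machine:

( echo "SET autocommit=0; SET unique_checks=0; SET foreign_key_checks=0;"
  java -jar mwdumper.jar --format=sql:1.5 frwiki-20090810-pages-meta-history.xml.bz2
  echo "COMMIT;"
) | mysql -u wiki -p frwikiLatest

# my.cnf (server side, example values only):
#   innodb_buffer_pool_size        = 8G
#   innodb_log_buffer_size         = 64M
#   innodb_flush_log_at_trx_commit = 0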
After around three minutes of running the above command, I got:
6 pages (0.083/sec), 1,000 revs (13.889/sec)
8 pages (0.038/sec), 2,000 revs (9.378/sec)
13 pages (0.041/sec), 3,000 revs (9.458/sec)
The source file is on its own physical disk, and the MySQL data folder is on another physical disk. Both disks are very fast.
Any suggestions on how to improve the speed?
Hi Bilal.
Those are pretty standard results.
At the beginning, the dump usually contains articles with many revisions, so it is normal that the parser takes a long time to load the data. Later on, it should speed up significantly.
The main bottleneck, however, is neither the parser nor MySQL: it is 7zip. It compresses remarkably well, but according to some tests I ran on my server (multi-core, RAID disks, plenty of RAM and all the "usual refinements"), 7zip is also by far the slowest at decompression, mainly because decompression is single-threaded (unlike its multi-threaded compression).
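For a .bz2 dump like yours, one thing you can try is moving decompression into its own process so it does not compete with the parser for a core; this assumes your mwdumper build reads the XML from stdin when no input file is given, so check that first:

bzcat frwiki-20090810-pages-meta-history.xml.bz2 \
  | java -jar mwdumper.jar --format=sql:1.5 \
  | mysql -u wiki -p frwikiLatest

# pbzip2 -dc can replace bzcat, but note that it only decompresses in
# parallel files that pbzip2 itself compressed; on a stock bzip2 file
# it falls back to a single thread.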
Otherwise, perhaps you can give the WikiXRay Python parser a try:
http://meta.wikimedia.org/wiki/WikiXRay_Python_parser
Another issue is that the InnoDB tables (page, revision, text) do not show the number of records, although the size of the tables is non-zero. I think this might be related to the DISABLE KEYS query. Is that correct?
I don't understand your question. Do you mean you see some results with a plain SELECT * but SELECT COUNT(*) shows nothing? Perhaps something fails at some point in the decompression/parsing chain and the data is never actually loaded into MySQL. Check that mwdumper is producing valid INSERT statements.
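One thing worth knowing: for InnoDB, the row count reported by SHOW TABLE STATUS (and by many GUI tools) is only an estimate and can read zero or wildly off right after a bulk load; COUNT(*) is the authoritative check:

mysql -u wiki -p frwikiLatest -e \
  "SELECT COUNT(*) FROM page; SELECT COUNT(*) FROM revision; SELECT COUNT(*) FROM text;"

Also note that ALTER TABLE ... DISABLE KEYS only affects MyISAM tables, so it would not change what InnoDB reports.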
Best, F.
bilal
-----Inline attachment follows-----
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l