--- On Sat, 29/8/09, Bilal Abdul Kader bilalak@gmail.com wrote:
From: Bilal Abdul Kader bilalak@gmail.com
Subject: [Xmldatadumps-l] Very slow import of XML dumps
To: xmldatadumps-l@lists.wikimedia.org
Date: Saturday, 29 August 2009, 8:40

Greetings, I am trying to import the French wiki (full-history XML) on an Ubuntu machine with a modern quad-core CPU and 16 GB of RAM. The import command is the following:
java -Xmn256M -Xms396M -Xmx512M \
  -XX:+DisableExplicitGC -verbose:gc \
  -XX:NewSize=32m -XX:MaxNewSize=64m -XX:SurvivorRatio=6 \
  -XX:+UseParallelGC -XX:GCTimeRatio=9 -XX:AdaptiveSizeDecrementScaleFactor=1 \
  -jar mwdumper.jar --format=sql:1.5 frwiki-20090810-pages-meta-history.xml.bz2 \
  | mysql -u wiki -p frwikiLatest
I have disabled autocommit in MySQL, as well as the foreign key checks and unique checks. I have also set the buffer pool size, log buffer size, and related buffer sizes to large values, as recommended for good MySQL bulk-load performance.
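Roughly, the setup looks like this; the session switches are prepended to the mwdumper stream so they apply to the same MySQL connection, and the my.cnf values below are only illustrative for a 16 GB machine:

( echo "SET autocommit=0; SET unique_checks=0; SET foreign_key_checks=0;"
  java -jar mwdumper.jar --format=sql:1.5 frwiki-20090810-pages-meta-history.xml.bz2
  echo "COMMIT;"
) | mysql -u wiki -p frwikiLatest

# my.cnf (server side, example values only):
#   innodb_buffer_pool_size        = 8G
#   innodb_log_buffer_size         = 64M
#   innodb_flush_log_at_trx_commit = 0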
After around three minutes of running the above command, I got:
6 pages (0.083/sec), 1,000 revs (13.889/sec)
8 pages (0.038/sec), 2,000 revs (9.378/sec)
13 pages (0.041/sec), 3,000 revs (9.458/sec)
The source file is on its own physical disk, and the MySQL data folder is on another physical disk. Both disks are very fast.
Any suggestions on how to improve the speed?
Hi Bilal.
Those are pretty standard results.
At the beginning, the dump usually contains articles with many revisions, so it is normal that the parser takes a long time to load the data. Later on, it should speed up significantly.
The main bottleneck, however, is neither the parser nor MySQL: it is 7zip. It compresses remarkably well, but according to some tests I ran on my server (multi-core, RAID disks, plenty of RAM and all the "usual refinements"), 7zip is also by far the slowest at decompression, mainly because decompression is single-threaded (unlike its multi-threaded compression).
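For a .bz2 dump like yours, one thing you can try is moving decompression into its own process so it does not compete with the parser for a core; this assumes your mwdumper build reads the XML from stdin when no input file is given, so check that first:

bzcat frwiki-20090810-pages-meta-history.xml.bz2 \
  | java -jar mwdumper.jar --format=sql:1.5 \
  | mysql -u wiki -p frwikiLatest

# pbzip2 -dc can replace bzcat, but note that it only decompresses in
# parallel files that pbzip2 itself compressed; on a stock bzip2 file
# it falls back to a single thread.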
Otherwise, perhaps you can give the WikiXRay Python parser a try:
http://meta.wikimedia.org/wiki/WikiXRay_Python_parser
Another issue is that the InnoDB tables (page, revision, text) do not show the number of records, although the size of the tables is non-zero. I think this might be related to the DISABLE KEYS query. Is that correct?
I don't understand your question. Do you mean you see some results with a plain SELECT * but SELECT COUNT(*) shows nothing? Perhaps something fails at some point in the decompression/parsing chain and the data is never actually loaded into MySQL. Check that mwdumper is producing valid INSERT statements.
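One thing worth knowing: for InnoDB, the row count reported by SHOW TABLE STATUS (and by many GUI tools) is only an estimate and can read zero or wildly off right after a bulk load; COUNT(*) is the authoritative check:

mysql -u wiki -p frwikiLatest -e \
  "SELECT COUNT(*) FROM page; SELECT COUNT(*) FROM revision; SELECT COUNT(*) FROM text;"

Also note that ALTER TABLE ... DISABLE KEYS only affects MyISAM tables, so it would not change what InnoDB reports.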
Best, F.
bilal
-----Inline attachment follows-----
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l