Greetings, I am trying to import the French wiki (full-history XML) on an Ubuntu machine with a modern quad-core CPU and 16 GB RAM. The import command is the following:
java -Xmn256M -Xms396M -Xmx512M -XX:+DisableExplicitGC -verbose:gc -XX:NewSize=32m -XX:MaxNewSize=64m -XX:SurvivorRatio=6 -XX:+UseParallelGC -XX:GCTimeRatio=9 -XX:AdaptiveSizeDecrementScaleFactor=1 -jar mwdumper.jar --format=sql:1.5 frwiki-20090810-pages-meta-history.xml.bz2 | mysql -u wiki -p frwikiLatest
I have disabled autocommit for MySQL, and disabled foreign key checks and unique checks. I have also set the buffer pool size, the log buffer size, and the buffer size to large values, as recommended for good MySQL performance.
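For reference, a minimal sketch of how those session settings can be prepended to the pipeline itself, so they apply to the very MySQL session doing the import (this assumes mwdumper writes only SQL to stdout, with its progress messages going to stderr, which is how the pipeline above already behaves):

( echo "SET autocommit=0; SET unique_checks=0; SET foreign_key_checks=0;"
  java -jar mwdumper.jar --format=sql:1.5 frwiki-20090810-pages-meta-history.xml.bz2
  echo "COMMIT;"
) | mysql -u wiki -p frwikiLatest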
After around 3 minutes of running the above command, I have got:
6 pages (0.083/sec), 1,000 revs (13.889/sec)
8 pages (0.038/sec), 2,000 revs (9.378/sec)
13 pages (0.041/sec), 3,000 revs (9.458/sec)
The source file is on its own physical disk and the MySQL data folder is on another physical disk. Both disks are very fast.
Any suggestions on how to improve the speed?

Another issue is that the InnoDB tables (page, revision, text) do not show the number of records, although the size of the tables is non-zero. I think this might be related to the DISABLE KEYS query. Is that correct?

bilal
--- On Sat, 29/8/09, Bilal Abdul Kader bilalak@gmail.com wrote:
From: Bilal Abdul Kader bilalak@gmail.com
Subject: [Xmldatadumps-l] Very slow import of XML dumps
To: xmldatadumps-l@lists.wikimedia.org
Date: Saturday, 29 August 2009, 8:40

> [snip]
> Any suggestions on how to improve the speed?
Hi Bilal.
Those are pretty standard results.
At the beginning, the dump usually contains articles with many revisions, so it's normal that the parser takes a long time to load the data. Later on, it should speed up significantly.
The main bottleneck, however, is neither the parser nor MySQL. It is 7zip. It has remarkably good compression performance, but according to some tests I did on my server (multi-core, RAID disks, a lot of RAM and all the "usual refinements"), 7zip is also, by far, the slowest when it comes to decompression. This is mainly because decompression is single-threaded (compared with its multi-threaded compression).
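A quick way to check whether decompression alone dominates the import is to time it in isolation (a sketch, using the streaming bzip2 decompressor on the same dump file as above):

time bzcat frwiki-20090810-pages-meta-history.xml.bz2 > /dev/null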
Otherwise, perhaps you can give the WikiXRay Python parser a try:
http://meta.wikimedia.org/wiki/WikiXRay_Python_parser
> Another issue is that the InnoDB tables (page, revision, text) do not show the number of records, although the size of the tables is non-zero. I think this might be related to the DISABLE KEYS query. Is that correct?
I don't understand your question. Do you mean you see some results with a plain SELECT *, but SELECT COUNT(*) shows nothing? Perhaps you have a problem at some point in the decompressing/parsing chain and the data is never actually loaded into MySQL. Check that mwdumper is producing valid INSERT queries.
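A sketch of how to check both things, assuming the standard MediaWiki table names (page, revision, text) used above:

java -jar mwdumper.jar --format=sql:1.5 frwiki-20090810-pages-meta-history.xml.bz2 | head -n 50
mysql -u wiki -p frwikiLatest -e "SELECT COUNT(*) FROM page; SELECT COUNT(*) FROM revision; SELECT COUNT(*) FROM text;"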
Best, F.
On 8/31/09 11:24 AM, Felipe Ortega wrote:
> The main bottleneck, however, is neither the parser nor MySQL. It is 7zip. It has remarkably good compression performance, but according to some tests I did on my server (multi-core, RAID disks, a lot of RAM and all the "usual refinements"), 7zip is also, by far, the slowest when it comes to decompression. This is mainly because decompression is single-threaded (compared with its multi-threaded compression).
This is entirely the opposite of my experience -- 7zip LZMA has always been *horribly* slow at compression (worse than anything else I've tried) while decompression is nearly as fast as gzip decompression.
Compare to bzip2, which is roughly symmetrical for compression and decompression, and both are orders of magnitude slower than gzip.
Can you provide a clear benchmark showing the alleged slow decompression?
-- brion
--- On Tue, 1/9/09, Brion Vibber brion@wikimedia.org wrote:
From: Brion Vibber brion@wikimedia.org
Subject: Re: [Xmldatadumps-l] Very slow import of XML dumps
To: xmldatadumps-l@lists.wikimedia.org
Date: Tuesday, 1 September 2009, 8:35

> On 8/31/09 11:24 AM, Felipe Ortega wrote:
>> The main bottleneck, however, is neither the parser nor MySQL. It is 7zip. It has remarkably good compression performance, but according to some tests I did on my server (multi-core, RAID disks, a lot of RAM and all the "usual refinements"), 7zip is also, by far, the slowest when it comes to decompression. This is mainly because decompression is single-threaded (compared with its multi-threaded compression).
> This is entirely the opposite of my experience -- 7zip LZMA has always been *horribly* slow at compression (worse than anything else I've tried) while decompression is nearly as fast as gzip decompression.
>
> Compare to bzip2, which is roughly symmetrical for compression and decompression, and both are orders of magnitude slower than gzip.
>
> Can you provide a clear benchmark showing the alleged slow decompression?
>
> -- brion
Sure. Below, I copy-paste the summary.
Maybe the programs have changed their performance over the past year, but I don't expect significant improvements.
I said 7zip has relatively good performance (especially for plain text files). Of course, GZIP is always the fastest in both actions (but also the least optimal in terms of compression ratio, as expected). It's only surpassed by PIGZ, which I compiled from source for this benchmark.
As for the programs with the best compression ratios (BZIP, PBZIP and 7ZIP), I learned that the complex dictionaries needed for decompression prevented the software from taking full advantage of multi-threading. If you got symmetrical performance for PBZIP, maybe you used some option/flag I'm missing?
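For what it's worth, a sketch of forcing a thread count in pbzip2 (the -p flag sets the number of processors; note that, as far as I know, pbzip2 can only decompress in parallel files that were compressed with pbzip2 itself):

time pbzip2 -p2 -d alswiki-latest-pages-meta-history.xml.bz2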
For example, I had to manually increase the number of threads for 7ZIP to speed it up, as you can see. It will depend on the number of cores you have and the speed of your file system + RAID (with more cores, in theory you can keep increasing the number of threads until you stop getting significant improvements, but I haven't tested that limit yet).
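Something like this (a sketch; -mmt is 7-Zip's switch for the number of CPU threads, though the exact syntax and the benefit depend on the version and compression method):

time 7za a -mmt=2 prueba.7z alswiki-latest-pages-meta-history.xml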
Best, F.
===========
Tested on alswiki-latest-pages-meta-history.xml (XML archive).
-rw------- 1 jfelipe jfelipe 748M 2008-02-24 20:56 alswiki-latest-pages-meta-history.xml
*******************************************
SUMMARY (ORIGINAL FILE: XML TEXT, 748M)

PROGRAM   COMPRESSION   UNCOMPRESSION   COMP. SIZE
7ZIP      2m23s         11.84s          13M
BZIP      4m33s         22.46s          35M
PBZIP     2m41s         21.20s          35M
GZIP      24.19s        10.42s          91M
PIGZ      17.097s       9.771s          91M (*)

(*) Best version: 16 threads for compression, 2 threads for uncompression.
*******************************************
=======================
DATA (VERBOSE :P)
=======================

Data format:

PROGRAM NAME
=============================
SIZE OF COMPRESSED FILE
UNCOMPRESSION
COMPRESSION
*******************************************
7ZIP
=====================================
-rw-r--r-- 1 jfelipe jfelipe 13M 2008-02-26 00:45 alswiki-latest-pages-meta-history.xml.7z
jfelipe@bluestorm:~/Research/SVN/BerliOS/trunk/dumps$ time 7za e alswiki-latest-pages-meta-history.xml.7z
7-Zip (A) 4.51 beta  Copyright (c) 1999-2007 Igor Pavlov  2007-07-25
p7zip Version 4.51 (locale=es_ES.UTF-8,Utf16=on,HugeFiles=on,2 CPUs)
Processing archive: alswiki-latest-pages-meta-history.xml.7z
Extracting alswiki-latest-pages-meta-history.xml
Everything is Ok
real    0m11.854s
user    0m6.740s
sys     0m1.636s
7-Zip (A) 4.51 beta  Copyright (c) 1999-2007 Igor Pavlov  2007-07-25
p7zip Version 4.51 (locale=es_ES.UTF-8,Utf16=on,HugeFiles=on,2 CPUs)

Scanning
Creating archive prueba.7z
Compressing alswiki-latest-pages-meta-history.xml
Everything is Ok
real    2m23.532s
user    2m37.414s
sys     0m2.744s
***********************************************************
BZIP
===========================================
-rw------- 1 jfelipe jfelipe 35M 2008-02-24 20:40 alswiki-latest-pages-meta-history.xml.bz2
jfelipe@bluestorm:~/Research/SVN/BerliOS/trunk/dumps$ time bzip2 -d alswiki-latest-pages-meta-history.xml.bz2
real    0m22.461s
user    0m20.305s
sys     0m1.668s
jfelipe@bluestorm:~/Research/SVN/BerliOS/trunk/dumps$ time bzip2 alswiki-latest-pages-meta-history.xml
real    4m33.034s
user    4m28.181s
sys     0m1.144s
***********************************************************
PBZIP (PARALLEL VERSION OF BZIP FOR SMPs)
===========================================
-rw------- 1 jfelipe jfelipe 35M 2008-02-24 20:56 alswiki-latest-pages-meta-history.xml.bz2
jfelipe@bluestorm:~/Research/SVN/BerliOS/trunk/dumps$ time pbzip2 -d alswiki-latest-pages-meta-history.xml.bz2
real    0m21.205s
user    0m35.902s
sys     0m2.544s
jfelipe@bluestorm:~/Research/SVN/BerliOS/trunk/dumps$ time pbzip2 alswiki-latest-pages-meta-history.xml
real    2m41.252s
user    5m5.015s
sys     0m2.100s
***********************************************************
GZIP
===========================================
-rw------- 1 jfelipe jfelipe 91M 2008-02-24 20:56 alswiki-latest-pages-meta-history.xml.gz
jfelipe@bluestorm:~/Research/SVN/BerliOS/trunk/dumps$ time gunzip alswiki-latest-pages-meta-history.xml.gz
real    0m10.427s
user    0m5.072s
sys     0m1.464s
jfelipe@bluestorm:~/Research/SVN/BerliOS/trunk/dumps$ time gzip alswiki-latest-pages-meta-history.xml
real    0m24.192s
user    0m23.157s
sys     0m0.500s
***********************************************************
PIGZ (PARALLEL VERSION OF GZIP, FOR SMPs)
=====================================
(NOTE:)
jfelipe@bluestorm:~$ wget http://zlib.net/pigz17.c.gz
jfelipe@bluestorm:~$ gcc -lpthread -lz -o pigz17 pigz17.c
Sample command line: pigz -p 10 -v (filename)
=====
-rw------- 1 jfelipe jfelipe 91M 2008-02-24 20:56 alswiki-latest-pages-meta-history.xml.gz
=====
(UNCOMPRESSION, 2 CORES, DEFAULT MODE)
jfelipe@bluestorm:~/Research/SVN/BerliOS/trunk/dumps$ time ./pigz17 -d alswiki-latest-pages-meta-history.xml.gz
real    0m9.859s
user    0m3.868s
sys     0m1.796s
(UNCOMPRESSION, 2 CORES, 2 THREADS (-p 2))
jfelipe@bluestorm:~/Research/SVN/BerliOS/trunk/dumps$ time ./pigz17 -d -p 2 alswiki-latest-pages-meta-history.xml.gz
real    0m9.771s
user    0m3.360s
sys     0m1.900s
(UNCOMPRESSION, 2 CORES, 10 THREADS (-p 10))
jfelipe@bluestorm:~/Research/SVN/BerliOS/trunk/dumps$ time ./pigz17 -d -p 10 alswiki-latest-pages-meta-history.xml.gz
real    0m10.509s
user    0m3.536s
sys     0m2.020s
(UNCOMPRESSION, 2 CORES, 16 THREADS (-p 16))
jfelipe@bluestorm:~/Research/SVN/BerliOS/trunk/dumps$ time ./pigz17 -d -p 16 alswiki-latest-pages-meta-history.xml.gz
real    0m11.451s
user    0m3.844s
sys     0m2.092s
*****
(COMPRESSION, 2 CORES, 2 THREADS (-p 2))
jfelipe@bluestorm:~/Research/SVN/BerliOS/trunk/dumps$ time ./pigz17 -p 2 alswiki-latest-pages-meta-history.xml
real    0m28.822s
user    0m27.718s
sys     0m0.856s
(COMPRESSION, 2 CORES, DEFAULT MODE)
jfelipe@bluestorm:~/Research/SVN/BerliOS/trunk/dumps$ time ./pigz17 alswiki-latest-pages-meta-history.xml
real    0m17.456s
user    0m28.614s
sys     0m0.916s
(COMPRESSION, 2 CORES, 10 THREADS (-p 10))
jfelipe@bluestorm:~/Research/SVN/BerliOS/trunk/dumps$ time ./pigz17 -p 10 alswiki-latest-pages-meta-history.xml
real    0m17.310s
user    0m27.170s
sys     0m0.808s
(COMPRESSION, 2 CORES, 16 THREADS (-p 16))
jfelipe@bluestorm:~/Research/SVN/BerliOS/trunk/dumps$ time ./pigz17 -p 16 alswiki-latest-pages-meta-history.xml
real    0m17.097s
user    0m28.382s
sys     0m0.864s
***********************************************************
THE END
***********************************************************
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l