--- On Tue, 1/9/09, Brion Vibber brion@wikimedia.org wrote:
From: Brion Vibber brion@wikimedia.org
Subject: Re: [Xmldatadumps-l] Very slow import of XML dumps
To: xmldatadumps-l@lists.wikimedia.org
Date: Tuesday, 1 September 2009, 8:35
On 8/31/09 11:24 AM, Felipe Ortega wrote:
The main bottleneck, however, is neither the parser, nor MySQL. It is 7zip. It has a remarkably good performance compressing, but according to some tests I did on my server (multi core, RAID disk, a lot of RAM and all "usual refinements") 7zip is also (and by far) the slowest one when it comes to uncompression. This is mainly because of the single-threaded nature of this action (compared with its multi-threaded compression).
This is entirely the opposite of my experience -- 7zip LZMA has always been *horribly* slow at compression (worse than anything else I've tried) while decompression is nearly as fast as gzip decompression.
Compare to bzip2, which is roughly symmetrical for compression and decompression, and both are orders of magnitude slower than gzip.
Can you provide a clear benchmark showing the alleged slow decompression?
-- brion
Sure. I've pasted the summary below.
The programs may have changed in performance over the past year, but I don't expect significant improvements.
I said 7zip has relatively good performance (especially for plain text files). Of course, GZIP is the fastest in both actions (but also the least effective in terms of compression ratio, as expected). It's only surpassed by PIGZ, which I compiled from source for this benchmark.
As for the tools with the best compression ratios (BZIP, PBZIP and 7ZIP), I learned that the complex dictionaries needed for uncompression prevent the software from taking full advantage of multi-threading strategies. If you got symmetrical performance for PBZIP, maybe you use some option/flag I'm missing?
For example, I had to manually increase the number of threads for 7ZIP to speed it up, as you can see. It will depend on the number of cores you have and the speed of your file system and RAID (with more cores you can, in theory, keep increasing the number of threads until there is no significant improvement, but I haven't tested that limit yet).
Best, F.
=========== Tested on alswiki-meta-history.xml (XML ARCHIVE).
-rw------- 1 jfelipe jfelipe 748M 2008-02-24 20:56 alswiki-latest-pages-meta-history.xml
*******************************************
SUMMARY (ORIGINAL FILE: XML TEXT, 748M)
PROGRAM   COMPRESSION   UNCOMPRESSION   COMP. SIZE
7ZIP      2m23s         11.85s          13M
BZIP      4m33s         22.46s          35M
PBZIP     2m41s         21.20s          35M
GZIP      24.19s        10.42s          91M
PIGZ      17.097s       9.771s          91M(*)
(*) BEST VERSION: 16 threads for compression, 2 threads for uncompression
*******************************************
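The compression ratios implied by the summary follow from dividing the original size by each compressed size. Here is a minimal sketch with awk, using only the sizes listed in the summary. The function name `ratios` is made up for illustration, and the sizes are the rounded values printed by `ls -h`, so the resulting ratios are approximate.

```shell
#!/bin/sh
# Original file size in MB, as listed in the summary (rounded ls -h value).
ORIG_MB=748

# ratios: read "PROGRAM SIZE_MB" lines on stdin and print "PROGRAM RATIO",
# where RATIO = original size / compressed size, to one decimal place.
# LC_ALL=C keeps the decimal separator a dot regardless of the locale.
ratios() {
    LC_ALL=C awk -v orig="$ORIG_MB" '{ printf "%s %.1f\n", $1, orig / $2 }'
}

# Compressed sizes in MB, copied from the summary.
ratios <<'EOF'
7ZIP 13
BZIP 35
GZIP 91
EOF
```

With these sizes, 7ZIP comes out at roughly 57x, BZIP at roughly 21x and GZIP at roughly 8x. PBZIP and PIGZ produce the same sizes as BZIP and GZIP, so their ratios are the same.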
======================= DATA (VERBOSE :P) =======================
Data format:
PROGRAM NAME =============================
SIZE OF COMPRESSED FILE
UNCOMPRESSION
COMPRESSION
*******************************************
7ZIP. =====================================
-rw-r--r-- 1 jfelipe jfelipe 13M 2008-02-26 00:45 alswiki-latest-pages-meta-history.xml.7z
jfelipe@bluestorm:~/Research/SVN/BerliOS/trunk/dumps$ time 7za e alswiki-latest-pages-meta-history.xml.7z
7-Zip (A) 4.51 beta Copyright (c) 1999-2007 Igor Pavlov 2007-07-25
p7zip Version 4.51 (locale=es_ES.UTF-8,Utf16=on,HugeFiles=on,2 CPUs)
Processing archive: alswiki-latest-pages-meta-history.xml.7z
Extracting alswiki-latest-pages-meta-history.xml
Everything is Ok
real 0m11.854s user 0m6.740s sys 0m1.636s
7-Zip (A) 4.51 beta Copyright (c) 1999-2007 Igor Pavlov 2007-07-25
p7zip Version 4.51 (locale=es_ES.UTF-8,Utf16=on,HugeFiles=on,2 CPUs)
Scanning
Creating archive prueba.7z
Compressing alswiki-latest-pages-meta-history.xml
Everything is Ok
real 2m23.532s user 2m37.414s sys 0m2.744s
*********************************************************** ***********************************************************
BZIP ===========================================
-rw------- 1 jfelipe jfelipe 35M 2008-02-24 20:40 alswiki-latest-pages-meta-history.xml.bz2
jfelipe@bluestorm:~/Research/SVN/BerliOS/trunk/dumps$ time bzip2 -d alswiki-latest-pages-meta-history.xml.bz2
real 0m22.461s user 0m20.305s sys 0m1.668s
jfelipe@bluestorm:~/Research/SVN/BerliOS/trunk/dumps$ time bzip2 alswiki-latest-pages-meta-history.xml
real 4m33.034s user 4m28.181s sys 0m1.144s
*********************************************************** ***********************************************************
PBZIP (PARALLEL VERSION OF BZIP FOR SMPs) ===========================================
-rw------- 1 jfelipe jfelipe 35M 2008-02-24 20:56 alswiki-latest-pages-meta-history.xml.bz2
jfelipe@bluestorm:~/Research/SVN/BerliOS/trunk/dumps$ time pbzip2 -d alswiki-latest-pages-meta-history.xml.bz2
real 0m21.205s user 0m35.902s sys 0m2.544s
jfelipe@bluestorm:~/Research/SVN/BerliOS/trunk/dumps$ time pbzip2 alswiki-latest-pages-meta-history.xml
real 2m41.252s user 5m5.015s sys 0m2.100s
*********************************************************** ***********************************************************
GZIP ===========================================
-rw------- 1 jfelipe jfelipe 91M 2008-02-24 20:56 alswiki-latest-pages-meta-history.xml.gz
jfelipe@bluestorm:~/Research/SVN/BerliOS/trunk/dumps$ time gunzip alswiki-latest-pages-meta-history.xml.gz
real 0m10.427s user 0m5.072s sys 0m1.464s
jfelipe@bluestorm:~/Research/SVN/BerliOS/trunk/dumps$ time gzip alswiki-latest-pages-meta-history.xml
real 0m24.192s user 0m23.157s sys 0m0.500s
*********************************************************** ***********************************************************
PIGZ (PARALLEL VERSION OF GZIP, FOR SMPs) =====================================
(NOTE:)
jfelipe@bluestorm:~$ wget http://zlib.net/pigz17.c.gz
jfelipe@bluestorm:~$ gcc -lpthread -lz -o pigz17 pigz17.c
Sample command line: pigz -p 10 -v (filename)
===== -rw------- 1 jfelipe jfelipe 91M 2008-02-24 20:56 alswiki-latest-pages-meta-history.xml.gz =====
(UNCOMPRESSION, 2 CORES, DEFAULT MODE)
jfelipe@bluestorm:~/Research/SVN/BerliOS/trunk/dumps$ time ./pigz17 -d alswiki-latest-pages-meta-history.xml.gz
real 0m9.859s user 0m3.868s sys 0m1.796s
(UNCOMPRESSION, 2 CORES, 2 THREADS (-p 2))
jfelipe@bluestorm:~/Research/SVN/BerliOS/trunk/dumps$ time ./pigz17 -d -p 2 alswiki-latest-pages-meta-history.xml.gz
real 0m9.771s user 0m3.360s sys 0m1.900s
(UNCOMPRESSION, 2 CORES, 10 THREADS (-p 10))
jfelipe@bluestorm:~/Research/SVN/BerliOS/trunk/dumps$ time ./pigz17 -d -p 10 alswiki-latest-pages-meta-history.xml.gz
real 0m10.509s user 0m3.536s sys 0m2.020s
(UNCOMPRESSION, 2 CORES, 16 THREADS (-p 16))
jfelipe@bluestorm:~/Research/SVN/BerliOS/trunk/dumps$ time ./pigz17 -d -p 16 alswiki-latest-pages-meta-history.xml.gz
real 0m11.451s user 0m3.844s sys 0m2.092s
*****
(COMPRESSION, 2 CORES, 2 THREADS (-p 2))
jfelipe@bluestorm:~/Research/SVN/BerliOS/trunk/dumps$ time ./pigz17 -p 2 alswiki-latest-pages-meta-history.xml
real 0m28.822s user 0m27.718s sys 0m0.856s
(COMPRESSION, 2 CORES, DEFAULT MODE)
jfelipe@bluestorm:~/Research/SVN/BerliOS/trunk/dumps$ time ./pigz17 alswiki-latest-pages-meta-history.xml
real 0m17.456s user 0m28.614s sys 0m0.916s
(COMPRESSION, 2 CORES, 10 THREADS (-p 10))
jfelipe@bluestorm:~/Research/SVN/BerliOS/trunk/dumps$ time ./pigz17 -p 10 alswiki-latest-pages-meta-history.xml
real 0m17.310s user 0m27.170s sys 0m0.808s
(COMPRESSION, 2 CORES, 16 THREADS (-p 16))
jfelipe@bluestorm:~/Research/SVN/BerliOS/trunk/dumps$ time ./pigz17 -p 16 alswiki-latest-pages-meta-history.xml
real 0m17.097s user 0m28.382s sys 0m0.864s
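The manual thread sweep above could be scripted along these lines. This is only a sketch under several assumptions, not the procedure used for the numbers in this message: the function name `sweep_threads` is made up, the tool is assumed to accept `-p N` together with `-c` (write to stdout), as current pigz releases do, and the timing relies on GNU date's `%N` (nanoseconds). The pigz17 build used above may behave differently.

```shell
#!/bin/sh
# sweep_threads TOOL FILE N...
# Compress FILE to /dev/null once per thread count N and print
# "N ELAPSED_MS" for each run. Writing to stdout (-c) leaves FILE in
# place, so every run starts from the same input.
# Assumptions: TOOL accepts pigz-style "-p N" and "-c"; date is GNU date.
sweep_threads() {
    tool=$1
    file=$2
    shift 2
    for n in "$@"; do
        start=$(date +%s%N)
        "$tool" -p "$n" -c "$file" > /dev/null || return 1
        end=$(date +%s%N)
        # Convert the nanosecond difference to whole milliseconds.
        echo "$n $(( (end - start) / 1000000 ))"
    done
}

# Example call (hypothetical file name):
#   sweep_threads pigz dump.xml 1 2 4 8 16
```

Each run reads the same file, so the first run may pay for a cold disk cache. Repeating the sweep and discarding the first pass gives steadier numbers.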
*********************************************************** THE END ***********************************************************
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l