--- On Tue, 1/9/09, Brion Vibber <brion(a)wikimedia.org> wrote:
From: Brion Vibber <brion(a)wikimedia.org>
Subject: Re: [Xmldatadumps-l] Very slow import of XML dumps
To: xmldatadumps-l(a)lists.wikimedia.org
Date: Tuesday, 1 September 2009, 8:35
On 8/31/09 11:24 AM, Felipe Ortega wrote:
The main bottleneck, however, is neither the parser, nor MySQL. It is 7zip. It has a
remarkably good performance compressing, but according to some tests I did on my server
(multi core, RAID disk, a lot of RAM and all "usual refinements") 7zip is also (and by far)
the slowest one when it comes to uncompression. This is mainly because of the
single-threaded nature of this action (compared with its multi-threaded compression).
This is entirely the opposite of my experience -- 7zip LZMA has always been *horribly*
slow at compression (worse than anything else I've tried) while decompression is nearly
as fast as gzip decompression.
Compare to bzip2, which is roughly symmetrical for compression and decompression, and
both are orders of magnitude slower than gzip.
Can you provide a clear benchmark showing the alleged slow decompression?
-- brion
Sure. I've pasted the summary below.
Maybe the programs' performance has changed over the past year, but I don't
expect significant improvements.
I said 7zip has relatively good performance (especially for plain text files). Of course,
GZIP is always the fastest in both directions (but it also has the worst compression
ratio, as expected). It's only surpassed by PIGZ, which I compiled from source for
this benchmark.
As for the tools with the best compression ratios (BZIP, PBZIP and 7ZIP), I learned that the
complex dictionaries needed for uncompression prevent the software from taking full advantage
of multi-threading strategies. If you got symmetrical performance for PBZIP, maybe you are
using some option/flag I'm missing?
For example, I had to manually increase the number of threads for 7ZIP to speed it up, as
you can see. It will depend on the number of cores you have and the speed of your file
system + RAID (with more cores you can, in theory, keep increasing the number of threads until
you stop seeing significant improvement, but I haven't tested that limit yet).
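The "increase the number of threads until you stop seeing significant improvement" rule can be sketched in a few lines. This is only an illustration: the function name `knee_thread_count` and the 5% threshold are assumptions of mine, and the sample timings are the PIGZ compression `real` times (for -p 2, -p 10 and -p 16) from the logs further down.

```python
def knee_thread_count(timings, min_gain=0.05):
    """Return the largest tested thread count that still paid off.

    `timings` maps thread count -> wall-clock seconds. Walking the counts in
    increasing order, we stop as soon as the next count improves on the
    current best by less than `min_gain` (a fraction, 0.05 = 5%).
    """
    counts = sorted(timings)
    best = counts[0]
    for candidate in counts[1:]:
        gain = (timings[best] - timings[candidate]) / timings[best]
        if gain < min_gain:
            break
        best = candidate
    return best


# PIGZ compression wall-clock times on the 2-core test machine.
PIGZ_COMPRESSION = {2: 28.822, 10: 17.310, 16: 17.097}

if __name__ == "__main__":
    print(knee_thread_count(PIGZ_COMPRESSION))
```

With only three measured points this says little about the true optimum; it just shows that going from 10 to 16 threads gained about 1% on this machine.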
Best,
F.
===========
Tested on alswiki-meta-history.xml (XML ARCHIVE).
-rw------- 1 jfelipe jfelipe 748M 2008-02-24 20:56 alswiki-latest-pages-meta-history.xml
*******************************************
SUMMARY (ORIGINAL FILE: XML TEXT, 748M)
PROGRAM   COMPRESSION   UNCOMPRESSION   COMP. SIZE
7ZIP      2m23s         11.85s          13M
BZIP      4m33s         22.46s          35M
PBZIP     2m41s         21.20s          35M
GZIP      24.19s        10.42s          91M
PIGZ      17.097s       9.771s          91M(*)
(*) BEST VERSION: 16 threads for compression, 2 threads for uncompression
*******************************************
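To make the summary easier to compare, here is a small sketch that derives the compression ratio and the uncompression throughput for each program. It uses the sizes from the summary and the wall-clock (`real`) uncompression times from the verbose logs below; the names `RESULTS` and `derived_metrics` are just illustrative, and throughput is expressed as MB of original XML produced per second.

```python
# Size of the uncompressed XML dump, in MB (from the ls listing).
ORIGINAL_MB = 748

# Compressed size in MB and wall-clock uncompression time in seconds.
RESULTS = {
    "7ZIP":  {"size_mb": 13, "uncompress_s": 11.854},
    "BZIP":  {"size_mb": 35, "uncompress_s": 22.461},
    "PBZIP": {"size_mb": 35, "uncompress_s": 21.205},
    "GZIP":  {"size_mb": 91, "uncompress_s": 10.427},
    "PIGZ":  {"size_mb": 91, "uncompress_s": 9.771},
}


def derived_metrics(original_mb, results):
    """Return {program: (compression_ratio, output_mb_per_second)}."""
    metrics = {}
    for program, row in results.items():
        ratio = original_mb / row["size_mb"]
        throughput = original_mb / row["uncompress_s"]
        metrics[program] = (ratio, throughput)
    return metrics


if __name__ == "__main__":
    for program, (ratio, mbps) in derived_metrics(ORIGINAL_MB, RESULTS).items():
        print(f"{program:<6} ratio {ratio:5.1f}:1   uncompress {mbps:5.1f} MB/s")
```

The sizes are the rounded values printed by `ls -h`, so the ratios are approximate.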
=======================
DATA (VERBOSE :P)
=======================
Data format:
PROGRAM NAME
=============================
SIZE OF COMPRESSED FILE
UNCOMPRESSION
COMPRESSION
*******************************************
7ZIP.
=====================================
-rw-r--r-- 1 jfelipe jfelipe 13M 2008-02-26 00:45 alswiki-latest-pages-meta-history.xml.7z
jfelipe@bluestorm:~/Research/SVN/BerliOS/trunk/dumps$ time 7za e alswiki-latest-pages-meta-history.xml.7z
7-Zip (A) 4.51 beta Copyright (c) 1999-2007 Igor Pavlov 2007-07-25
p7zip Version 4.51 (locale=es_ES.UTF-8,Utf16=on,HugeFiles=on,2 CPUs)
Processing archive: alswiki-latest-pages-meta-history.xml.7z
Extracting alswiki-latest-pages-meta-history.xml
Everything is Ok
real 0m11.854s
user 0m6.740s
sys 0m1.636s
7-Zip (A) 4.51 beta Copyright (c) 1999-2007 Igor Pavlov 2007-07-25
p7zip Version 4.51 (locale=es_ES.UTF-8,Utf16=on,HugeFiles=on,2 CPUs)
Scanning
Creating archive prueba.7z
Compressing alswiki-latest-pages-meta-history.xml
Everything is Ok
real 2m23.532s
user 2m37.414s
sys 0m2.744s
***********************************************************
***********************************************************
BZIP
===========================================
-rw------- 1 jfelipe jfelipe 35M 2008-02-24 20:40 alswiki-latest-pages-meta-history.xml.bz2
jfelipe@bluestorm:~/Research/SVN/BerliOS/trunk/dumps$ time bzip2 -d alswiki-latest-pages-meta-history.xml.bz2
real 0m22.461s
user 0m20.305s
sys 0m1.668s
jfelipe@bluestorm:~/Research/SVN/BerliOS/trunk/dumps$ time bzip2 alswiki-latest-pages-meta-history.xml
real 4m33.034s
user 4m28.181s
sys 0m1.144s
***********************************************************
***********************************************************
PBZIP (PARALLEL VERSION OF BZIP FOR SMPs)
===========================================
-rw------- 1 jfelipe jfelipe 35M 2008-02-24 20:56 alswiki-latest-pages-meta-history.xml.bz2
jfelipe@bluestorm:~/Research/SVN/BerliOS/trunk/dumps$ time pbzip2 -d alswiki-latest-pages-meta-history.xml.bz2
real 0m21.205s
user 0m35.902s
sys 0m2.544s
jfelipe@bluestorm:~/Research/SVN/BerliOS/trunk/dumps$ time pbzip2 alswiki-latest-pages-meta-history.xml
real 2m41.252s
user 5m5.015s
sys 0m2.100s
***********************************************************
***********************************************************
GZIP
===========================================
-rw------- 1 jfelipe jfelipe 91M 2008-02-24 20:56 alswiki-latest-pages-meta-history.xml.gz
jfelipe@bluestorm:~/Research/SVN/BerliOS/trunk/dumps$ time gunzip alswiki-latest-pages-meta-history.xml.gz
real 0m10.427s
user 0m5.072s
sys 0m1.464s
jfelipe@bluestorm:~/Research/SVN/BerliOS/trunk/dumps$ time gzip alswiki-latest-pages-meta-history.xml
real 0m24.192s
user 0m23.157s
sys 0m0.500s
***********************************************************
***********************************************************
PIGZ (PARALLEL VERSION OF GZIP, FOR SMPs)
=====================================
(NOTE: compiled from source)
jfelipe@bluestorm:~$ wget http://zlib.net/pigz17.c.gz
jfelipe@bluestorm:~$ gcc -lpthread -lz -o pigz17 pigz17.c
Sample command line:
pigz -p 10 -v (filename)
=====
-rw------- 1 jfelipe jfelipe 91M 2008-02-24 20:56 alswiki-latest-pages-meta-history.xml.gz
=====
(UNCOMPRESSION, 2 CORES, DEFAULT MODE)
jfelipe@bluestorm:~/Research/SVN/BerliOS/trunk/dumps$ time ./pigz17 -d alswiki-latest-pages-meta-history.xml.gz
real 0m9.859s
user 0m3.868s
sys 0m1.796s
(UNCOMPRESSION, 2 CORES, 2 THREADS (-p 2))
jfelipe@bluestorm:~/Research/SVN/BerliOS/trunk/dumps$ time ./pigz17 -d -p 2 alswiki-latest-pages-meta-history.xml.gz
real 0m9.771s
user 0m3.360s
sys 0m1.900s
(UNCOMPRESSION, 2 CORES, 10 THREADS (-p 10))
jfelipe@bluestorm:~/Research/SVN/BerliOS/trunk/dumps$ time ./pigz17 -d -p 10 alswiki-latest-pages-meta-history.xml.gz
real 0m10.509s
user 0m3.536s
sys 0m2.020s
(UNCOMPRESSION, 2 CORES, 16 THREADS (-p 16))
jfelipe@bluestorm:~/Research/SVN/BerliOS/trunk/dumps$ time ./pigz17 -d -p 16 alswiki-latest-pages-meta-history.xml.gz
real 0m11.451s
user 0m3.844s
sys 0m2.092s
*****
(COMPRESSION, 2 CORES, 2 THREADS (-p 2))
jfelipe@bluestorm:~/Research/SVN/BerliOS/trunk/dumps$ time ./pigz17 -p 2 alswiki-latest-pages-meta-history.xml
real 0m28.822s
user 0m27.718s
sys 0m0.856s
(COMPRESSION, 2 CORES, DEFAULT MODE)
jfelipe@bluestorm:~/Research/SVN/BerliOS/trunk/dumps$ time ./pigz17 alswiki-latest-pages-meta-history.xml
real 0m17.456s
user 0m28.614s
sys 0m0.916s
(COMPRESSION, 2 CORES, 10 THREADS (-p 10))
jfelipe@bluestorm:~/Research/SVN/BerliOS/trunk/dumps$ time ./pigz17 -p 10 alswiki-latest-pages-meta-history.xml
real 0m17.310s
user 0m27.170s
sys 0m0.808s
(COMPRESSION, 2 CORES, 16 THREADS (-p 16))
jfelipe@bluestorm:~/Research/SVN/BerliOS/trunk/dumps$ time ./pigz17 -p 16 alswiki-latest-pages-meta-history.xml
real 0m17.097s
user 0m28.382s
sys 0m0.864s
***********************************************************
THE END
***********************************************************
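One way to read the `time` output above is to compare CPU time (user + sys) with wall-clock time (real): a ratio close to 1 suggests the run was effectively single-threaded, while a ratio approaching the number of cores suggests the work was spread across them. The sketch below does that for the compression runs; `parse_time` and `cpu_to_wall_ratio` are helper names I made up for the illustration.

```python
import re

# Matches values printed by bash's `time`, e.g. "2m41.252s" or "9.771s".
_TIME_RE = re.compile(r"^(?:(\d+)m)?(\d+(?:\.\d+)?)s$")


def parse_time(text):
    """Convert a `time` value such as '2m41.252s' to seconds."""
    match = _TIME_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised time value: {text!r}")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + float(match.group(2))


def cpu_to_wall_ratio(real, user, system):
    """Return (user + sys) / real for one timed run."""
    return (parse_time(user) + parse_time(system)) / parse_time(real)


if __name__ == "__main__":
    # (real, user, sys) for the compression runs, copied from the logs.
    runs = {
        "bzip2": ("4m33.034s", "4m28.181s", "0m1.144s"),
        "pbzip2": ("2m41.252s", "5m5.015s", "0m2.100s"),
        "gzip": ("0m24.192s", "0m23.157s", "0m0.500s"),
        "pigz17 -p 16": ("0m17.097s", "0m28.382s", "0m0.864s"),
    }
    for name, (real, user, system) in runs.items():
        print(f"{name:<13} {cpu_to_wall_ratio(real, user, system):.2f}")
```

The same helper can be pointed at the uncompression runs to check how much of each tool's decompression actually ran in parallel on this 2-core machine.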