Re: [Pywikipedia-l] search output

2 Aug 2009


On Sun, Aug 2, 2009 at 8:19 AM, Merlijn van Deen <valhallasw@arctus.nl> wrote:
...
On Sun, August 2, 2009 11:57 am, Stig Meireles Johansen wrote:
...
...
is about 10 times slower than just using four (much more readable)
lines of code:
<snip />
...
Additionally, xmlreader actually supports reading bzip2-ed xml (which is
probably faster than unzipping and running, and possibly even faster than
running it on the plain xml, depending on processor speed and disk speed):
Just for the fun of it, here are some "benchmarks" run against the uncompressed XML file:
stigmj@brage:~/t$ time python ../pywikipedia/search.py > t.2
real    1m40.745s
user    1m36.138s
sys     0m1.472s
stigmj@brage:~/t$ time ../bin/xml-search.pl nowiki-20090729-pages-articles.xml "{|" 0 > t.t
real    1m22.145s
user    1m20.453s
sys     0m1.204s
stigmj@brage:~/t$ time python ../pywikipedia/search.py > t.2
real    1m38.219s
user    1m35.490s
sys     0m1.800s
stigmj@brage:~/t$ time ../bin/xml-search.pl nowiki-20090729-pages-articles.xml "{|" 0 > t.t
real    1m24.474s
user    1m22.897s
sys     0m1.236s
*Running with the bzip2-compressed XML file.*
stigmj@brage:~/t$ time python ../pywikipedia/search.py > t.2
real    3m59.687s
user    3m53.591s
sys     0m0.640s
stigmj@brage:~/t$ time bunzip2 -c nowiki-20090729-pages-articles.xml.bz2 | ../bin/xml-search.pl - "{|" 0 > t.t
real    2m42.841s
user    4m8.804s
sys     0m2.388s
stigmj@brage:~/t$ time python ../pywikipedia/search.py > t.2
real    3m53.044s
user    3m48.510s
sys     0m0.620s
stigmj@brage:~/t$ time bunzip2 -c nowiki-20090729-pages-articles.xml.bz2 | ../bin/xml-search.pl - "{|" 0 > t.t
real    2m49.320s
user    4m10.772s
sys     0m2.448s
stigmj@brage:~/t$ time python ../pywikipedia/search.py > t.2
real    3m46.337s
user    3m44.318s
sys     0m0.644s
stigmj@brage:~/t$ time bunzip2 -c nowiki-20090729-pages-articles.xml.bz2 | ../bin/xml-search.pl - "{|" 0 > t.t
real    2m54.216s
user    4m11.724s
sys     0m2.568s
When piping from bunzip2 I get to use both processors (dual Xeon, 3 GHz),
which is why "user" is larger than "real" in those runs... so it goes a
little bit faster... :)
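
For anyone who wants to try the same kind of search without either script, here is a minimal sketch that uses only the Python standard library. It is not the search.py timed above and it does not use xmlreader; `search_dump` and `local_name` are names made up for this illustration, and it assumes the dump has the usual <page>/<title>/<text> layout.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def search_dump(path, needle):
    """Yield the titles of pages whose text contains `needle`.

    Hypothetical helper: reads a plain or bzip2-compressed XML dump
    as a stream, so the whole file is never held in memory at once.
    """
    # Pick the opener from the file extension (an assumption of this sketch).
    opener = bz2.open if path.endswith(".bz2") else open
    with opener(path, "rb") as stream:
        title = None
        for _event, elem in ET.iterparse(stream, events=("end",)):
            name = local_name(elem.tag)
            if name == "title":
                title = elem.text
            elif name == "text":
                if needle in (elem.text or ""):
                    yield title
            elif name == "page":
                # Drop the finished page's children so its text can be freed.
                elem.clear()
                title = None
```

Something like `list(search_dump("nowiki-20090729-pages-articles.xml.bz2", "{|"))` would then collect the matching titles. Decompression and parsing share one process here, so I would not expect it to benefit from the second processor the way the bunzip2 pipe does.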
Well, this was a fun waste of time.. I believe the OP now has a solution
either way.... ;)
/Stigmj
