On Sun, Aug 2, 2009 at 8:19 AM, Merlijn van Deen <valhallasw@arctus.nl> wrote:
On Sun, August 2, 2009 11:57 am, Stig Meireles Johansen wrote:
is about 10 times slower than just using four (much more readable) lines of code:
<snip />
Additionally, xmlreader actually supports reading bzip2-ed xml (which is probably faster than unzipping and running, and possibly even faster than running it on the plain xml, depending on processor speed and disk speed):
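As a rough illustration of searching a bzip2-ed dump directly, here is a minimal sketch that uses only the Python standard library rather than pywikipedia's xmlreader. The function names `find_pages` and `find_pages_in_bz2` are hypothetical, and the sketch assumes the dump uses `title`, `text` and `page` elements (possibly namespaced); it is not the code that was snipped above.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def find_pages(stream, needle):
    """Yield titles of pages whose text contains `needle`.

    `stream` is any binary file-like object holding the XML dump
    (assumed layout: <page> containing <title> and <text>).
    """
    title = None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            if needle in (elem.text or ""):
                yield title
        elif name == "page":
            # Drop the finished page's contents to limit memory use.
            elem.clear()
            title = None


def find_pages_in_bz2(path, needle):
    """Search a .bz2 dump without decompressing it to disk first."""
    with bz2.open(path, "rb") as stream:
        yield from find_pages(stream, needle)
```

Something like `for title in find_pages_in_bz2("nowiki-20090729-pages-articles.xml.bz2", "{|"): print(title)` would then list the matching pages. Whether this beats reading the plain XML depends, as noted, on processor and disk speed.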
Just for the fun of it, here are some "benchmarks" running on the XML-file:
stigmj@brage:~/t$ time python ../pywikipedia/search.py > t.2

real 1m40.745s
user 1m36.138s
sys  0m1.472s

stigmj@brage:~/t$ time ../bin/xml-search.pl nowiki-20090729-pages-articles.xml "{|" 0 > t.t

real 1m22.145s
user 1m20.453s
sys  0m1.204s

stigmj@brage:~/t$ time python ../pywikipedia/search.py > t.2

real 1m38.219s
user 1m35.490s
sys  0m1.800s

stigmj@brage:~/t$ time ../bin/xml-search.pl nowiki-20090729-pages-articles.xml "{|" 0 > t.t

real 1m24.474s
user 1m22.897s
sys  0m1.236s

*Running with Bzip2'ed xml-file.*

stigmj@brage:~/t$ time python ../pywikipedia/search.py > t.2

real 3m59.687s
user 3m53.591s
sys  0m0.640s

stigmj@brage:~/t$ time bunzip2 -c nowiki-20090729-pages-articles.xml.bz2 | ../bin/xml-search.pl - "{|" 0 > t.t

real 2m42.841s
user 4m8.804s
sys  0m2.388s

stigmj@brage:~/t$ time python ../pywikipedia/search.py > t.2

real 3m53.044s
user 3m48.510s
sys  0m0.620s

stigmj@brage:~/t$ time bunzip2 -c nowiki-20090729-pages-articles.xml.bz2 | ../bin/xml-search.pl - "{|" 0 > t.t

real 2m49.320s
user 4m10.772s
sys  0m2.448s

stigmj@brage:~/t$ time python ../pywikipedia/search.py > t.2

real 3m46.337s
user 3m44.318s
sys  0m0.644s

stigmj@brage:~/t$ time bunzip2 -c nowiki-20090729-pages-articles.xml.bz2 | ../bin/xml-search.pl - "{|" 0 > t.t

real 2m54.216s
user 4m11.724s
sys  0m2.568s

When piping from bunzip2 I get to use both processors (dual Xeon, 3 GHz), since decompression and searching run as separate processes... so it goes a little bit faster... :)
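
The two-processor effect shows up in the piped numbers themselves: the "user" figure is larger than the "real" figure, which can only happen when CPU time is accumulated on more than one core at once. A small sketch to make that comparison explicit (the helper name `to_seconds` is made up for illustration; the figures are copied from the piped runs above):

```python
import re


def to_seconds(stamp):
    """Convert a `time`-style stamp such as '2m42.841s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised time stamp: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# (real, user) pairs from the three "bunzip2 -c ... | xml-search.pl" runs.
piped_runs = [
    ("2m42.841s", "4m8.804s"),
    ("2m49.320s", "4m10.772s"),
    ("2m54.216s", "4m11.724s"),
]

for real, user in piped_runs:
    # A ratio above 1 means more CPU time was used than wall-clock time passed.
    ratio = to_seconds(user) / to_seconds(real)
    print(f"real={real} user={user} user/real={ratio:.2f}")
```

For the single-process Python runs the same ratio stays just below 1, as expected for something running on one core.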
Well, this was a fun waste of time.. I believe the OP now has a solution either way.... ;)
/Stigmj