On Sun, August 2, 2009 11:57 am, Stig Meireles Johansen wrote:
> is about 10 times slower than just using four (much more readable) lines
> of code:
> (..snip..)
> That may be, but when I tried your code on
> http://download.wikimedia.org/nowiki/20090729/nowiki-20090729-pages-articles...
> (after unpacking, of course) I got this:
>
>   Traceback (most recent call last):
>     File "search.py", line 5, in <module>
>       print page.title
>   UnicodeEncodeError: 'ascii' codec can't encode character u'\xe6' in position 1: ordinal not in range(128)
Yes, it breaks. To mimic the behaviour of your script (which blindly ignores the encoding, and therefore happens to work), use page.title.encode('utf-8') instead; that should work fine.
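A minimal illustration of the failure and the fix (just a sketch with a made-up title; under Python 2, stdout falls back to the ASCII codec when you redirect output to a file):

title = u'\xe6ble'             # a title starting with a non-ASCII character (ae ligature)
# print title                  # raises UnicodeEncodeError under the ASCII default codec
print title.encode('utf-8')    # encoding explicitly to UTF-8 prints fine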
Additionally, xmlreader supports reading the bzip2-ed xml directly (which is probably faster than unpacking first and then parsing, and possibly even faster than parsing the plain xml, depending on processor speed and disk speed):
import xmlreader

for page in xmlreader.XmlDump('/home/valhallasw/download/nowiki-20090729-pages-articles.xml.bz2').parse():
    if '{|' in page.text:
        print page.title.encode('utf-8')
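For the curious, the underlying technique is easy to sketch with the standard library alone. This is not xmlreader's actual implementation, just the general idea of streaming the compressed dump without unpacking it first:

import bz2
from xml.etree import cElementTree

# Stream the compressed dump; iterparse reads incrementally, so memory
# use stays flat even on a full dump.
source = bz2.BZ2File('nowiki-20090729-pages-articles.xml.bz2')
for event, elem in cElementTree.iterparse(source):
    if elem.tag.endswith('}title'):      # tags carry the MediaWiki export namespace
        print elem.text.encode('utf-8')
    elem.clear()                         # free parsed elements as we go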
valhallasw@elladan:~/pywikipedia/trunk/pywikipedia$ python stig.py > results
valhallasw@elladan:~/pywikipedia/trunk/pywikipedia$ wc -l results
20890 results
(this count includes the one status line 'Reading XML dump...', so the result is the same as yours).
-Merlijn van Deen