On Sun, August 2, 2009 11:57 am, Stig Meireles Johansen wrote:
> is about 10 times slower than just using four (much more readable) lines
> of code:
> (..snip..)
> That may be, but when I tried your code on
> http://download.wikimedia.org/nowiki/20090729/nowiki-20090729-pages-articles...
> (after unpacking, of course) I got this:
>
>   Traceback (most recent call last):
>     File "search.py", line 5, in <module>
>       print page.title
>   UnicodeEncodeError: 'ascii' codec can't encode character u'\xe6' in position 1: ordinal not in range(128)
Yes, it breaks. To mimic the behaviour of your script (which blindly ignores the encoding, and therefore happens to work), use page.title.encode('utf-8') instead; that should work fine.
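A minimal illustration of the failure and the fix (just a sketch with a made-up title; under Python 2, stdout falls back to the ASCII codec when you redirect output to a file):

title = u'\xe6ble'             # a title starting with a non-ASCII character (ae ligature)
# print title                  # raises UnicodeEncodeError under the ASCII default codec
print title.encode('utf-8')    # encoding explicitly to UTF-8 prints fine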
Additionally, xmlreader supports reading the bzip2-ed xml directly (which is probably faster than unpacking first and then parsing, and possibly even faster than parsing the plain xml, depending on processor speed and disk speed):
import xmlreader

for page in xmlreader.XmlDump('/home/valhallasw/download/nowiki-20090729-pages-articles.xml.bz2').parse():
    if '{|' in page.text:
        print page.title.encode('utf-8')
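For the curious, the underlying technique is easy to sketch with the standard library alone. This is not xmlreader's actual implementation, just the general idea of streaming the compressed dump without unpacking it first:

import bz2
from xml.etree import cElementTree

# Stream the compressed dump; iterparse reads incrementally, so memory
# use stays flat even on a full dump.
source = bz2.BZ2File('nowiki-20090729-pages-articles.xml.bz2')
for event, elem in cElementTree.iterparse(source):
    if elem.tag.endswith('}title'):      # tags carry the MediaWiki export namespace
        print elem.text.encode('utf-8')
    elem.clear()                         # free parsed elements as we go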
valhallasw@elladan:~/pywikipedia/trunk/pywikipedia$ python stig.py > results
valhallasw@elladan:~/pywikipedia/trunk/pywikipedia$ wc -l results
20890 results
(this count includes the one status line 'Reading XML dump...', so the result is the same as yours).
-Merlijn van Deen