Hello !
I ran into a strange problem: a Unicode bug in a custom script, occurring in a call to BeautifulSoup.py (see below for code and trace). I couldn't work out why, since according to the BeautifulSoup documentation, BS fully supports Unicode.
I upgraded my local BeautifulSoup to 3.0.5 and everything works now, since they fixed several encoding-related problems (see http://www.crummy.com/software/BeautifulSoup/CHANGELOG.html ).
I remember getting strange errors with the same trace before, and the only way to avoid them was to skip the page. pywikipedia should definitely upgrade to 3.0.5 :þ
The trace was:
  File "test2.py", line 169, in <module>
    main()
  File "test2.py", line 165, in main
    bot.run()
  File "test2.py", line 116, in run
    parseOnlyThese=SoupStrainer("title"))
  File "/home/nico/projets/pywikipedia/BeautifulSoup.py", line 1282, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/home/nico/projets/pywikipedia/BeautifulSoup.py", line 946, in __init__
    self._feed()
  File "/home/nico/projets/pywikipedia/BeautifulSoup.py", line 971, in _feed
    SGMLParser.feed(self, markup)
  File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.5/sgmllib.py", line 291, in parse_starttag
    self.finish_starttag(tag, attrs)
  File "/usr/lib/python2.5/sgmllib.py", line 340, in finish_starttag
    self.handle_starttag(tag, method, attrs)
  File "/usr/lib/python2.5/sgmllib.py", line 376, in handle_starttag
    method(attrs)
  File "/home/nico/projets/pywikipedia/BeautifulSoup.py", line 1372, in start_meta
    self._feed(self.declaredHTMLEncoding)
  File "/home/nico/projets/pywikipedia/BeautifulSoup.py", line 971, in _feed
    SGMLParser.feed(self, markup)
  File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.5/sgmllib.py", line 285, in parse_starttag
    self._convert_ref, attrvalue)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
It was caused by the last line of the code below, whose aim was to get the HTML title of a page (using BeautifulSoup just for this is probably not the most efficient way to do it):
try:
    f = urllib2.urlopen(httplink)
except urllib2.HTTPError, e:
    #... handler
contentType = f.info().getheader('Content-Type')
if not re.compile('text/html').search(contentType):
    #...
soup = BeautifulSoup(f.read(),
                     convertEntities=BeautifulSoup.HTML_ENTITIES,
                     parseOnlyThese=SoupStrainer("title"))
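For what it's worth, here is a rough sketch of getting the title without BeautifulSoup at all, by decoding the raw bytes explicitly before searching them, so that nothing ends up being decoded implicitly as ASCII. The helper name extract_title, the two regexes and the latin-1 fallback are my own assumptions for illustration; none of this comes from BeautifulSoup or pywikipedia. Regex matching on HTML is fragile, and unlike convertEntities this does not convert entities.

```python
import re

# Hypothetical helper for illustration; not part of BeautifulSoup or pywikipedia.
_CHARSET_RE = re.compile(r"charset=[\"']?([\w.:-]+)", re.I)
_TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.I | re.S)


def extract_title(raw, content_type=None, fallback="latin-1"):
    """Return the <title> text of an HTML byte string as unicode, or None."""
    # Use the charset declared in the Content-Type header when there is one.
    match = _CHARSET_RE.search(content_type or "")
    charset = match.group(1) if match else fallback
    try:
        # Decode explicitly; undecodable bytes are replaced instead of raising.
        text = raw.decode(charset, "replace")
    except LookupError:
        # The header named a codec Python does not know: use the fallback.
        text = raw.decode(fallback, "replace")
    found = _TITLE_RE.search(text)
    if found is None:
        return None
    # Collapse runs of whitespace (this also drops non-breaking spaces).
    return " ".join(found.group(1).split())
```

In the script above this would be called as something like extract_title(f.read(), contentType). The latin-1 fallback is only a guess for pages that declare no charset: it never fails to decode, but it can give wrong characters for pages that are actually in another encoding.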
Thanks !
pywikipedia-l@lists.wikimedia.org