[Pywikipedia-l] BeautifulSoup 3.0.5 is out !
Nicolas Dumazet
nicdumz at gmail.com
Sat Dec 29 14:01:44 UTC 2007
Hello!
I ran into a strange Unicode bug in a custom script, raised inside a
call to BeautifulSoup.py (see below for code and trace). I couldn't
work out why: according to the BeautifulSoup documentation, BS fully
supports Unicode!?
I upgraded my local BeautifulSoup to 3.0.5 and everything works fine
now, since that release fixed several encoding-related problems (see
http://www.crummy.com/software/BeautifulSoup/CHANGELOG.html ).
I remember getting strange errors with the same trace before, and the
only way to avoid them was to skip the page. pywikipedia should
definitely upgrade to 3.0.5 :þ
The trace was:
  File "test2.py", line 169, in <module>
    main()
  File "test2.py", line 165, in main
    bot.run()
  File "test2.py", line 116, in run
    parseOnlyThese=SoupStrainer("title"))
  File "/home/nico/projets/pywikipedia/BeautifulSoup.py", line 1282, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/home/nico/projets/pywikipedia/BeautifulSoup.py", line 946, in __init__
    self._feed()
  File "/home/nico/projets/pywikipedia/BeautifulSoup.py", line 971, in _feed
    SGMLParser.feed(self, markup)
  File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.5/sgmllib.py", line 291, in parse_starttag
    self.finish_starttag(tag, attrs)
  File "/usr/lib/python2.5/sgmllib.py", line 340, in finish_starttag
    self.handle_starttag(tag, method, attrs)
  File "/usr/lib/python2.5/sgmllib.py", line 376, in handle_starttag
    method(attrs)
  File "/home/nico/projets/pywikipedia/BeautifulSoup.py", line 1372, in start_meta
    self._feed(self.declaredHTMLEncoding)
  File "/home/nico/projets/pywikipedia/BeautifulSoup.py", line 971, in _feed
    SGMLParser.feed(self, markup)
  File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.5/sgmllib.py", line 285, in parse_starttag
    self._convert_ref, attrvalue)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
It was caused by the last statement of the code below, whose aim was
to get the HTML title of a page (chances are that using BeautifulSoup
just for this is not the most efficient way to do it?!):
    try:
        f = urllib2.urlopen(httplink)
    except urllib2.HTTPError, e:
        #... handler
    contentType = f.info().getheader('Content-Type')
    if not re.compile('text/html').search(contentType):
        #...
    soup = BeautifulSoup(f.read(),
                         convertEntities=BeautifulSoup.HTML_ENTITIES,
                         parseOnlyThese=SoupStrainer("title"))
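
On the efficiency aside: here is a minimal sketch of getting only the
title without BeautifulSoup. It is written for Python 3, not the
Python 2.5 shown in the trace. It decodes the raw bytes explicitly,
using the charset from the Content-Type header, before looking at the
markup, so a byte such as 0xa0 never goes through an implicit ASCII
decode. The names extract_title and charset_from_content_type are
hypothetical helpers, not part of BeautifulSoup or pywikipedia. The
regular expression is an assumption that only suits simple,
well-formed <title> elements.

```python
import html
import re

# Assumption: the title is a plain <title>...</title> pair; real-world
# tag soup may need a proper parser instead of a regular expression.
TITLE_RE = re.compile(r"<title\b[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
CHARSET_RE = re.compile(r'charset="?([\w.:-]+)', re.IGNORECASE)


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type header, or the default."""
    match = CHARSET_RE.search(content_type or "")
    return match.group(1) if match else default


def extract_title(raw_bytes, content_type=None):
    """Decode the page explicitly, then return the <title> text or None."""
    charset = charset_from_content_type(content_type)
    try:
        text = raw_bytes.decode(charset, errors="replace")
    except LookupError:
        # Unknown charset label in the header: fall back to UTF-8.
        text = raw_bytes.decode("utf-8", errors="replace")
    match = TITLE_RE.search(text)
    if match is None:
        return None
    # Turn entities such as &nbsp; into characters, as convertEntities did.
    return html.unescape(match.group(1)).strip()
```

With the code above, this would be called on the bytes returned by
f.read() together with the contentType header value.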
Thanks!
--
Nicolas Dumazet,
Deuxième année ENSIMAG,
VP Comm' Ext' du Cercle des élèves de l'INPG.
06 03 88 92 29