[Pywikipedia-l] BeautifulSoup 3.0.5 is out !

Nicolas Dumazet nicdumz at gmail.com
Sat Dec 29 14:01:44 UTC 2007


Hello !

I hit a strange Unicode bug in a custom script, raised inside a call
to BeautifulSoup.py (see below for the code and the trace). I could
not work out why: according to the BeautifulSoup documentation, BS
fully supports Unicode!?

I upgraded my local BeautifulSoup to 3.0.5 and everything works now,
since they fixed several encoding-related problems (see
http://www.crummy.com/software/BeautifulSoup/CHANGELOG.html ).

I remember seeing strange errors with this same trace before, and the
only way to avoid them was to skip the page. pywikipedia should
definitely upgrade to 3.0.5 :þ

Trace was :

  File "test2.py", line 169, in <module>
    main()
  File "test2.py", line 165, in main
    bot.run()
  File "test2.py", line 116, in run
    parseOnlyThese=SoupStrainer("title"))
  File "/home/nico/projets/pywikipedia/BeautifulSoup.py", line 1282, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/home/nico/projets/pywikipedia/BeautifulSoup.py", line 946, in __init__
    self._feed()
  File "/home/nico/projets/pywikipedia/BeautifulSoup.py", line 971, in _feed
    SGMLParser.feed(self, markup)
  File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.5/sgmllib.py", line 291, in parse_starttag
    self.finish_starttag(tag, attrs)
  File "/usr/lib/python2.5/sgmllib.py", line 340, in finish_starttag
    self.handle_starttag(tag, method, attrs)
  File "/usr/lib/python2.5/sgmllib.py", line 376, in handle_starttag
    method(attrs)
  File "/home/nico/projets/pywikipedia/BeautifulSoup.py", line 1372, in start_meta
    self._feed(self.declaredHTMLEncoding)
  File "/home/nico/projets/pywikipedia/BeautifulSoup.py", line 971, in _feed
    SGMLParser.feed(self, markup)
  File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.5/sgmllib.py", line 285, in parse_starttag
    self._convert_ref, attrvalue)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)

It was caused by the last line of the code below, whose aim was to get
the HTML title of a page (using BeautifulSoup just for this is probably
not the most efficient way to do it?!):

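import re
import urllib2
from BeautifulSoup import BeautifulSoup, SoupStrainer

try:
    f = urllib2.urlopen(httplink)
except urllib2.HTTPError, e:
    #... handler
    raise  # placeholder so the snippet parses; the real handler is elided
contentType = f.info().getheader('Content-Type')
if not re.compile('text/html').search(contentType):
    #...
    pass  # placeholder; the original branch is elided
# Parse only the <title> tag, converting HTML entities to Unicode.
soup = BeautifulSoup(f.read(),
                     convertEntities=BeautifulSoup.HTML_ENTITIES,
                     parseOnlyThese=SoupStrainer("title"))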
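
For reference, the error message itself can be reproduced without
BeautifulSoup. This is only a minimal sketch, written for Python 3
rather than the Python 2.5 shown in the trace. The sample bytes, the
helper name try_ascii and the latin-1 guess are assumptions for
illustration, not taken from the page that failed. Byte 0xa0 is not
valid ASCII, whereas in latin-1 it is a non-breaking space, so decoding
the raw bytes with an explicit encoding avoids the implicit ASCII step.

```python
# Hypothetical sample: markup whose first byte is 0xa0, as in the trace.
raw = b"\xa0Example title"


def try_ascii(data):
    """Return the message of the ASCII decode error, or None on success."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError as exc:
        return str(exc)
    return None


# Decoding as ASCII fails on the very first byte.
message = try_ascii(raw)
print(message)

# Assumption: the page was really latin-1 (ISO-8859-1) encoded.
# With an explicit encoding the same bytes decode cleanly.
text = raw.decode("latin-1")
print(repr(text))
```

The printed message should match the last line of the trace. Which
encoding a real page uses has to come from its headers or its meta
tag; latin-1 here is just a guess for the sake of the example.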

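On the question of whether BeautifulSoup is the most efficient way to
get only the title: one lighter option might be the standard library
parser. The following is a rough sketch under assumptions, written for
Python 3 (html.parser), not what pywikipedia or the script above
actually does. The names TitleParser and extract_title are hypothetical,
and the input is assumed to be text that has already been decoded.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text of the first <title> element only."""

    def __init__(self):
        HTMLParser.__init__(self, convert_charrefs=True)
        self.in_title = False   # currently inside <title>...</title>
        self.done = False       # first title already seen
        self.parts = []         # text chunks found inside the title

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(html_text):
    """Return the stripped title text, or None if no title was found."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts).strip() or None


sample = "<html><head><title> Hello world </title></head><body><p>x</p></body></html>"
print(extract_title(sample))
```

This does none of the encoding detection BeautifulSoup attempts, so the
caller has to decode the bytes first. Whether it is actually faster on
real pages has not been measured here.
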


Thanks !

-- 
Nicolas Dumazet,
Deuxième année ENSIMAG,
VP Comm' Ext' du Cercle des élèves de l'INPG.
06 03 88 92 29