https://bugzilla.wikimedia.org/show_bug.cgi?id=67410
John Mark Vandenberg jayvdb@gmail.com changed:
What            |Removed          |Added
----------------------------------------------------------------------------
Status          |PATCH_TO_REVIEW  |ASSIGNED
--- Comment #9 from John Mark Vandenberg jayvdb@gmail.com --- (In reply to Merlijn van Deen from comment #4)
import pywikibot
pywikibot.html2unicode('\xe2')
The issue is reflinks feeding non-unicode (i.e. str) data to html2unicode.
Stated another way: reflinks is giving html2unicode a str which contains undecoded UTF-8.
In this case it is pretty normal HTML loading.
pywikibot.html2unicode(pywikibot.comms.http.request(site=None, uri='http://www.treccani.it/enciclopedia/mugo/'))
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/home/jvanden3/projects/pywiki/gerrit/pywikibot/page.py", line 3625, in html2unicode
    result += text[:match.start()]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 414: ordinal not in range(128)
This happens while html2unicode is processing entities in the HTML.
Adding text=text.decode('utf-8') at the top solves this specific problem, because the HTML of that URL contains:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Of course it would need to be wrapped in a try/except block.
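A minimal sketch of that idea (not the actual patch; the helper name and the Latin-1 fallback are my assumptions):

```python
def decode_fetched_html(raw):
    """Decode fetched page bytes to unicode before entity processing.

    Hypothetical helper: try UTF-8 first (what the page's meta charset
    declares), and fall back to Latin-1, which never raises, if the
    declared encoding turns out to be wrong.
    """
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a codepoint, so this cannot fail.
        return raw.decode('latin-1')

decode_fetched_html(b'\xe2\x80\x93')  # valid UTF-8 for an en dash
decode_fetched_html(b'\xe9')          # invalid UTF-8, Latin-1 fallback
```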
Decoding 'text' is a reasonable expectation for a method called html2unicode ;-( but how much of the HTML soup should an 'html2unicode' understand, especially when it lives in the 'page' module?
If we're going to support proper deciphering of HTML in pywikibot (or do we have something like this already?), rather than expect each script to do it, the functionality should be in weblib, and it wouldn't hurt to use an existing library to do the grunt work. However, the 'ignore' list capability in html2unicode, which is needed by cosmetic_changes.py, is unlikely to be part of existing libraries.
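To illustrate the point: the stdlib can already do the entity-decoding grunt work, and the ignore-list behaviour could be layered on top. This is only a sketch under my own assumptions (the function name, the entity regex, and representing the ignore list as codepoints are all hypothetical, not pywikibot's API):

```python
import html
import re

def html2unicode_sketch(text, ignore=()):
    """Decode HTML entities, skipping codepoints listed in `ignore`.

    Hypothetical sketch: html.unescape does the decoding, while the
    ignore list (the feature cosmetic_changes.py needs, which plain
    html.unescape lacks) leaves selected entities untouched.
    """
    def repl(match):
        decoded = html.unescape(match.group(0))
        if decoded and ord(decoded[0]) in ignore:
            return match.group(0)  # keep the entity as written
        return decoded
    # Matches named (&amp;) and numeric (&#8211;) character references.
    return re.sub(r'&#?\w+;', repl, text)

html2unicode_sketch('&amp; &lt;')           # decodes both entities
html2unicode_sketch('&amp;', ignore=(38,))  # 38 = ord('&'), left alone
```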
The immediate solution is for reflinks.py to decode('utf-8') the fetched page, before sending it to html2unicode, so I have -1'd the patch.