https://bugzilla.wikimedia.org/show_bug.cgi?id=67410
Bug ID: 67410
Summary: UnicodeDecodeError in reflinks.py
Product: Pywikibot
Version: core (2.0)
Hardware: All
OS: All
Status: NEW
Severity: normal
Priority: Unprioritized
Component: General
Assignee: Pywikipedia-bugs@lists.wikimedia.org
Reporter: l.rabinelli+bugzilla@gmail.com
Web browser: ---
Mobile Platform: ---
I received an error during execution of the script:
python pwb.py reflinks.py "-xml:itwiki-20140612-pages-meta-current.xml.bz2"
Traceback (most recent call last):
  File "pwb.py", line 153, in <module>
    run_python_file(fn, argv, argvu)
  File "pwb.py", line 67, in run_python_file
    exec(compile(source, filename, "exec"), main_mod.__dict__)
  File "scripts/reflinks.py", line 824, in <module>
    main()
  File "scripts/reflinks.py", line 821, in main
    bot.run()
  File "scripts/reflinks.py", line 691, in run
    ref.transform()
  File "scripts/reflinks.py", line 236, in transform
    self.title = pywikibot.html2unicode(self.title)
  File "/data/project/betabot/core/pywikibot/page.py", line 3632, in html2unicode
    result += text
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 9: ordinal not in range(128)
Dropped throttle(s).
<type 'exceptions.UnicodeDecodeError'>
My version is:
Pywikibot: [https] r-pywikibot-core.git (e563873, g3466, 2014/07/02, 08:29:01, ok)
Release version: 2.0b1
Python: 2.7.3 (default, Feb 27 2014, 19:58:35) [GCC 4.6.3]
unicode test: ok
xqt info@gno.de changed:
What    |Removed |Added
----------------------------------------------------------------------------
CC      |        |info@gno.de
--- Comment #1 from xqt info@gno.de --- Do you have any hints for the page title?
--- Comment #2 from Beta16 l.rabinelli+bugzilla@gmail.com --- Sorry, the page is [[w:it:Dolomiti]]
--- Comment #3 from xqt info@gno.de --- Where is that sign in the title:
print unichr(0xe2)
â
Merlijn van Deen valhallasw@arctus.nl changed:
What    |Removed |Added
----------------------------------------------------------------------------
CC      |        |valhallasw@arctus.nl
--- Comment #4 from Merlijn van Deen valhallasw@arctus.nl ---
import pywikibot
pywikibot.html2unicode('\xe2')
The issue is reflinks feeding non-unicode (i.e. str) data to html2unicode. Maybe an issue with xml parsing, maybe an issue with the xml files provided by the WMF.
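To illustrate the failure mode: in Python 2, `result += text` with a unicode `result` and a str `text` decodes the str as ASCII behind the scenes. The sketch below mimics that step explicitly in Python 3; the helper name `implicit_ascii_concat` and the sample bytes are made up for illustration and are not pywikibot code.

```python
# Sample bytes: UTF-8 encoding of a string containing U+2019 (right single quote).
raw = "nell\u2019Enciclopedia".encode("utf-8")


def implicit_ascii_concat(result, text):
    """Mimic Python 2's `unicode + str`: the byte string is decoded as ASCII first."""
    if isinstance(text, bytes):
        text = text.decode("ascii")  # raises on any byte >= 0x80
    return result + text


message = None
try:
    implicit_ascii_concat("", raw)
except UnicodeDecodeError as err:
    # The first byte of the UTF-8 sequence for U+2019 is 0xe2.
    message = str(err)

print(message)
```

The byte reported in the error, 0xe2, is the lead byte of a three-byte UTF-8 sequence, which is consistent with undecoded UTF-8 reaching html2unicode.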
--- Comment #5 from xqt info@gno.de --- Then unicode2html(html2unicode(text)) should solve it.
--- Comment #6 from Beta16 l.rabinelli+bugzilla@gmail.com --- The web page that is causing the issue is: http://www.treccani.it/enciclopedia/mugo . It is probably an encoding mix-up between UTF-8 and ISO-8859-1/windows-1252, because with the correction proposed in comment #5 (if I understand correctly) "Mugo nell’Enciclopedia Treccani" becomes "Mugo nell’Enciclopedia Treccani"
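As a rough illustration of the kind of UTF-8 versus windows-1252 confusion suspected here (one plausible reading, not a confirmed diagnosis of what the script did), the sketch below encodes the title as UTF-8 and then misreads the bytes as cp1252. The variable names are illustrative only.

```python
title = "Mugo nell\u2019Enciclopedia Treccani"

# The page serves the title as UTF-8; U+2019 becomes the bytes E2 80 99.
utf8_bytes = title.encode("utf-8")

# Misreading those bytes as windows-1252 turns one character into three.
misread = utf8_bytes.decode("cp1252")

# The damage is reversible as long as nothing was dropped along the way.
repaired = misread.encode("cp1252").decode("utf-8")

print(repr(misread))
print(repaired == title)
```

The first character of the garbled sequence is U+00E2, which matches the byte 0xe2 from the traceback.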
--- Comment #7 from Gerrit Notification Bot gerritadmin@wikimedia.org --- Change 144969 had a related patch set uploaded by Beta16: reflinks.py - UnicodeDecodeError
https://gerrit.wikimedia.org/r/144969
Gerrit Notification Bot gerritadmin@wikimedia.org changed:
What    |Removed |Added
----------------------------------------------------------------------------
Status  |NEW     |PATCH_TO_REVIEW
John Mark Vandenberg jayvdb@gmail.com changed:
What    |Removed |Added
----------------------------------------------------------------------------
CC      |        |jayvdb@gmail.com
--- Comment #8 from John Mark Vandenberg jayvdb@gmail.com --- I checked the URL http://www.treccani.it/enciclopedia/mugo with iconv , and it is OK.
John Mark Vandenberg jayvdb@gmail.com changed:
What    |Removed         |Added
----------------------------------------------------------------------------
Status  |PATCH_TO_REVIEW |ASSIGNED
--- Comment #9 from John Mark Vandenberg jayvdb@gmail.com --- (In reply to Merlijn van Deen from comment #4)
import pywikibot
pywikibot.html2unicode('\xe2')
The issue is reflinks feeding non-unicode (i.e. str) data to html2unicode.
Stated another way: it is giving html2unicode a str that contains undecoded UTF-8.
In this case it is pretty normal HTML loading.
pywikibot.html2unicode(pywikibot.comms.http.request(site=None, uri='http://www.treccani.it/enciclopedia/mugo/'))
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/home/jvanden3/projects/pywiki/gerrit/pywikibot/page.py", line 3625, in html2unicode
    result += text[:match.start()]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 414: ordinal not in range(128)
This happens when it is trying to decode in the HTML.
Adding text=text.decode('utf-8') at the top solves this specific problem, because the HTML of that URL contains:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Of course it would need to be wrapped in a try: except: block.
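A minimal sketch of what such a guarded decode could look like, written for Python 3 where the str/unicode split is explicit as bytes/str. The helper name `to_text` and the Latin-1 fallback are assumptions for illustration, not what pywikibot does.

```python
def to_text(data, encoding="utf-8", fallback="latin-1"):
    """Return text for `data`, decoding only when it is still raw bytes.

    Hypothetical helper: tries `encoding` first and falls back to
    `fallback`, which maps every byte and therefore never raises.
    """
    if isinstance(data, str):
        return data  # already decoded, nothing to do
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return data.decode(fallback)


print(repr(to_text(b"nell\xe2\x80\x99Enciclopedia")))  # valid UTF-8
print(repr(to_text(b"\xe2")))                          # invalid UTF-8, fallback used
```

The fallback trades correctness for robustness: a page that is not UTF-8 still yields text, but possibly the wrong characters.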
Decoding 'text' is a reasonable thing to expect from a method called html2unicode. ;-( But how much of the HTML soup should html2unicode understand, especially when it lives in the 'page' module?
If we're going to support proper deciphering of HTML in pywikibot (or do we have something like this already?) rather than expect each script to do it, the functionality should be in weblib, and it wouldn't hurt to use an existing library to do the grunt work. However, the 'ignore' list capability in html2unicode, which is needed by cosmetic_changes.py, is unlikely to be part of existing libraries.
The immediate solution is for reflinks.py to decode('utf-8') the fetched page, before sending it to html2unicode, so I have -1'd the patch.
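The caller-side approach might be sketched as follows in Python 3: decode the fetched bytes first, using the charset declared in the page when one is present, and only then convert HTML entities. Here the standard library's `html.unescape` stands in for html2unicode, and `decode_page`, `page_title`, and the sample markup are hypothetical names and data, not the merged patch.

```python
import html
import re

# Looks for charset=... in the raw bytes, e.g. inside a <meta> tag.
CHARSET_RE = re.compile(rb'charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE)


def decode_page(raw, default="utf-8"):
    """Decode fetched bytes using the declared charset, else `default`."""
    match = CHARSET_RE.search(raw[:2048])
    encoding = match.group(1).decode("ascii") if match else default
    try:
        return raw.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # Unknown or wrong charset: keep going with replacement characters.
        return raw.decode(default, errors="replace")


def page_title(raw):
    """Extract the <title> text, decoding bytes before unescaping entities."""
    text = decode_page(raw)
    found = re.search(r"<title>(.*?)</title>", text, re.IGNORECASE | re.DOTALL)
    return html.unescape(found.group(1).strip()) if found else None


# Made-up sample page; only the meta tag is taken from the bug report.
sample = (
    '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />'
    "<title>Mugo nell\u2019Enciclopedia Treccani &amp; altro</title>"
).encode("utf-8")

print(page_title(sample))
```

Keeping the decode step in the caller leaves the entity converter free to assume it always receives text.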
--- Comment #10 from Gerrit Notification Bot gerritadmin@wikimedia.org --- Change 144969 merged by jenkins-bot: reflinks.py - UnicodeDecodeError
https://gerrit.wikimedia.org/r/144969
Amir Ladsgroup ladsgroup@gmail.com changed:
What    |Removed |Added
----------------------------------------------------------------------------
CC      |        |ladsgroup@gmail.com
--- Comment #11 from Amir Ladsgroup ladsgroup@gmail.com --- The patch is merged, so is this bug still valid? May I close it?
Beta16 l.rabinelli+bugzilla@gmail.com changed:
What       |Removed  |Added
----------------------------------------------------------------------------
Status     |ASSIGNED |RESOLVED
Resolution |---      |FIXED