https://bugzilla.wikimedia.org/show_bug.cgi?id=55145
Web browser: ---
Bug ID: 55145
Summary: weblinkchecker URL unicode problems
Product: Pywikibot
Version: unspecified
Hardware: All
OS: All
Status: NEW
Severity: normal
Priority: Unprioritized
Component: General
Assignee: Pywikipedia-bugs@lists.wikimedia.org
Reporter: legoktm.wikipedia@gmail.com
Classification: Unclassified
Mobile Platform: ---
Originally from: http://sourceforge.net/p/pywikipediabot/bugs/1613/
Reported by: valhallasw
Created on: 2013-04-13 19:55:05
Subject: weblinkchecker URL unicode problems
Original description:
As reported by Anima in https://sourceforge.net/tracker/?func=detail&aid=3602096&group%5C_id...
Weblinkchecker jumps through some strange unicode hoops. There is no such thing as a unicode URL - URLs are /always/ urlencoded UTF-8 strings, so:

>>> urllib.quote(u"ö".encode('utf-8'))
'%C3%B6'
anything else is *wrong*, including things like asking what encoding the web server uses: that is only relevant for decoding the page *text*.
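For readers on Python 3, the same point can be sketched as follows. This is only an illustration, not code from weblinkchecker: the helper name `quote_utf8` is made up for this example, and `urllib.parse.quote` stands in for the Python 2 `urllib.quote` call shown above.

```python
from urllib.parse import quote


def quote_utf8(text):
    """Percent-encode text as UTF-8 bytes (hypothetical helper).

    The web server's charset is deliberately not consulted: it only
    matters for decoding the page text, not for building the URL.
    """
    return quote(text.encode("utf-8"))


print(quote_utf8("ö"))  # %C3%B6
```

The encoding is fixed to UTF-8 up front, so no request to the server is needed before the URL can be built.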
Basic test case:

>>> import weblinkchecker
>>> lc = weblinkchecker.LinkChecker(u"http://svoya-igra.org/%D0%A0%D0%B0%D0%B9%D0%BA%D0%BE%D0%B2 Александр Вадимович/")
Contacting server svoya-igra.org to find out its default encoding...
Error retrieving server's default charset. Using ISO 8859-1.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "weblinkchecker.py", line 218, in __init__
    self.changeUrl(url)
  File "weblinkchecker.py", line 275, in changeUrl
    self.path = unicode(urllib.quote(self.path.encode(encoding)))
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 1-6: ordinal not in range(256)
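A rough Python 3 sketch of how the path could be quoted without asking the server for a charset is given below. It is not the actual fix: `normalize_url` is a hypothetical name, only the path component is handled (host names and query strings are left alone), and keeping `%` in the safe set is an assumption so that escapes already present in the URL are not encoded a second time (a stray literal `%` would therefore also pass through unchanged).

```python
from urllib.parse import quote, urlsplit, urlunsplit


def normalize_url(url):
    """Percent-encode the path of url as UTF-8 (hypothetical helper).

    Assumption: any '%' in the path starts an existing escape sequence,
    so '%' is kept as a safe character to avoid double-encoding.
    """
    parts = urlsplit(url)
    # quote() encodes str input as UTF-8 by default.
    path = quote(parts.path, safe="/%")
    return urlunsplit(parts._replace(path=path))


# Mixed input: one already-escaped character, a space, one raw character.
print(normalize_url("http://example.org/%C3%B6 ö/"))
# http://example.org/%C3%B6%20%C3%B6/

# The failure in the traceback above comes from encoding with latin-1:
try:
    "Александр".encode("latin-1")
except UnicodeEncodeError as err:
    print("latin-1 cannot represent this text:", err.reason)
```

The UTF-8 path never raises for valid text, whereas any fallback to ISO 8859-1 fails as soon as the URL contains characters outside that charset.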
valhallasw@lisilwen:~/src/pywikipedia/trunk/pywikipedia$ python version.py
Pywikipedia [svn+ssh] valhallasw@trunk/pywikipedia (r11368, 2013/04/13, 08:16:45, ok)
Python 2.7.3 (default, Aug 1 2012, 05:14:39)
[GCC 4.6.3]
config-settings:
use_api = True
use_api_login = True
unicode test: ok
Kunal Mehta (Legoktm) legoktm.wikipedia@gmail.com changed:
What     |Removed |Added
----------------------------------------------------------------------------
See Also |        |https://sourceforge.net/p/pywikipediabot/bugs/1613
Ricordisamoa ricordisamoa@openmailbox.org changed:
What      |Removed |Added
----------------------------------------------------------------------------
CC        |        |ricordisamoa@openmailbox.org
Component |General |weblinkchecker.py
--- Comment #1 from Ricordisamoa ricordisamoa@openmailbox.org ---
*** Bug 55318 has been marked as a duplicate of this bug. ***
pywikipedia-bugs@lists.wikimedia.org