Hi. Does PWB have issues with decoding URL strings?
Try this script:

from __future__ import absolute_import, unicode_literals

import re, urllib
import pywikibot

mylist = \
[
    u"Åge Hovengen",
    u"Åge Konradsen",
    u"Åge Ramberg",
]
for a in mylist:
    ssite = pywikibot.getSite("en")
    spage = pywikibot.Page(ssite, a)
    text = spage.get()
    m0 = re.search(ur"{{\s*Stortingetbio\s*|\s*(?:id=)?\s*([^\s}|]+)\s*[|}]", text, flags=re.IGNORECASE)
    if m0:
        m = m0.group(1)
        test1 = urllib.unquote(m)
        test2 = urllib.unquote_plus(m)
        test3 = m.decode('utf8')
        test4 = m.encode('utf8')
        pywikibot.output(test1)
        pywikibot.output(test2)
        pywikibot.output(test3)
        pywikibot.output(test4)
For me it doesn't decode %c3%85 to Å, while on http://repl.it/Izdw/2 you can see that pure Python can decode that sequence with urllib.unquote and urllib.unquote_plus. Is this a PWB bug or what?
Hi Dan,
On 23 June 2017 at 19:41, Dan dan15i@yahoo.com wrote:
Hi. Does PWB have issues with decoding URL strings?
Nothing in your example suggests it does:
test1 = urllib.unquote(m)
test2 = urllib.unquote_plus(m)
test3 = m.decode('utf8')
test4 = m.encode('utf8')
These are all questions of what the Python built-in urllib module does. In the case of Python 2, the behavior is a bit odd, and I think this is what is causing your issue.
In your example, m = u'%c3%85', i.e., a unicode string with the text "%c3%85". URL-decoding this should yield two *bytes*: C3 and 85, i.e., the UTF-8 representation of Å.
However, what Python 2 does is it interprets u'%c3%85' to mean 'a unicode string with characters U+00C3 U+0085', i.e., the characters à and [unprintable]. There is no clean way to fix the situation after we have ended up there.
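To make the difference concrete, here is a small sketch written for Python 3. The helper unquote_per_code_point is a hypothetical name, not part of urllib; it only imitates the "one code point per %XX escape" interpretation described above so it can be compared with the byte-based one.

```python
import re
from urllib.parse import unquote_to_bytes

quoted = "%c3%85"

# Byte-based interpretation: each %XX escape is one byte,
# and the resulting byte sequence is decoded as UTF-8.
raw = unquote_to_bytes(quoted)
correct = raw.decode("utf-8")


def unquote_per_code_point(text):
    """Hypothetical helper: turn each %XX escape into one code point.

    This imitates what happens when the escapes are not treated as bytes.
    """
    return re.sub(
        r"%([0-9a-fA-F]{2})",
        lambda match: chr(int(match.group(1), 16)),
        text,
    )


mangled = unquote_per_code_point(quoted)

print(repr(raw))      # the two bytes C3 and 85
print(repr(correct))  # one character, U+00C5
print(repr(mangled))  # two characters, U+00C3 and U+0085
```

The byte-based route ends with a single character, while the per-code-point route ends with two unrelated ones, which is the output you are seeing.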
Now -- how to solve this?
- The most obvious solution is 'Use Python 3', where the unquote function correctly processes the string.
- Another option is to turn your URL into a bytestring first, i.e., m = m.encode('utf-8'), then call unquote, then decode the string again.
- As you already have a dependency on pywikibot, the last option is to use the pywikibot.page.url2unicode, which works correctly, even on Python 2.
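The second option could look roughly like the sketch below. The name unquote_utf8 is made up for this example, and the version check is only there so the same file runs on Python 2 and Python 3; it assumes the escaped value is UTF-8, as it is in your case.

```python
import sys

if sys.version_info[0] >= 3:
    # Python 3: unquote_to_bytes takes bytes (or str) and returns bytes.
    from urllib.parse import unquote_to_bytes
else:
    # Python 2: unquote returns a byte string when given a byte string.
    from urllib import unquote as unquote_to_bytes


def unquote_utf8(text):
    """Hypothetical helper: encode, unquote as bytes, then decode as UTF-8."""
    raw = unquote_to_bytes(text.encode("utf-8"))
    return raw.decode("utf-8")


print(unquote_utf8(u"%c3%85ge_Hovengen"))
```

In your loop that would mean calling unquote_utf8(m) instead of urllib.unquote(m). It raises UnicodeDecodeError if the escaped bytes are not valid UTF-8, so a template value in another encoding would need separate handling.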
Best, Merlijn