Thank you very much for your answer, Merlijn. I don't plan to switch to Python
3 in the near future :)
Your second and third solutions work fine. I'm sticking with the third solution for
now. Thanks again for your help.
Regards,
Dan
Saturday, 24 June 2017 18:43:46, Merlijn van Deen (valhallasw)
On 23 June 2017 at 19:41, Dan <dan15i(a)yahoo.com> wrote:
Does PWB have issues with decoding URL strings?
Nothing in your example suggests it does:
test1 = urllib.unquote(m)
test2 = urllib.unquote_plus(m)
test3 = m.decode('utf8')
test4 = m.encode('utf8')
These are all questions of what the Python built-in urllib module does. In the case of
Python 2, the behavior is a bit odd, and I think this is what is causing your issue.
In your example, m = u'%c3%85', i.e., a unicode string with the text
"%C3%85". Urldecoding this should yield two bytes: the bytes C3 and 85, i.e., the
UTF-8 representation of Å.
However, Python 2 instead interprets u'%c3%85' to mean 'a unicode
string with characters U+00C3 U+0085', i.e., the characters Ã and [unprintable]. There
is no clean way to fix the situation after we have ended up there.
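To make the difference concrete, here is a small sketch of the two interpretations. It does not call urllib at all: binascii.unhexlify just produces the two bytes named by %C3 and %85, and the latin-1 decode is my stand-in for the per-byte interpretation described above, so treat it as an illustration rather than the exact urllib code path. The variable names are made up for the example.

```python
import binascii

# The two bytes that %C3%85 stands for.
raw = binascii.unhexlify(b'c385')

# Intended reading: the two bytes together are the UTF-8 encoding of one character.
as_utf8 = raw.decode('utf-8')       # one character, U+00C5

# Per-byte reading: each byte becomes its own character.
as_latin1 = raw.decode('latin-1')   # two characters, U+00C3 and U+0085

assert len(as_utf8) == 1
assert len(as_latin1) == 2
```

The snippet should run unchanged on Python 2 and Python 3.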
Now -- how to solve this?
- The most obvious solution is 'Use Python 3', where the unquote function
  correctly processes the string.
- Another option is to turn your URL into a bytestring first, i.e.,
  m = m.encode('utf-8'), then call unquote, then decode the result
  again.
- As you already have a dependency on pywikibot, the last option is to use
  pywikibot.page.url2unicode, which works correctly, even on Python 2.
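
A minimal sketch of the second option (encode, unquote, decode). The helper name unquote_utf8 is hypothetical, and the sketch assumes the percent-escapes are UTF-8 and that the input is a text string. The import fallback is only there so that the same snippet also runs on Python 3, where urllib.parse.unquote_to_bytes plays the bytes-in, bytes-out role.

```python
try:
    from urllib import unquote  # Python 2: a byte string in gives a byte string out
except ImportError:
    # Python 3: unquote_to_bytes returns the raw percent-decoded bytes
    from urllib.parse import unquote_to_bytes as unquote


def unquote_utf8(text):
    """Percent-decode a text string, assuming the escapes encode UTF-8 bytes.

    Hypothetical helper for illustration; not part of urllib or pywikibot.
    """
    raw = text.encode('utf-8')    # text -> bytes, so unquote works byte-wise
    decoded = unquote(raw)        # %c3%85 -> the two bytes C3 85
    return decoded.decode('utf-8')  # bytes -> text


m = u'%c3%85'
result = unquote_utf8(m)
assert result == u'\xc5'  # U+00C5, i.e. the single character from the example
```

If the escapes are not valid UTF-8, the final decode raises UnicodeDecodeError, so wrap the call accordingly if your input is untrusted.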