Re: [pywikibot] Decoding strings issue in PWB

24 Jun 2017


Hi Dan,

On 23 June 2017 at 19:41, Dan dan15i@yahoo.com wrote:
> ...
> Hi.
> Do PWB has issues with decoding URL strings?

Nothing in your example suggests it does:

> ...
>     test1 = urllib.unquote(m)
>     test2 = urllib.unquote_plus(m)
>     test3 = m.decode('utf8')
>     test4 = m.encode('utf8')


These are all questions of what the Python built-in urllib module does. In
the case of Python 2, the behavior is a bit odd, and I think this is what
is causing your issue.
In your example, m = u'%c3%85', i.e., a unicode string with the text
"%c3%85". Urldecoding this should yield two *bytes*: the bytes C3 and 85,
i.e., the UTF-8 representation of Å.
However, Python 2 interprets u'%c3%85' as 'a unicode string with the
characters U+00C3 U+0085', i.e., the characters Ã and [unprintable]. There
is no clean way to fix the situation after we have ended up there.
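To make the difference concrete, here is a small sketch that runs on Python 3.
It is an illustration only: passing encoding='latin-1' to
urllib.parse.unquote is my way of imitating what Python 2's urllib.unquote
does with unicode input, not the Python 2 code itself.

```python
from urllib.parse import unquote, unquote_to_bytes

m = '%c3%85'

# What urldecoding should give: the two bytes C3 85.
raw = unquote_to_bytes(m)

# Decoding those bytes as UTF-8 yields the intended character, U+00C5.
correct = raw.decode('utf-8')

# Imitation of the Python 2 result: each byte becomes its own code point,
# giving U+00C3 followed by U+0085.
mangled = unquote(m, encoding='latin-1')
```

Here correct is a one-character string, while mangled has two characters.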
Now -- how to solve this?
- The most obvious solution is 'Use Python 3', where the unquote function
correctly processes the string.
- Another option is to turn your URL into a bytestring first, i.e., m =
m.encode('utf-8'), then call unquote, then decode the string again.
- As you already have a dependency on pywikibot, the last option is to use
pywikibot.page.url2unicode, which works correctly even on Python 2.
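A minimal sketch of the second option (encode, unquote, decode), written so
that it should behave the same on Python 2 and Python 3. The helper name
unquote_utf8 is made up for this example, and it assumes the URL is
percent-encoded UTF-8.

```python
try:
    # Python 2: unquote on a byte string returns a byte string.
    from urllib import unquote as _unquote_bytes
except ImportError:
    # Python 3: unquote_to_bytes accepts bytes and returns bytes.
    from urllib.parse import unquote_to_bytes as _unquote_bytes


def unquote_utf8(text):
    """Urldecode a unicode string whose escapes are UTF-8 bytes."""
    raw = _unquote_bytes(text.encode('utf-8'))  # work on bytes only
    return raw.decode('utf-8')                  # back to unicode


title = unquote_utf8(u'%c3%85')
```

With the value from your example, title should come out as the single
character Å.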
Best,
Merlijn
