The reason we encode our strings/unicode objects to the site's encoding is obvious: we don't trust ourselves to remember to encode each string properly before each put, and failing to encode a string properly would result in a garbage write. It has always worked this way; my recent changes did not introduce that behavior.
I introduced the unicode check because a user did not understand why a UnicodeDecodeError was triggered by the put in:

    text = open('file_in_utf8_with_non-ascii_chars').read()
    page.put(text)

Throwing a UnicodeDecodeError is not user-friendly at all. Hopefully, most of the readers here understand that text should be decoded before being passed to put; but why not throw a nice error instead of this UnicodeDecodeError, forcing the average user to dig into our code to understand what's happening?
That's the reason for that change.
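To make the pitfall concrete, here is a minimal Python 3 sketch of the same situation. The `put` function below is a hypothetical stand-in for `page.put`, not the real implementation: it just shows why handing raw bytes to text-oriented code fails, and why decoding first fixes it.

```python
def put(text):
    # Hypothetical stand-in for page.put: the library ultimately encodes
    # to the site's encoding, and only real text can be encoded reliably.
    if not isinstance(text, str):
        raise TypeError("put() expects decoded text, got %r" % type(text))
    return text.encode("utf-8")

# Bytes as read from a UTF-8 file opened in binary mode.
raw = "caf\u00e9".encode("utf-8")

try:
    put(raw)                      # wrong: raw bytes, not decoded text
except TypeError as err:
    print("friendly error:", err)

put(raw.decode("utf-8"))          # right: decode first, then put
```

The point of the check is exactly this: turn the confusing traceback into an explicit, early error message.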
    my_string = received_string.decode("ucs-2")  # my_string is now a unicode string
    mypage.put(my_string)  # works
Not always: it triggers a UnicodeDecodeError if the unicode string cannot be converted to the MediaWiki site's encoding.
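A short sketch of that failure mode, with an assumed site encoding for illustration: even properly decoded text can fail at encode time when the site's charset cannot represent it. (Under Python 3 this surfaces as a UnicodeEncodeError; the UnicodeDecodeError seen in the thread comes from Python 2's implicit ascii-decode step.)

```python
# Assumed site encoding for this example only.
site_encoding = "iso-8859-1"

# Perfectly valid text, but Cyrillic is outside Latin-1's repertoire.
text = "\u0421\u0430\u0445\u0430"  # "Saha" in Cyrillic

try:
    text.encode(site_encoding)
except UnicodeEncodeError as err:
    print("cannot encode to %s: %s" % (site_encoding, err))
```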
Why silently assume that all strings provided by the script author are in MediaWiki site encoding?
We're not assuming this; we've always encoded the strings to the MediaWiki site encoding... Oo
Now look at revisions 5805-5806, and tell me what you think of the implemented behavior:
* accept only plain strings we can reliably decode (i.e. ASCII strings)
* accept unicode strings
* catch UnicodeDecodeErrors if the unicode string can't be encoded to the site's encoding, and warn the user in a friendly fashion
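The three rules above can be sketched as follows. This is a hedged illustration, not the code from r5805-5806; `safe_put` and its error messages are hypothetical names, and the site encoding is a parameter only for the sake of the example.

```python
def safe_put(text, site_encoding="utf-8"):
    # Rule 1: accept plain byte strings only when they are unambiguous,
    # i.e. pure ASCII; anything else must be decoded by the caller.
    if isinstance(text, bytes):
        try:
            text = text.decode("ascii")
        except UnicodeDecodeError:
            raise ValueError(
                "put() received non-ASCII bytes; decode them to text "
                "with the correct codec before calling put()")
    # Rule 2: real text is always accepted.
    # Rule 3: turn an encoding failure into a friendly, explicit error
    # instead of letting the raw traceback reach the user.
    try:
        return text.encode(site_encoding)
    except UnicodeEncodeError as err:
        raise ValueError(
            "text cannot be encoded to the site's %s encoding: %s"
            % (site_encoding, err))
```

The design choice is to fail loudly at the API boundary, where the user can act on the message, rather than deep inside the library.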