Re: [Pywikipedia-l] SVN: [5802] trunk/pywikipedia/wikipedia.py

18 Aug 2008


2008/8/18 Marcin Cieslak saper@system.pl:
...
...
>> I introduced the unicode check because a user did not understand
>> why a UnicodeDecodeError was triggered by the put in:
>> text = open('file_in_utf8_with_non-ascii_chars').read()
>> page.put(text)
> Please educate your user about I/O with Python with Unicode characters :-)
> Adding .decode("filescharset") after read() should help. This is not
> a problem in pywikipedia, but in the user code.

Please read again what I wrote in my mail =] :
"Hopefully, most of the readers here understand that text should be
decoded first before passing it to put"
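
To make the "decode first" advice concrete, here is a minimal sketch. It is
written for Python 3, where bytes plays the role that the str bytestring plays
in the Python 2 code discussed in this thread. The helper name read_text and
the temporary demo file are assumptions made for illustration only; they are
not part of pywikipedia.

```python
import os
import tempfile


def read_text(path, charset="utf-8"):
    """Read a file as raw bytes, then decode it explicitly to text."""
    with open(path, "rb") as handle:
        raw = handle.read()
    # The caller states the file's charset instead of relying on a default.
    return raw.decode(charset)


# Build a small UTF-8 file with non-ascii characters for the demonstration.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(u"café 人物".encode("utf-8"))
    demo_path = tmp.name

text = read_text(demo_path)
os.remove(demo_path)
# At this point text is unicode and is what should be handed to page.put().
```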
2008/8/18 Merlijn van Deen valhallasw@arctus.nl:
...
On Mon, August 18, 2008 4:00 am, Nicolas Dumazet wrote:
...
>> Not always, it triggers a UnicodeDecodeError if the unicode string is
>> not convertible into the MediaWiki site's encoding.
> It doesn't. It triggers a UnicodeDecodeError if the string passed is
> non-unicode and cannot be converted using the default (Latin-1) encoding.

My bad, I meant UnicodeEncodeError of course:
Assume that the site encoding is, for instance, the Arabic charset
windows-1256, and try to put some Asian unicode text, for
example u'人物' :)
However, I was not __this__ wrong :p :
UnicodeDecodeError is also raised if the bytestring cannot be
converted into ascii, yes. (And not latin-1: 'é' can be a latin-1
bytestring, but not an ascii string.)
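
The two failure modes above can be reproduced with the standard codecs alone.
This is a small sketch in Python 3 terms (bytes standing in for the Python 2
bytestring); the helper names try_encode and try_decode are made up for the
demonstration.

```python
def try_encode(text, encoding):
    """Report whether unicode text can be encoded with the given charset."""
    try:
        text.encode(encoding)
        return "ok"
    except UnicodeEncodeError:
        return "UnicodeEncodeError"


def try_decode(raw, encoding):
    """Report whether a bytestring can be decoded with the given charset."""
    try:
        raw.decode(encoding)
        return "ok"
    except UnicodeDecodeError:
        return "UnicodeDecodeError"


# CJK characters have no representation in the Arabic windows-1256 charset.
encode_result = try_encode(u"人物", "windows-1256")

# The byte 0xE9 is 'é' in latin-1, but it is outside the ascii range.
ascii_result = try_decode(b"\xe9", "ascii")
latin1_result = try_decode(b"\xe9", "latin-1")
```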
...
> Your current implementation is wrong: it converts a bytestring (str) to
> unicode instead of to the site's encoding.

What? Where? :)
The arg.decode() line is intended as a test: "is arg an ascii
bytestring?". It does not alter arg...
Here is the idea, considering the line:
str.encode(self.site().encoding())
* If str is unicode, that line can fail with UnicodeEncodeError
(e.g. u'人物'.encode('latin-1')). I want to raise a user-friendly error
if that happens.
* If str is a bytestring, this line is exactly equivalent to
str.decode('ascii').encode(self.site().encoding()), or shorter,
str.decode().encode(self.site().encoding())
** str.decode() can fail with UnicodeDecodeError (e.g. 'é'.decode());
that's what is being tested on line 1290, "arg.decode()". I want to
raise another error here, different from the first one.
** encode() cannot fail with UnicodeEncodeError, since if we call it,
it means that decode succeeded, i.e. we are encoding ascii characters.
My implementation was wrong and confused, true. But is it such a bad
idea to throw nice errors that explain in one line what's happening?
I really don't like the idea of having our framework users forced to
dig into OUR code to understand what was wrong with THEIR code...
How is r5807?
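
For readers following along, the two cases from the list above could be
sketched roughly as below. This is not the r5807 change itself, only an
illustration in Python 3 terms (bytes standing in for the Python 2
bytestring). The function name encode_for_site, the error messages, and the
local PageNotSaved class are assumptions for the example.

```python
class PageNotSaved(Exception):
    """Stand-in for the framework's save error (defined locally here)."""


def encode_for_site(arg, site_encoding):
    """Return arg encoded for the site, with a distinct error per case."""
    if isinstance(arg, bytes):
        # Case 2: a bytestring is only acceptable if it is pure ascii.
        try:
            arg = arg.decode("ascii")
        except UnicodeDecodeError:
            raise PageNotSaved(
                "Bytestring contains non-ascii bytes; "
                "decode it to unicode before calling put.")
    # Case 1: unicode text may contain characters the site charset lacks.
    try:
        return arg.encode(site_encoding)
    except UnicodeEncodeError:
        raise PageNotSaved(
            "Text contains characters that cannot be encoded in %s."
            % site_encoding)
```

With this shape, the user sees which of the two mistakes was made without
having to read the framework code.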
...
> Secondly, throwing a
> PageNotSaved exception in a function that has nothing to do with a page
> save (just with a page) is a bit strange...

Well, in the same function, _putPage, we throw almost __only__ these
errors: PageNotSaved several times, and the derived SpamfilterError,
EditConflict and LongPageError. The only non-PageNotSaved one is
LockedPage.
Note that I really don't mind what you use as an Error, as long as the
message is self-explanatory :)
...
> the only good system is the old system. If needed, catch the
> UnicodeDecodeException and raise a PageNotSaved exception in the /save/
> function.

Not sure; to me _putPage is our only save function, called from both
put and async_put. Do you mean catching the errors above in put and
async_put?
If so, I would be quite reluctant to duplicate the
exception catch, when catching them once and for all in _putPage solves
the problem.
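
The "catch once" argument can be pictured with a toy layout like the one
below. Everything here is a simplified assumption for illustration: the
function names only mirror the ones mentioned in this thread, the list used
as a queue is a placeholder, and none of this is the real wikipedia.py code.

```python
class PageNotSaved(Exception):
    """Stand-in for the framework's save error (defined locally here)."""


def _put_page(text, site_encoding):
    """Single save path: the encoding error is translated here, once."""
    try:
        return text.encode(site_encoding)
    except UnicodeEncodeError as error:
        raise PageNotSaved("Cannot encode text for the site: %s" % error)


def put(text, site_encoding="latin-1"):
    # Synchronous entry point; no exception handling duplicated here.
    return _put_page(text, site_encoding)


def async_put(queue, text, site_encoding="latin-1"):
    # Deferred entry point; it also relies on _put_page for error handling.
    queue.append(lambda: _put_page(text, site_encoding))
```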
-- 
Nicolas Dumazet — NicDumZ [ nɪk.d̪ymz ]
