2008/8/18 Marcin Cieslak saper@system.pl:
I introduced the unicode check because a user did not understand why a UnicodeDecodeError was triggered by put() in:
    text = open('file_in_utf8_with_non-ascii_chars').read()
    page.put(text)
Please educate your user about Python I/O with Unicode characters :-) Adding .decode("filescharset") after read() should help. This is not a problem in pywikipedia, but in the user's code.
Please read again what I wrote in my mail =] : "Hopefully, most of the readers here understand that text should be decoded first before passing it to put"
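For readers hitting the same error, here is a minimal sketch of the fix suggested above: read the file as bytes and decode it explicitly before handing the text to put. The helper name read_text, the temporary file and the UTF-8 charset are assumptions for illustration only; none of them come from pywikipedia.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def read_text(path, charset="utf-8"):
    """Read a file as raw bytes, then decode explicitly into unicode text."""
    with io.open(path, "rb") as handle:
        raw = handle.read()
    return raw.decode(charset)


# Write a small UTF-8 file containing a non-ascii character (e with acute accent).
fd, path = tempfile.mkstemp()
os.close(fd)
with io.open(path, "wb") as handle:
    handle.write(u"caf\u00e9".encode("utf-8"))

text = read_text(path)  # unicode text, safe to pass on to page.put(text)
os.remove(path)
```

The point is only that the charset is named by the caller, who knows how the file was written, rather than guessed by the framework.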
2008/8/18 Merlijn van Deen valhallasw@arctus.nl:
On Mon, August 18, 2008 4:00 am, Nicolas Dumazet wrote:
Not always, it triggers a UnicodeDecodeError if the unicode string is not convertible into the MediaWiki site's encoding.
It doesn't. It triggers a UnicodeDecodeError if the string passed is non-unicode and cannot be converted using the default (Latin-1) encoding.
My bad, I meant UnicodeEncodeError of course: assume that the site encoding is, for instance, the Arabic charset windows-1256, and try to wikipedia.put some Asian unicode text, for example u'人物' :) However, I was not __this__ wrong :p : UnicodeDecodeError is also raised if the bytestring cannot be converted into ascii, yes (and not latin-1: 'é' can be a latin-1 bytestring, not an ascii string).
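The two failure modes can be reproduced outside the framework. This is only a sketch: the variable names are made up, and escapes are used instead of literal characters so that the snippet does not depend on the source file's encoding.

```python
# Case 1: unicode text that the site encoding cannot represent.
asian_text = u"\u4eba\u7269"  # the two CJK characters from the example above
try:
    asian_text.encode("windows-1256")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

# Case 2: a bytestring that is valid latin-1 but not valid ascii.
latin1_bytes = u"\u00e9".encode("latin-1")  # the single byte 0xE9
try:
    latin1_bytes.decode("ascii")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Decoding the same byte with the right charset succeeds.
roundtrip = latin1_bytes.decode("latin-1")
```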
Your current implementation is wrong: it converts a bytestring (str) to unicode instead of to the sites encoding.
what ? where ? :) the arg.decode() line is intended as a test : "is arg an ascii bytestring ?". It does not alter arg...
Here is the idea, considering the line : str.encode(self.site().encoding())
* If str is unicode, that line can fail with UnicodeEncodeError (e.g. u'人物'.encode('latin-1')). I want to raise a user-friendly error if that happens.
* If str is a bytestring, this line is exactly equivalent to str.decode('ascii').encode(self.site().encoding()), or shorter, str.decode().encode(self.site().encoding())
** str.decode() can fail with UnicodeDecodeError (e.g. 'é'.decode()) - that's what is being tested on line 1290, "arg.decode()". I want to raise another error here, different from the first one.
** encode() cannot fail with UnicodeEncodeError: if we call it, it means that decode succeeded, i.e. we are encoding ascii characters.
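The two cases above could be sketched roughly as follows. This is not the code from the revision under discussion: encode_for_site, the use of ValueError and the message wording are placeholders chosen for illustration.

```python
def encode_for_site(text, site_encoding):
    """Encode text for the site, raising a one-line explanation on failure.

    Hypothetical helper: the name and the ValueError type are assumptions.
    """
    if isinstance(text, bytes):
        # Bytestring case: only plain ascii can be converted implicitly.
        try:
            text = text.decode("ascii")
        except UnicodeDecodeError:
            raise ValueError(
                "bytestring contains non-ascii bytes; "
                "decode it to unicode before saving")
    # Unicode case (or ascii that was just decoded): encode for the site.
    try:
        return text.encode(site_encoding)
    except UnicodeEncodeError:
        raise ValueError(
            "text cannot be represented in the site encoding %s"
            % site_encoding)
```

The second try block can only fail for genuine unicode input, as long as the site encoding is a superset of ascii, which matches the reasoning in the last bullet.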
My implementation was wrong and confused, true. But is it such a bad idea to throw nice errors that explain in one line what's happening ? I really don't like the idea of having our framework users forced to dig into OUR code to understand what was wrong with THEIR code...
How is r5807 ?
Secondly, throwing a PageNotSaved exception in a function that has nothing to do with a page save (just with a page) is a bit strange...
Well, in the same function, _putPage, we nearly __only__ throw these errors: PageNotSaved several times, and the derived SpamfilterError, EditConflict and LongPageError. The only non-PageNotSaved one is LockedPage. Note that I really don't mind what you use as an Error, as long as the message is self-explanatory :)
The only good system is the old system. If needed, catch the UnicodeDecodeError and raise a PageNotSaved exception in the /save/ function.
Not sure, to me _putPage is our only save function, called from both put and async_put. Do you mean catching the errors above in put and async_put? If so, I would be quite reluctant to duplicate the exception handling when catching them once and for all in _putPage solves the problem.
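To make the design point concrete, here is a rough sketch of translating the error once in the shared helper instead of in each caller. Everything here is a stand-in: the real PageNotSaved, put, async_put and _putPage have different signatures and do much more, and the function names below are simplified on purpose.

```python
class PageNotSaved(Exception):
    """Stand-in for the framework exception of the same name."""


def _put_page(text, site_encoding):
    """Single save path: the unicode error is translated here, once."""
    try:
        return text.encode(site_encoding)
    except UnicodeEncodeError:
        raise PageNotSaved(
            "text cannot be represented in the site encoding %s"
            % site_encoding)


def put(text, site_encoding):
    # No try/except needed here: _put_page already raises PageNotSaved.
    return _put_page(text, site_encoding)


def async_put(text, site_encoding):
    # Same for the asynchronous entry point (simplified to a direct call).
    return _put_page(text, site_encoding)
```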