Revision: 5802
Author: nicdumz
Date: 2008-08-17 07:40:05 +0000 (Sun, 17 Aug 2008)

Log Message:
-----------
_putPage was crashing when either the text or the comment argument was a non-ascii string:

p.put('öäö$üö$äöäç%2234qwdadasd', 'Test')

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/nico/projets/devpywiki/wikipedia.py", line 1284, in put
    newPage, self.site().getToken(sysop = sysop), sysop = sysop)
  File "/home/nico/projets/devpywiki/wikipedia.py", line 1300, in _putPage
    encodedText = text.encode(self.site().encoding())
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
Now raising ValueError when the text or comment is not Unicode
Modified Paths:
--------------
    trunk/pywikipedia/wikipedia.py

Modified: trunk/pywikipedia/wikipedia.py
===================================================================
--- trunk/pywikipedia/wikipedia.py	2008-08-16 11:18:09 UTC (rev 5801)
+++ trunk/pywikipedia/wikipedia.py	2008-08-17 07:40:05 UTC (rev 5802)
@@ -1294,8 +1294,12 @@
         host = self.site().hostname()
         # Get the address of the page on that host.
         address = self.site().put_address(self.urlname())
+        if not isinstance(comment, unicode):
+            raise ValueError("An unicode edit comment is expected as an argument")
         # Use the proper encoding for the comment
         encodedComment = comment.encode(self.site().encoding())
+        if not isinstance(text, unicode):
+            raise ValueError("An unicode wikitext is expected as an argument")
         # Encode the text into the right encoding for the wiki
         encodedText = text.encode(self.site().encoding())
         predata = {
nicdumz@svn.wikimedia.org wrote:
Revision: 5802 Author: nicdumz Date: 2008-08-17 07:40:05 +0000 (Sun, 17 Aug 2008)
Log Message:
_putPage was crashing when either the text or the comment argument was a non-ascii string:

p.put('öäö$üö$äöäç%2234qwdadasd', 'Test')

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/nico/projets/devpywiki/wikipedia.py", line 1284, in put
    newPage, self.site().getToken(sysop = sysop), sysop = sysop)
  File "/home/nico/projets/devpywiki/wikipedia.py", line 1300, in _putPage
    encodedText = text.encode(self.site().encoding())
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
Now raising ValueError when the text or comment is not Unicode
I think this is wrong, why don't you use this instead:
p.put(u"my unicode text", u"my unicode comment")?
Now simple ASCII strings are broken:
akson$ cat text.py
import wikipedia
site = wikipedia.getSite('pl','wikipedia')
p = wikipedia.Page(site, "Wikipedysta/Miner/Brudnopis")
p.put("ABC", "DEF")
akson$ python text.py
Checked for running processes. 1 processes currently running, including the current process.
Traceback (most recent call last):
  File "text.py", line 4, in <module>
    p.put("ABC", "DEF")
  File "wikipedia.py", line 1284, in put
    newPage, self.site().getToken(sysop = sysop), sysop = sysop)
  File "wikipedia.py", line 1298, in _putPage
    raise ValueError("An unicode edit comment is expected as an argument")
ValueError: An unicode edit comment is expected as an argument
--Marcin
yep, thanks; it should be better by r5804 :)
2008/8/17 Marcin Cieslak saper@system.pl:
Pywikipedia-l mailing list Pywikipedia-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/pywikipedia-l
Nicolas Dumazet wrote:
yep, thanks; it should be better by r5804 :)
Well, but I don't understand the _need_ for this change at all:
--- trunk/pywikipedia/wikipedia.py	2008-08-17 14:39:50 UTC (rev 5803)
+++ trunk/pywikipedia/wikipedia.py	2008-08-17 15:14:02 UTC (rev 5804)
@@ -1294,14 +1294,16 @@
         host = self.site().hostname()
         # Get the address of the page on that host.
         address = self.site().put_address(self.urlname())
-        if not isinstance(comment, unicode):
-            raise ValueError("An unicode edit comment is expected as an argument")
         # Use the proper encoding for the comment
-        encodedComment = comment.encode(self.site().encoding())
-        if not isinstance(text, unicode):
-            raise ValueError("An unicode wikitext is expected as an argument")
+        try:
+            encodedComment = comment.encode(self.site().encoding())
+        except UnicodeDecodeError:
+            raise ValueError("An ascii string or unicode edit comment is expected as an argument")
         # Encode the text into the right encoding for the wiki
-        encodedText = text.encode(self.site().encoding())
+        try:
+            encodedText = text.encode(self.site().encoding())
+        except UnicodeDecodeError:
+            raise ValueError("An ascii string or unicode wikitext is expected as an argument")
         predata = {
             'wpSave': '1',
             'wpSummary': encodedComment,
Why assume that the string given is provided in the site's encoding?
The "site" encoding - that of the MediaWiki site you talk to over HTTP - is different from Python's or your script's encoding.
If I have input received from an external source (file, database, HTTP), I decode it manually. Say, in this example, I got something in UCS-2:
my_string = received_string.decode("ucs-2") # my_string is now a unicode string
mypage.put(my_string) # works
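In Python 3 terms (where bytes plays the part of Python 2's str), the decode-at-the-boundary step above can be sketched as follows; the UTF-16 input is fabricated for illustration:

```python
# Sketch, not pywikipedia code: bytes arriving from an external source are
# decoded once, at the boundary, and only text is passed onwards.
received_string = "żółw".encode("utf-16")     # pretend this arrived over the wire in UCS-2/UTF-16

my_string = received_string.decode("utf-16")  # my_string is now a real text (unicode) string
print(my_string)
```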
Why silently assume that all strings provided by the script author are in the MediaWiki site encoding?
Plain ASCII strings should be passed unconverted, and all Unicode strings should be Python "unicode" objects.
--Marcin
The reason why we are encoding our strings/unicode objects to the site's encoding is obvious: we don't trust ourselves to remember to encode each string properly before each put. And failing to properly encode a string would result in a garbage write. This has always been like this; my recent changes did not introduce that behavior.
I introduced the unicode check because a user did not understand why a UnicodeDecodeError was triggered by this put:

text = open('file_in_utf8_with_non-ascii_chars').read()
page.put(text)

Throwing a UnicodeDecodeError is not user-friendly at all. Hopefully, most of the readers here understand that text should be decoded before being passed to put; but why aren't we throwing a nice error instead of this UnicodeDecodeError, forcing the average user to dig into our code to understand what's happening?
That's the reason for that change.
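To make the failure mode concrete, here is a small self-contained sketch (in Python 3 syntax, with a throwaway temp file standing in for the user's UTF-8 file) of the explicit decode the user was missing:

```python
import os
import tempfile

# Recreate the situation: a file holding UTF-8 bytes with non-ascii characters.
fd, path = tempfile.mkstemp()
os.write(fd, "héllo wörld".encode("utf-8"))
os.close(fd)

# The fix on the user's side: decode the raw bytes right after read(),
# before handing the text to page.put().
with open(path, "rb") as f:
    text = f.read().decode("utf-8")

os.remove(path)
print(text)
```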
my_string = received_string.decode("ucs-2") # my_string is now a unicode string
mypage.put(my_string) # works
Not always, it triggers a UnicodeDecodeError if the unicode string is not convertible into the MediaWiki site's encoding.
Why silently assume that all strings provided by the script author are in MediaWiki site encoding?
We're not assuming this, we've always been encoding the strings to the MediaWiki site encoding... Oo
Now look at revisions 5805-5806, and tell me what you think of the implemented behavior:
* accept only plain strings we can reliably decode (i.e. ascii strings)
* accept unicode strings
* catch UnicodeDecodeErrors if the unicode string can't be encoded to the site's encoding, to warn the user in a friendly fashion
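A hypothetical sketch of those three rules (not the actual r5805-5806 code; written in Python 3 terms, where bytes stands in for the old Python 2 str and str for unicode):

```python
def encode_for_site(arg, site_encoding):
    """Hypothetical helper sketching the behavior listed above."""
    if isinstance(arg, bytes):
        try:
            # accept only plain bytestrings we can reliably decode, i.e. ascii
            arg = arg.decode("ascii")
        except UnicodeDecodeError:
            raise ValueError("An ASCII string or unicode text is expected as an argument")
    try:
        # accept unicode strings, but warn in a friendly fashion when the
        # site's charset cannot represent them
        return arg.encode(site_encoding)
    except UnicodeEncodeError:
        raise ValueError("Text cannot be encoded to the site's %s encoding" % site_encoding)

print(encode_for_site(b"ABC", "utf-8"))
```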
On Mon, August 18, 2008 4:00 am, Nicolas Dumazet wrote:
Not always, it triggers a UnicodeDecodeError if the unicode string is not convertible into the MediaWiki site's encoding.
It doesn't. It triggers a UnicodeDecodeError if the string passed is non-unicode and cannot be converted using the default (Latin-1) encoding.
Your current implementation is wrong: it converts a bytestring (str) to unicode instead of to the site's encoding. Secondly, throwing a PageNotSaved exception in a function that has nothing to do with a page save (just with a page) is a bit strange...
The only good system is the old system. If needed, catch the UnicodeDecodeError and raise a PageNotSaved exception in the /save/ function.
--Merlijn
2008/8/18 Marcin Cieslak saper@system.pl:
I introduced the unicode check because a user did not understand why a UnicodeDecodeError was triggered by this put:

text = open('file_in_utf8_with_non-ascii_chars').read()
page.put(text)
Please educate your user about I/O in Python with Unicode characters :-) Adding .decode("filescharset") after read() should help. This is not a problem in pywikipedia, but in the user code.
Please read again what I wrote in my mail =] : "Hopefully, most of the readers here understand that text should be decoded first before passing it to put"
2008/8/18 Merlijn van Deen valhallasw@arctus.nl:
On Mon, August 18, 2008 4:00 am, Nicolas Dumazet wrote:
Not always, it triggers a UnicodeDecodeError if the unicode string is not convertible into the MediaWiki site's encoding.
It doesn't. It triggers a UnicodeDecodeError if the string passed is non-unicode and cannot be converted using the default (Latin-1) encoding.
My bad, I meant UnicodeEncodeError of course: assume that the site encoding is, for instance, the Arabic charset windows-1256, and try wikipedia.put'ing some Asian unicode text, for example u'人物' :) However, I was not __this__ wrong :p: a UnicodeDecodeError is also raised if the bytestring cannot be decoded as ascii, yes (and not latin-1: 'é' can be a latin-1 bytestring, but not an ascii string).
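Both failure modes can be reproduced in a couple of lines (Python 3 syntax; windows-1256 is the Arabic codepage mentioned above):

```python
# 1. Encoding unicode text into a charset that cannot represent it:
#    the Arabic codepage windows-1256 has no CJK characters.
try:
    "人物".encode("windows-1256")
except UnicodeEncodeError as e:
    encode_error = e

# 2. Decoding a non-ascii bytestring with the ascii codec - this is what
#    Python 2 did implicitly when str.encode() was called on raw bytes.
try:
    b"\xc3\xa9".decode("ascii")   # the UTF-8 bytes of 'é'
except UnicodeDecodeError as e:
    decode_error = e

print(type(encode_error).__name__, type(decode_error).__name__)
```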
Your current implementation is wrong: it converts a bytestring (str) to unicode instead of to the sites encoding.
What? Where? :) The arg.decode() line is intended as a test: "is arg an ascii bytestring?". It does not alter arg...
Here is the idea, considering the line:

    str.encode(self.site().encoding())

* If str is unicode, that line can fail with UnicodeEncodeError (e.g. u'人物'.encode('latin-1')). I want to raise a user-friendly error if that happens.
* If str is a string, this line is exactly equivalent to str.decode('ascii').encode(self.site().encoding()), or shorter, str.decode().encode(self.site().encoding()).
** str.decode() can fail with UnicodeDecodeError (e.g. 'é'.decode()) - that's what is being tested on line 1290, "arg.decode()". I want to raise another error here, different from the first one.
** encode() cannot fail with UnicodeEncodeError since if we call it, it means that decode succeeded, i.e. we are encoding ascii characters.
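The equivalence claimed above can be checked directly (Python 3 spelling: the implicit decode('ascii') that Python 2's str.encode() performed is written out explicitly; the assumption that site charsets are ascii supersets is mine):

```python
payload = b"%2234qwdadasd"            # a plain ascii bytestring

# The two-step chain is a no-op for pure ascii input...
assert payload.decode("ascii").encode("cp1256") == payload

# ...and once decode('ascii') has succeeded the text is pure ascii, so the
# encode() step cannot fail for any ascii-superset charset.
for enc in ("utf-8", "latin-1", "cp1256"):
    assert payload.decode("ascii").encode(enc) == payload

print("ok")
```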
My implementation was wrong and confused, true. But is it such a bad idea to throw nice errors that explain in one line what's happening? I really don't like the idea of having our framework users forced to dig into OUR code to understand what was wrong with THEIR code...
How is r5807 ?
Secondly, throwing a PageNotSaved exception in a function that has nothing to do with a page save (just with a page) is a bit strange...
Well, in the same function, _putPage, we nearly __only__ throw these errors: PageNotSaved several times, plus the derived SpamfilterError, EditConflict and LongPageError. The only non-PageNotSaved one is LockedPage. Note that I really don't mind what you use as an Error, as long as the message is self-explanatory :)
The only good system is the old system. If needed, catch the UnicodeDecodeError and raise a PageNotSaved exception in the /save/ function.
Not sure; to me _putPage is our only save function, called from both put and async_put. Do you mean catching the errors above in put and async_put? If so, I would be quite reluctant to duplicate the exception catch when catching them once and for all in _putPage solves the problem.
Nicolas Dumazet wrote:
"Hopefully, most of the readers here understand that text should be decoded first before passing it to put"
That's the basics of Python Unicode programming. The text should be decoded from which encoding?
If a user starts searching for UnicodeDecodeError, s/he should be enlightened :)
Here is the idea, considering the line : str.encode(self.site().encoding())
* If str is unicode, that line can fail with UnicodeEncodeError (e.g. u'人物'.encode('latin-1')). I want to raise a user-friendly error if that happens.
UnicodeEncodeError is precise enough for this. This means "I can't convert your string to the encoding your MediaWiki installation uses".
* If str is a string, this line is exactly equivalent to str.decode('ascii').encode(self.site().encoding()), or shorter, str.decode().encode(self.site().encoding())
And we don't need to play with ascii strings at all. "encode" on this will do no harm.
** str.decode() can fail with UnicodeDecodeError, ( e.g. 'é'.decode() ) - that's what is being tested line 1290, "arg.decode()" . I want to raise another error here, different from the first one.
This is the case where UnicodeDecodeError should be raised. UnicodeDecodeError != UnicodeEncodeError so they are different.
** encode() cannot fail with UnicodeeEncodeError since if we call it, it means that decode succeeded, i.e. we are encoding ascii characters
That is redundant.
My implementation was wrong and confused, true. But is it such a bad idea to throw nice errors that explain in one line what's happening? I really don't like the idea of having our framework users forced to dig into OUR code to understand what was wrong with THEIR code...
How is r5807 ?
Fine, except for catching "UnicodeDecodeError" and providing a "user friendly" text message. We don't need this - see below
If you are more interested in learning about exception theory, please search for "checked exceptions", for example here:
http://radio.weblogs.com/0122027/stories/2003/04/01/JavasCheckedExceptionsWe...
http://web.archive.org/web/20080205032558/http://www.mindview.net/Etc/Discus...
What you are trying to do is to "tunnel" a real exception in a PageNotSaved exception - this is just not necessary in Python.
Well, in the same function, _putPage, we nearly __only__ throw these errors: PageNotSaved several times, plus the derived SpamfilterError, EditConflict and LongPageError. The only non-PageNotSaved one is LockedPage. Note that I really don't mind what you use as an Error, as long as the message is self-explanatory :)
This is not Java, and we don't do checked exceptions. Nobody forbids _putPage to fail with any exception. Indeed, it should fail with UnicodeDecodeError or UnicodeEncodeError.
The only good system is the old system. If needed, catch the UnicodeDecodeError and raise a PageNotSaved exception in the /save/ function.
Not sure; to me _putPage is our only save function, called from both put and async_put. Do you mean catching the errors above in put and async_put? If so, I would be quite reluctant to duplicate the exception catch when catching them once and for all in _putPage solves the problem.
You don't need to catch or declare exceptions! Suppose I have my program:
try:
    page.put(somestring)
except UnicodeDecodeError, e:
    if my_application_default_encoding:
        page.put(somestring.decode(my_application_default_encoding))
    else:
        raise
except UnicodeEncodeError, e:
    # "Your MediaWiki installation does not use utf-8!"
    print u"""Twoja instalacja MediaWiki nie korzysta z utf-8!""".encode("utf-8")
I know _precisely_ what happened. What if I got PageNotSaved? I need to analyze some English-language string instead. What if I want to be user friendly and deliver nicely formatted localized error messages to the user? I can't really precisely differentiate between those situations (unicode error or something completely unrelated).
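A minimal sketch of that difference, with hypothetical wrappers standing in for pywikipedia's code: tunnelling the codec error through one generic exception leaves the caller unable to tell the cases apart, while letting it propagate keeps the failure precisely catchable.

```python
class PageNotSaved(Exception):
    """Stand-in for pywikipedia's generic save failure."""

def put_tunnelled(text, encoding="windows-1256"):
    # every failure collapses into one generic error
    try:
        return text.encode(encoding)
    except (UnicodeEncodeError, UnicodeDecodeError):
        raise PageNotSaved("page not saved")

def put_plain(text, encoding="windows-1256"):
    # the codec error propagates unchanged to the caller
    return text.encode(encoding)

try:
    put_plain("人物")
except UnicodeEncodeError:
    caught = "charset problem"   # the caller knows exactly what went wrong
print(caught)
```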
Actually, when I first started using UTF-8 with pywikipedia, I got "UnicodeDecodeError" in one of my first attempts. Then I knew _exactly_ what the problem was - I had tried ASCII strings instead of Unicode. I had to dig a bit to learn about Unicode in Python, of course.
The text message might be more readable, but if I google for "UnicodeDecodeError" it will be much more helpful than a PageNotSaved with a message.
--Marcin
Nicolas Dumazet wrote:
The reason why we are encoding our strings/unicode objects to the site's encoding is obvious: we don't trust ourselves to remember to encode each string properly before each put. And failing to properly encode a string would result in a garbage write. This has always been like this; my recent changes did not introduce that behavior.
This has worked properly in the past for me and for others, once the unicode string is supplied everything works. Even Unicode page titles work perfectly. So, yes, strings _are_ properly encoded on HTTP output.
I introduced the unicode check because a user did not understand why a UnicodeDecodeError was triggered by this put:

text = open('file_in_utf8_with_non-ascii_chars').read()
page.put(text)
Please educate your user about I/O in Python with Unicode characters :-) Adding .decode("filescharset") after read() should help. This is not a problem in pywikipedia, but in the user code.
UnicodeDecodeError is exactly the same thing you try to say with "PageNotSavedError", except that the application cannot properly catch this case. Just like Merlijn wrote earlier.
Can we go back to the r5801 solution?
--Marcin