Hi all!
I'm running into encoding problems when using pywikipedia. Below I've listed some simple cases, some where everything works and some where it doesn't. I hope some of you can help me find the solution.
My working environment:
- Windows XP (up to date)
- Cygwin (CYGWIN_NT-5.1 1.5.24(0.156/4/2))
- Python 2.5
- editing with Notepad++ v4.1.2
- pywikipedia as available on SourceForge (snapshot-2007-06-19.zip)
- user-config.py contains console_encoding = 'utf-8'
- Cygwin is configured to display extended characters (for instance, it shows äöåèé etc.)
- my bot is a modification of basic.py (I only replaced the line "text = 'Foobar ' + text" with my own code)
- basic.py (and its modifications) DOES begin with:
    #!/usr/bin/python
    # -*- coding: utf-8 -*-
Note also that the variable "text" (which holds the content of the page retrieved from the wiki) DOES contain special characters in every example below (accented letters and letters with umlauts, öäåèé for example).
CASE 1
I run basic.py as it is. Everything works as it should:
- it adds "Foobar" at the beginning of the page I have passed to it
- it shows me the difference
- it asks me if I want to accept the changes.
CASE 2
In basic.py, I add a special character to the text to be inserted, e.g.:
    text = 'Fóobar ' + text   # (o with an accent)
I get the following error:
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
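My guess at what happens in Case 2, as a minimal sketch rather than anything taken from pywikipedia itself: if "text" is a unicode object (an assumption on my part, but the u'\xe4' in the Case 5 error suggests it), Python 2 has to decode the byte string 'Fóobar ' with the default 'ascii' codec before it can concatenate the two. The sketch below is written for a current Python 3 interpreter, where that decode step must be spelled out explicitly; page_text and try_ascii_decode are hypothetical names used only for illustration.

```python
# Sketch for Python 3. Python 2 performs the ASCII decode below implicitly
# when a byte string is concatenated with a unicode object.

# 'Fóobar ' as stored in a UTF-8 source file: the ó becomes bytes 0xc3 0xb3.
prefix_bytes = u'F\xf3obar '.encode('utf-8')

# Assumption: the page text is a unicode object (hypothetical sample value).
page_text = u'sis\xe4lt\xf6'


def try_ascii_decode(raw):
    """Return the error message from decoding raw as ASCII, or None on success."""
    try:
        raw.decode('ascii')
    except UnicodeDecodeError as exc:
        return str(exc)
    return None


# The same failure as in Case 2: byte 0xc3 at position 1 is not ASCII.
error = try_ascii_decode(prefix_bytes)
print(error)

# Decoding with the encoding the file is actually saved in succeeds,
# and the result can then be joined with the unicode page text.
fixed = prefix_bytes.decode('utf-8') + page_text
```

If this guess is right, then in Python 2 writing the literal as a unicode string (u'Fóobar ') under the existing coding: utf-8 header should avoid the implicit ASCII decode, but I have not verified this against the bot.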
CASE 3
Regex substitutions also work fine. For instance, I replace "text = 'Foobar ' + text" with something like:
    p = re.compile('(.*)(some text also with accènts)(.*)')
    text = p.sub(r'\1 text to be replaced WITHOUT accents', text)
As in Case 1, I get the difference (if any) with the new text, and it asks me if I want to accept the changes.
CASE 4
Same as Case 3, but with special characters in the second line:
    text = p.sub(r'\1 text to be replaced WITH accents öäåèé', text)
I get the following error:
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 55: ordinal not in range(128)
CASE 5
I replace "text = 'Foobar ' + text" with just:
    print text
and the following error occurs:
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 109: ordinal not in range(128)
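Case 5 looks like the reverse direction: a unicode object being encoded to ASCII on output. Here is a minimal sketch of that step, again for a current Python 3 interpreter and with hypothetical names (page_text, encode_for_console); it is not how pywikipedia handles output, only an illustration of the encode step that seems to fail.

```python
# Assumption: the page text is a unicode object containing ä (U+00E4);
# this sample value is hypothetical.
page_text = u'sis\xe4lt\xf6'


def encode_for_console(text, encoding):
    """Encode text for output; return (encoded bytes or None, error message or None)."""
    try:
        return text.encode(encoding), None
    except UnicodeEncodeError as exc:
        return None, str(exc)


# Encoding to ASCII fails on the first non-ASCII character (ä at position 3).
ascii_bytes, ascii_error = encode_for_console(page_text, 'ascii')
print(ascii_error)

# Encoding explicitly to UTF-8 succeeds and yields bytes a UTF-8 console can show.
utf8_bytes, utf8_error = encode_for_console(page_text, 'utf-8')
print(repr(utf8_bytes))
```

If that is what is going on, an explicit text.encode('utf-8') before printing might work around the error in Python 2, assuming the Cygwin console really expects UTF-8.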
CASE 6 I create a separate file without retrieving anything from wikipedia, but just adding the text as a string (text = """text, accénts, etc."""), and implement ALL the examples above. All of them work, that is, no Unicode errors.
Any clue where the problem could be? Thanks in advance.
Cheers, Raffa/Rutja76 (it, en and fi:wiki)