New subject: [Pywikipedia-l] UnicodeError: ASCII [decoding, encoding] error: ordinal not in range(128)

1 Aug 2007


Hi all!
I seem to be having an encoding problem when using pywikipedia.
Below I describe some simple cases, some where everything works
fine and some where it does not. I hope some of you can help me find
the solution.
My working environment:
- Windows XP (up-to-date)
- Cygwin (CYGWIN_NT-5.1 1.5.24(0.156/4/2))
- Python 2.5
- editing with NotePad++ v.4.1.2
- using pywikipedia as available on sourceforge (snapshot-2007-06-19.zip)
- in user-config.py there is
     console_encoding = 'utf-8'
- Cygwin is configured to display non-ASCII characters (for instance, it 
shows äöåèé etc.)
- my bot is a modification of basic.py (I just modified the part where 
there is "text = 'Foobar ' + text" with my own code)
- basic.py (and its modifications) DOES contain at the beginning:
     #!/usr/bin/python
     # -*- coding: utf-8  -*-
Note also that the variable "text" (containing the content of the page 
retrieved from the wiki) DOES contain non-ASCII characters in all of 
the examples below (accented letters and letters with umlauts, öäåèé 
for example).
CASE 1
I run basic.py as it is. Everything works as it should:
- it adds "Foobar" at the beginning of the page I have passed to it
- it shows me the difference
- it asks me if I want to accept the changes.
CASE 2
In basic.py I add a special character to the text to be added, e.g.:
     text = 'Fóobar ' + text  # (o with an accent)
I get the following error:
     UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position
     1: ordinal not in range(128)
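A minimal sketch of what may be going on in this case, assuming (it is not confirmed above) that "text" is a unicode object while the quoted literal is a plain byte string holding UTF-8 bytes. Under Python 2, adding the two makes the interpreter decode the byte string with the ASCII codec first. The names page_text and prefix_bytes are made up for the illustration, and the byte string is built with .encode() only so that the snippet behaves the same on any Python version:

```python
import sys

# Assumption: stands in for the unicode page text returned by the framework.
page_text = u'sivu \xe4\xf6'

# The UTF-8 bytes that a plain 'Fóobar ' literal holds in a UTF-8 source file.
prefix_bytes = u'F\xf3obar '.encode('utf-8')

# Python 2 performs this ASCII decode implicitly for "byte string + unicode".
try:
    prefix_bytes.decode('ascii')
    message = ''
except UnicodeDecodeError:
    message = str(sys.exc_info()[1])

# Decoding explicitly with the real encoding gives unicode + unicode.
new_text = prefix_bytes.decode('utf-8') + page_text
```

If the assumption holds, writing the literal as u'Fóobar ' (with the coding line already present and the file really saved as UTF-8) would presumably avoid the implicit decode as well.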
CASE 3
Everything works fine also with regex substitutions, for instance 
replacing "text = 'Foobar ' + text" with something like
     p = re.compile('(.*)(some text also with accènts)(.*)')
     text = p.sub(r'\1 text to be replaced WITHOUT accents', text)
I get, as above, the difference (if any) with the new text and it asks 
me if I want to accept the changes.
CASE 4
Same as case 3 but with special characters in the second line:
     text = p.sub(r'\1 text to be replaced WITH accents öäåèé', text)
I get the following error:
     UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position
     55: ordinal not in range(128)
CASE 5
I replace "text = 'Foobar ' + text" with just
     print text
and the following error occurs:
     UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in
     position 109: ordinal not in range(128)
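This error goes in the opposite direction (encode rather than decode). A hedged sketch of the likely mechanism, again assuming that "text" is a unicode object: Python 2's print has to turn it into bytes, and it falls back to ASCII when it cannot determine the encoding of the output stream. The name page_text is hypothetical, and 'utf-8' is chosen only because it matches the console_encoding setting listed above:

```python
import sys

# Assumption: stands in for the unicode page text returned by the framework.
page_text = u'p\xe4iv\xe4'

# What print effectively does when the output encoding falls back to ASCII.
try:
    page_text.encode('ascii')
    message = ''
except UnicodeEncodeError:
    message = str(sys.exc_info()[1])

# Encoding explicitly yields bytes that can be written out as they are.
encoded = page_text.encode('utf-8')
```

Under these assumptions, "print text.encode('utf-8')" would be one way to get the page onto a UTF-8 console.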
CASE 6
I create a separate file that does not retrieve anything from Wikipedia, 
but defines the text directly as a string (text = """text, accénts, 
etc."""), and implement ALL the examples above. All of them work, that 
is, no Unicode errors.
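One possible reading of this case, offered as an assumption rather than a diagnosis: here every string is a byte string, so nothing is ever mixed with a unicode object and no codec gets involved. The sketch below builds both byte strings with .encode() so that it runs unchanged on any Python version; the variable names are illustrative only:

```python
# Both values are byte strings, as plain quoted literals are under Python 2.
text = u'text, acc\xe9nts, etc.'.encode('utf-8')
prefix = u'F\xf3obar '.encode('utf-8')

# Bytes + bytes is plain concatenation: no implicit decode, hence no error.
combined = prefix + text

# The result is still a byte string, not a unicode object.
is_unicode = isinstance(combined, type(u''))
```

Checking type(text) inside the bot would show whether the page text there really is unicode, which is the one thing that differs from this standalone file.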
Any clue where the problem could be? Thanks in advance.
Cheers,
Raffa/Rutja76 (it, en and fi:wiki)