Hi all!
It seems I'm having encoding problems when using pywikipedia. Below I've listed some simple cases, some where everything works and some where it doesn't. I hope some of you can help me find the solution.
My working environment:
- Windows XP (up-to-date)
- Cygwin (CYGWIN_NT-5.1 1.5.24(0.156/4/2))
- Python 2.5
- editing with NotePad++ v.4.1.2
- using pywikipedia as available on sourceforge (snapshot-2007-06-19.zip)
- in user-config.py there is console_encoding = 'utf-8'
- cygwin is configured to support extra-characters (for instance it shows äöåèé etc.)
- my bot is a modification of basic.py (I just modified the part where there is "text = 'Foobar ' + text" with my own code)
- basic.py (and its modifications) DOES contain at the beginning:
  #!/usr/bin/python
  # -*- coding: utf-8 -*-
Let me also state here that variable "text" (containing the content of the page retrieved from wiki) DOES contain special characters in any of the examples below (accented letters and letters with umlaut, öäåèé for example).
CASE 1 I run basic.py as it is. Everything works as it should:
- it adds "Foobar" at the beginning of the page I have passed to it
- it shows me the difference
- it asks me if I want to accept the changes.
CASE 2 In basic.py I add a special character to the text to be added, e.g.:
text = 'Fóobar ' + text # (o with an accent)
I get the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
CASE 3 Everything also works fine with regex substitutions, for instance when replacing "text = 'Foobar ' + text" with something like
p = re.compile('(.*)(some text also with accènts)(.*)')
text = p.sub(r'\1 text to be replaced WITHOUT accents', text)
As above, I get the difference (if any) with the new text and it asks me if I want to accept the changes.
CASE 4 Same as case 3 but with special characters in the second line:
text = p.sub(r'\1 text to be replaced WITH accents öäåèé', text)
I get the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 55: ordinal not in range(128)
CASE 5 I replace "text = 'Foobar ' + text" with just
print text
and the following error occurs:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 109: ordinal not in range(128)
CASE 6 I create a separate file without retrieving anything from wikipedia, but just adding the text as a string (text = """text, accénts, etc."""), and implement ALL the examples above. All of them work, that is, no Unicode errors.
Any clue where the problem could be? Thanks in advance.
Cheers, Raffa/Rutja76 (it, en and fi:wiki)
Raffaello Tesi wrote:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
These errors occur because the file is not saved correctly in UTF-8. Try using a different text editor (e.g. SciTE) when editing text that contains non-ASCII (not 7-bit) characters.
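The byte values in the error messages can help check this. The sketch below is only an illustration (the helper name hex_bytes is made up for it); it prints the bytes that UTF-8 and Latin-1 produce for the letters used in cases 2 and 4. A leading 0xc3 is what UTF-8 gives for these letters, while a lone 0xe8 is how Latin-1 stores è, which may hint that the file in case 4 was not saved as UTF-8.

```python
# Sketch: compare the bytes each encoding produces for the accented letters
# that show up in the error messages of cases 2 and 4.

def hex_bytes(text, encoding):
    """Return the encoded form of `text` as a list of two-digit hex strings."""
    return ['%02x' % b for b in bytearray(text.encode(encoding))]

o_acute = u'\u00f3'   # the letter added in case 2 (o with an acute accent)
e_grave = u'\u00e8'   # one of the letters used in case 4 (e with a grave accent)

utf8_o = hex_bytes(o_acute, 'utf-8')      # two bytes, the first is c3
utf8_e = hex_bytes(e_grave, 'utf-8')      # two bytes, the first is c3
latin1_e = hex_bytes(e_grave, 'latin-1')  # one byte, e8

print(utf8_o, utf8_e, latin1_e)
```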
Francesco Cosoleto
Raffaello Tesi wrote:
CASE 5 I replace "text = 'Foobar ' + text" with just
print text
and the following error occurs:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 109: ordinal not in range(128)
Use 'wikipedia.output()' instead of 'print'.
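As a rough illustration of why a plain print can fail here: the text has to be turned into bytes for the console, and the ASCII codec has no byte for ä. The sketch below does not call pywikipedia at all. It only shows the failing encode step and an explicit encode with an assumed console encoding (the sample text and the variable names are made up).

```python
# Sketch: encoding text for a console. ASCII cannot represent u'\xe4',
# while an explicit UTF-8 encode works.

text = u'p\u00e4\u00e4sivu'  # sample text containing u'\xe4', as in case 5

try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False  # same exception class as reported in case 5

console_encoding = 'utf-8'  # assumption: matches console_encoding in user-config.py
encoded = text.encode(console_encoding)

print(ascii_ok, repr(encoded))
```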
CASE 6 I create a separate file without retrieving anything from wikipedia, but just adding the text as a string (text = """text, accénts, etc."""), and implement ALL the examples above. All of them work, that is, no Unicode errors.
With """...""" you can do a raw input of data.
Regards, Francesco Cosoleto
Use u'string' when talking about strings (i.e. character data, without attached encoding); use 'string' when talking about a *bytestring*, i.e. encoded data (where the so-called string is just a representation of an array of ints).
Apparently """ is also interpreted as character data.
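A minimal sketch of the difference, written so that it also runs on current Python versions, where mixing bytes and text is refused outright. In Python 2 the bytestring is silently decoded as ASCII when it is added to a unicode object, so the sketch makes that decode explicit. The sample page text is made up.

```python
# Sketch: what happens when an encoded bytestring meets character data.

raw = u'F\u00f3obar '.encode('utf-8')  # the bytes of 'Fóobar ' in a UTF-8 source file
page_text = u'sis\u00e4lt\u00f6'        # stand-in for the unicode returned by page.get()

try:
    # Python 2 performs this ASCII decode implicitly for: raw + page_text
    raw.decode('ascii') + page_text
    failed = False
except UnicodeDecodeError as err:
    failed = True
    bad_pos = err.start                 # position of the offending byte
    bad_byte = bytearray(raw)[bad_pos]  # its value

# Working with character data from the start avoids the problem.
fixed = raw.decode('utf-8') + page_text

print(failed, hex(bad_byte), bad_pos)
```

The byte and position found this way are the ones reported in case 2.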
--valhallasw
(Raffaello: sorry for the double message. I clicked 'reply' instead of 'reply to all' ;))
Merlijn van Deen wrote:
Use u'string' when talking about strings (i.e. character data, without attached encoding); use 'string' when talking about a *bytestring*, i.e. encoded data (where the so-called string is just a representation of an array of ints).
Yes, that solves the problem mentioned in case 2. However, the UnicodeDecodeError still occurs in other cases, e.g.
lines = str(text).split('\n')
where text = page.get()
Cheers, Raffaello
lines = str(text).split('\n')
where text = page.get()
str(abc) tries to make a bytestring of the text object. Either use
text.split('\n')
or
unicode(text).split('\n')
The first would be the preferred syntax.
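A small sketch of the preferred form, with a made-up stand-in for the page text. In Python 2, str() on a unicode object encodes it with the default ASCII codec, so the sketch shows that encode step explicitly to make the failure visible on any Python version.

```python
# Sketch: split character data directly instead of converting it first.

text = u'Tukholma\nG\u00f6teborg\nMalm\u00f6'  # stand-in for text = page.get()

# Preferred: the result is a list of unicode strings, no conversion involved.
lines = text.split(u'\n')

try:
    text.encode('ascii')  # roughly what str(text) attempts in Python 2
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(lines, ascii_ok)
```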
And for the regexp case: use ur'text' (or ru'text', I think it was ur'') instead of r''
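A sketch of the regexp case with pattern, replacement and input all kept as character data. It is written with plain u'' literals and \u escapes so that it runs on current Python versions, where the ur'' prefix no longer exists; in Python 2 the ur'' form mentioned above plays the same role. The sample sentence is made up.

```python
# Sketch: regex substitution where everything is unicode.
import re

text = u'Testo con acc\u00e8nti nel mezzo'

# Pattern and replacement are both unicode, so no implicit ASCII decode
# is needed when they are combined with the unicode input.
pattern = re.compile(u'(.*)(acc\u00e8nti)(.*)')
result = pattern.sub(u'\\1\u00f6\u00e4\u00e5\\3', text)

print(result)
```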
--valhallasw
Merlijn van Deen wrote:
str(abc) tries to make a bytestring of the text object. Either use
text.split('\n')
Yes, that did the trick!
And for the regexp case: use ur'text' (or ru'text', I think it was ur'') instead of r''
Well, it didn't complain before with just r'text', while if I put ur'text' I get
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe8 in position 0: unexpected end of data
and with ru'text' I just get a SyntaxError.
I'm still having trouble with this string interpolation
incipit = textForm % fields
which fails with the same
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 14: ordinal not in range(128)
Sorry to bother you with this unicode stuff, but I'm really struggling and finding few good references on the net (if you have references, please share, so I can try to work things out by myself :)
Cheers, Raffaello
Well, it didn't complain before with just r'text', while if I put ur'text' I get
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe8 in position 0: unexpected end of data
Check the contents of text (are they correct unicode?); if that's not the problem then I haven't got a clue
while with ru'text' just a SyntaxError
then ur'text' is the correct syntax ;)
I'm still having trouble with this string interpolation
incipit = textForm % fields
which fails with the same
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 14: ordinal not in range(128)
Make sure both textForm and all elements in fields are unicode.
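One way to do that is to decode any byte values before interpolating. This is only a sketch: the helper to_text is hypothetical, the sample values are made up, and UTF-8 is an assumed encoding for the byte values.

```python
# Sketch: normalise all fields to character data before using the % operator.

def to_text(value, encoding='utf-8'):
    """Decode byte strings to text; pass text through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

text_form = u'%(municName)s \u00e8 un comune nella contea di %(countyName)s.'

fields = {
    'municName': u'Malm\u00f6'.encode('utf-8'),  # a byte value, e.g. from encoded data
    'countyName': u'Sk\u00e5ne',                 # already character data
}

clean = dict((key, to_text(value)) for key, value in fields.items())
incipit = text_form % clean

print(incipit)
```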
Sorry to bother you with this unicode stuff, but I'm really struggling and finding few good references on the net (if you have references, please share, so I can try to work things out by myself :)
http://www.reportlab.com/i18n/python_unicode_tutorial.html might have some hints, but feel free to ask :)
--valhallasw
Merlijn van Deen wrote:
make sure both textForm and all elements in fields are unicode.
This is the piece of code that I can't get to work yet, due to these unicode problems:
---
def getText(block):
    lines = block.split('\n')
    for line in lines:
        textMatches = extract_text(line)
        if textMatches:
            for a in textMatches:
                textMatch = a
            return textMatch

pat_county = re.compile(r'''(?x)({{Geobox_contea|)(.*)}}''')
extract_text = lambda s: [u[1] for u in re.findall(pat_county, s)]
county = getText(text)

pat_inhab = re.compile(r'''(?x)(abitanti=)([^|}]*)''')
extract_text = lambda s: [u[1] for u in re.findall(pat_inhab, s)]
inhab = getText(text)

textForm = """%(municName)s è un [[Comuni della Svezia|comune]] [[Svezia|svedese]] situato nella [[Contee della Svezia|contea]] di %(countyName)s."""

fields = {'municName':munic, 'numInhab':inhab}

incipit = textForm % fields

p = re.compile('(.*) una [[municipalità]](.*);(.*)[\capoluogo]](.*).')

text = p.sub(incipit, text)
---
The 'for' loop in the 'if' statement is trivial since there is for sure only one occurrence of the text (if there is), but I couldn't make it work without it.
Thanks again Raffaello
argh. Again reply instead of reply to all >_<
textForm = """%(municName)s è un [[Comuni della Svezia|comune]] [[Svezia|svedese]] situato nella [[Contee della Svezia|contea]] di %(countyName)s."""
I suppose textForm needs to be u"""(...)""" then, but I am not completely sure. And instead of using your own regexps to find templates and their parameters, use page.templatesWithParams() and process that. I'm not completely sure what you want to do though.
--valhallasw
Hi
Sorry for my late answer.
Merlijn van Deen wrote:
I suppose textForm needs to be u"""(...)""" then, but I am not completely sure. And instead of using your own regexps to find templates and their parameters, use page.templatesWithParams() and process that. I'm not completely sure what you want to do though.
My plan is:
1. Gather some information (as strings) from an existing page (information contained both in templates and in text of the page)
2. Create a sentence out of those strings.
3. Replace the first sentence of the page with the sentence just created and save the page
replace.py can replace some text with other text, but that's not enough, since I need to pass it the string created in steps 1 and 2. I don't know whether a script that does this already exists, though.
Cheers, Raffa
My plan is:
- Gather some information (as strings) from an existing page
(information contained both in templates and in text of the page)
Generally: use Page.templatesWithParams(): this gives you a list of tuples (u'template', [u'param1', u'param2' (...)]); those params can also be of the form 'paramname=param'. For the text in the page itself, you will probably still need a regexp.
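A sketch of processing a list shaped as described above. It does not call pywikipedia: the templates list is hand-written sample data (only the template name Geobox_contea and the parameter abitanti come from this thread, the rest is invented), and split_params is a hypothetical helper.

```python
# Sketch: turn 'paramname=param' entries into a dict, keep the others in a list.

def split_params(params):
    """Separate named ('name=value') parameters from unnamed ones."""
    named, positional = {}, []
    for param in params:
        if u'=' in param:
            name, value = param.split(u'=', 1)
            named[name.strip()] = value.strip()
        else:
            positional.append(param.strip())
    return named, positional

# Hand-written stand-in for the result of page.templatesWithParams()
templates = [
    (u'Geobox_contea', [u'nome=Sk\u00e5ne', u'abitanti=1200000', u'Svezia']),
    (u'Altro', []),
]

named, positional = {}, []
for name, params in templates:
    if name == u'Geobox_contea':
        named, positional = split_params(params)

print(named, positional)
```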
- Create a sentence out of those strings.
stringa + stringb + stringc + (...) ;)
- Replace the first sentence of the page with the sentence just created
and save the page
Defining the first sentence of the page as the part up to the first . (which may or may not be accurate):
new = [newsentence]
new.extend(page.get().split('.')[1:])
page.put('.'.join(new))
or something like that.
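The same idea as a small standalone function, under the same assumption that the first sentence ends at the first period. The function name replace_first_sentence is made up, and it works on a plain string rather than on a page object.

```python
# Sketch: swap the part of the text up to the first '.' for a new sentence.

def replace_first_sentence(text, new_sentence):
    """Return `text` with everything up to the first '.' replaced."""
    head, sep, rest = text.partition(u'.')
    if not sep:
        # No period at all: the whole text counts as the first sentence.
        return new_sentence
    # Avoid a doubled period if the new sentence already ends with one.
    return new_sentence.rstrip(u'.') + u'.' + rest

old = u'Vecchia frase. Seconda frase. Terza frase.'
new = replace_first_sentence(old, u'Nuova frase.')

print(new)
```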
--valhallasw
Merlijn van Deen wrote:
- Create a sentence out of those strings.
stringa + stringb + stringc + (...) ;)
"[...] Simple is better than complex." (from Tim Peters, "The Zen of Python")
Should print it out and try to learn it by heart :)
However, the main problem was the encoding: just adding the magical u (or ur, depending on the case) in front of *every* string did the trick. Thanks for the help!
Cheers, Raffa
pywikipedia-l@lists.wikimedia.org