Hi, I have the problem described in the subject. I'd like to do some complex tasks on cs.wikisource documents, and because I can't write scripts in Python, I've decided to use the pywikipedia framework to download and upload pages, while the actual work on the source wikicode will be done by my own specialized Perl scripts.
Now, what the problem is: I'm running get.py to download a specified page:

$ python get.py "Page name" > File

but the file does not contain UTF-8: some letters (á, í, é) appear as single-byte ISO-8859-1 characters, and other non-US-ASCII letters (e.g. Czech letters with a caron: ě, š, č etc.) are written as ^C{lightyellow}e^C{default} instead of the Czech letter "ě" (HTML entity &#283;). The same happens when piping:

$ python get.py "Page name" | less

Without redirecting to a file or a pipe, I don't see any problem.
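[Editorial note: the byte-level difference behind this symptom can be illustrated with a short sketch in modern Python 3 syntax (the thread itself is from the Python 2 era); "á" exists in both encodings but "ě" exists only in UTF-8:]

```python
# "á" is two bytes in UTF-8 but a single byte in ISO-8859-1 (Latin-1),
# which is why a wrongly encoded output file shows one-byte letters.
print("á".encode("utf-8"))       # b'\xc3\xa1'  (two bytes)
print("á".encode("iso-8859-1"))  # b'\xe1'      (one byte)

# "ě" has no ISO-8859-1 code point at all, so a Latin-1 output stream
# cannot represent it and the tool must substitute something else.
print("ě".encode("utf-8"))       # b'\xc4\x9b'
```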
Even worse: if the page name contains UTF-8 characters, the file ends up empty, because get.py tries to download the wrong (nearly nonexistent) page:

$ python get.py 'Jaroslav Vrchlický' > file.txt
Checked for running processes. 1 processes currently running, including the current process.
Traceback (most recent call last):
  File "get.py", line 33, in <module>
    main()
  File "get.py", line 29, in main
    wikipedia.output(page.get(), toStdout = True)
  File "/home/milos/wiki/pywikipediabot-new/wikipedia.py", line 668, in get
    self._contents = self._getEditPage(get_redirect = get_redirect, throttle = throttle, sysop = sysop)
  File "/home/milos/wiki/pywikipediabot-new/wikipedia.py", line 799, in _getEditPage
    raise NoPage(self.site(), self.aslink(forceInterwiki = True))
wikipedia.NoPage: (wikisource:cs, u'[[cs:Jaroslav Vrchlick\xc3\xbd]]')
Without redirected output I don't see any problem.
I've tried both an older release and the current nightly build. This is on a Fedora 7 system with the LC_ALL=cs_CZ.UTF-8 locale.
Do you have any idea how to simply download the source wikicode of a specified page to a file, without needing to subsequently parse XML output (as is necessary when using api.php)? Thx.
Milda@cswikisource
wikipedista Milda wrote:
Hi, I have the problem described in the subject. I'd like to do some complex tasks on cs.wikisource documents, and because I can't write scripts in Python, I've decided to use the pywikipedia framework to download and upload pages, while the actual work on the source wikicode will be done by my own specialized Perl scripts.
$ python get.py 'Jaroslav Vrchlický' > file.txt
Checked for running processes. 1 processes currently running, including the current process.
Traceback (most recent call last):
  File "get.py", line 33, in <module>
    main()
  File "get.py", line 29, in main
    wikipedia.output(page.get(), toStdout = True)
  File "/home/milos/wiki/pywikipediabot-new/wikipedia.py", line 668, in get
    self._contents = self._getEditPage(get_redirect = get_redirect, throttle = throttle, sysop = sysop)
  File "/home/milos/wiki/pywikipediabot-new/wikipedia.py", line 799, in _getEditPage
    raise NoPage(self.site(), self.aslink(forceInterwiki = True))
wikipedia.NoPage: (wikisource:cs, u'[[cs:Jaroslav Vrchlick\xc3\xbd]]')
I have tried this example on a UTF-8 enabled Solaris (with the pl_PL.UTF-8 locale) and it worked fine for me, for both stdout and file output.
Can you check what you get from the following commands?
python -c 'import sys; print sys.stdout.encoding'
python -c 'import config; print config.console_encoding'
In case you get something different from UTF-8 and utf-8, you may try adding
console_encoding=UTF-8
at the end of the pywikipedia config.py file.
--Marcin
Marcin Cieslak wrote:
In case you get something different from UTF-8 and utf-8, you may try adding
console_encoding=UTF-8
at the end of the pywikipedia config.py file.
This line should be
console_encoding="UTF-8"
Sorry for the mistake.
--Marcin
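[Editorial note: the underlying cause is that Python chooses the locale's encoding for a terminal but may pick nothing (or the wrong codec) for a redirected file or pipe, which is what console_encoding overrides in pywikipedia. A generic workaround in modern Python 3 syntax, not specific to pywikipedia, is to wrap stdout in an explicit UTF-8 writer:]

```python
import io
import sys

# Force UTF-8 on the output stream, so redirected output is written
# as UTF-8 regardless of what encoding Python guessed from the locale.
utf8_out = io.TextIOWrapper(sys.stdout.buffer, encoding="utf-8")
utf8_out.write("Jaroslav Vrchlický\n")
utf8_out.flush()
```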
Marcin Cieslak wrote:
at the end of the pywikipedia config.py file.
console_encoding="UTF-8"
It works fine now, this setting also solved the same problem with pagegenerators.py. Thx!
Bryan Tong Minh wrote:
wget http://..../w/index.php?title=....&action=raw
Good workaround, I used it before Marcin's solution.
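[Editorial note: for the wget workaround to also cope with non-ASCII titles such as "Jaroslav Vrchlický", the title must be percent-encoded as UTF-8 in the URL. A small sketch that only builds the URL; the cs.wikisource.org host is assumed here since the thread does not spell out the full address:]

```python
from urllib.parse import quote

title = "Jaroslav Vrchlický"
# quote() percent-encodes the title as UTF-8, e.g. "ý" -> "%C3%BD".
url = ("https://cs.wikisource.org/w/index.php?title="
       + quote(title) + "&action=raw")
print(url)
# https://cs.wikisource.org/w/index.php?title=Jaroslav%20Vrchlick%C3%BD&action=raw
```

Remember to quote the whole URL on the shell command line, otherwise the & would send wget to the background.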
Best regards
-- Milda
2008/12/9 wikipedista Milda wikipedista.milda@gmail.com:
Do you have any idea how to simply download the source wikicode of a specified page to a file, without needing to subsequently parse XML output (as is necessary when using api.php)? Thx.