Hi, I have the problem described in the subject. I'd like to do some complex
tasks on cs.wikisource documents, and because I can't write scripts in
Python, I've decided to use the pywikipedia framework to download and
upload pages, while the actual work on the source wikicode will be done
by my own specialized scripts written in Perl.
Now, here's the problem: I'm running get.py to download a specified page:
$ python get.py "Page name" > File
but the file does not contain UTF-8 characters: some letters (á, í, é)
appear as one-byte ISO-8859-1 characters, and other non-US-ASCII letters
(e.g. Czech letters with caron: ě, š, č, etc.) are written as
^C{lightyellow}e^C{default} (instead of the Czech letter "ě", i.e. the
HTML entity &#283;). The same happens when running:
$ python get.py "Page name" | less
Without redirecting to a file or pipe, I don't see any problem.
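My guess (an assumption on my part, not something I've confirmed in the
pywikipedia code) is that Python attaches a different encoding to stdout
depending on whether it is a terminal: on a tty it uses the locale encoding,
but on a pipe or file sys.stdout.encoding can be unset, so output falls back
to ASCII and the colour-escape transliteration. A minimal check, written in
modern Python:

```python
import sys

# Minimal check: report whether stdout is a terminal and which encoding
# Python has attached to it. On older interpreters, a redirected stdout
# reports None here and unicode output falls back to ASCII.
def stdout_info():
    return (sys.stdout.isatty(), sys.stdout.encoding)

tty, enc = stdout_info()
sys.stderr.write("isatty=%s encoding=%s\n" % (tty, enc))
```

Running this once directly and once with `> file` should show the difference.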
Even worse: if the page name contains UTF-8 characters, the file ends up
empty, because get.py tries to download a wrong (nonexistent) page:
$ python get.py 'Jaroslav Vrchlický' > file.txt
Checked for running processes. 1 processes currently running,
including the current process.
Traceback (most recent call last):
  File "get.py", line 33, in <module>
    main()
  File "get.py", line 29, in main
    wikipedia.output(page.get(), toStdout = True)
  File "/home/milos/wiki/pywikipediabot-new/wikipedia.py", line 668, in get
    self._contents = self._getEditPage(get_redirect = get_redirect,
    throttle = throttle, sysop = sysop)
  File "/home/milos/wiki/pywikipediabot-new/wikipedia.py", line 799, in _getEditPage
    raise NoPage(self.site(), self.aslink(forceInterwiki = True))
wikipedia.NoPage: (wikisource:cs, u'[[cs:Jaroslav Vrchlick\xc3\xbd]]')
Again, without redirected output I don't see any problem.
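Looking at the NoPage message, the title in the traceback is
u'...Vrchlick\xc3\xbd' — the two UTF-8 bytes of "ý" taken as two separate
Latin-1 characters. So my guess (again, an assumption) is that the framework
decodes the command-line argument as Latin-1 instead of using the locale.
A sketch of the difference:

```python
# The shell passes the page title as UTF-8 bytes under a cs_CZ.UTF-8 locale.
raw = b"Jaroslav Vrchlick\xc3\xbd"

wrong = raw.decode("latin-1")  # the mojibake title seen in the traceback
right = raw.decode("utf-8")    # the intended title "Jaroslav Vrchlický"
print(repr(wrong))
print(repr(right))
```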
I've tried both an older build and the current nightly build, on a
Fedora 7 system with the LC_ALL=cs_CZ.UTF-8 locale.
Do you have any idea how to simply download the source wikicode of a
specified page to a file, without the need for subsequent parsing of XML
output (as is necessary when using api.php)?
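(In case it helps the discussion: one workaround I'm considering is
MediaWiki's index.php with action=raw, which returns the plain wikitext
with no XML to parse. The host and path below are my assumptions for
cs.wikisource; a sketch in modern Python:)

```python
import urllib.parse

# Build an index.php?action=raw URL; action=raw returns plain wikitext.
# Host and path are assumptions for cs.wikisource; adjust as needed.
def build_raw_url(title, host="https://cs.wikisource.org"):
    # Percent-encode the UTF-8 title so non-ASCII page names survive.
    return "%s/w/index.php?title=%s&action=raw" % (
        host, urllib.parse.quote(title, safe=""))

print(build_raw_url("Jaroslav Vrchlický"))
```

Fetching it is then a plain HTTP GET (e.g. with urllib.request.urlopen),
and the body is whatever encoding the server sends, UTF-8 on Wikimedia sites.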
Thx.
Milda@cswikisource