Hi, I have the problem described in the subject. I'd like to do some complex tasks on cs.wikisource documents, and because I can't write scripts in Python, I've decided to use the pywikipedia framework to download and upload pages, while the actual work on the source wikicode will be done by my own specialized Perl scripts.
Now, what the problem is: I'm running get.py to download a specified page:

$ python get.py "Page name" > File

but the file does not contain UTF-8: some letters (á, í, é) appear as single-byte ISO-8859-1 characters, and other non-US-ASCII letters (e.g. Czech letters with a caron: ě, š, č etc.) are written as ^C{lightyellow}e^C{default} instead of the Czech letter "ě" (HTML entity &#283;). The same happens when piping:

$ python get.py "Page name" | less

Without redirecting to a file or a pipe, I don't see any problem.
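[Editorial note: the byte-level difference behind this symptom can be illustrated with a short sketch in modern Python 3 syntax (the thread itself is from the Python 2 era); "á" exists in both encodings but "ě" exists only in UTF-8:]

```python
# "á" is two bytes in UTF-8 but a single byte in ISO-8859-1 (Latin-1),
# which is why a wrongly encoded output file shows one-byte letters.
print("á".encode("utf-8"))       # b'\xc3\xa1'  (two bytes)
print("á".encode("iso-8859-1"))  # b'\xe1'      (one byte)

# "ě" has no ISO-8859-1 code point at all, so a Latin-1 output stream
# cannot represent it and the tool must substitute something else.
print("ě".encode("utf-8"))       # b'\xc4\x9b'
```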
Even worse: if the page name contains UTF-8 characters, the file ends up empty, because get.py tries to download the wrong (nearly nonexistent) page:

$ python get.py 'Jaroslav Vrchlický' > file.txt
Checked for running processes. 1 processes currently running, including the current process.
Traceback (most recent call last):
  File "get.py", line 33, in <module>
    main()
  File "get.py", line 29, in main
    wikipedia.output(page.get(), toStdout = True)
  File "/home/milos/wiki/pywikipediabot-new/wikipedia.py", line 668, in get
    self._contents = self._getEditPage(get_redirect = get_redirect, throttle = throttle, sysop = sysop)
  File "/home/milos/wiki/pywikipediabot-new/wikipedia.py", line 799, in _getEditPage
    raise NoPage(self.site(), self.aslink(forceInterwiki = True))
wikipedia.NoPage: (wikisource:cs, u'[[cs:Jaroslav Vrchlick\xc3\xbd]]')
Without redirected output I don't see any problem.
I've tried both an older release and the current nightly build. This is on a Fedora 7 system with the LC_ALL=cs_CZ.UTF-8 locale.
Do you have any idea how to simply download the source wikicode of a specified page to a file, without needing to subsequently parse XML output (as is necessary when using api.php)? Thx.
Milda@cswikisource
wikipedista Milda wrote:
Hi, I have the problem described in the subject. I'd like to do some complex tasks on cs.wikisource documents, and because I can't write scripts in Python, I've decided to use the pywikipedia framework to download and upload pages, while the actual work on the source wikicode will be done by my own specialized Perl scripts.
$ python get.py 'Jaroslav Vrchlický' > file.txt
Checked for running processes. 1 processes currently running, including the current process.
Traceback (most recent call last):
  File "get.py", line 33, in <module>
    main()
  File "get.py", line 29, in main
    wikipedia.output(page.get(), toStdout = True)
  File "/home/milos/wiki/pywikipediabot-new/wikipedia.py", line 668, in get
    self._contents = self._getEditPage(get_redirect = get_redirect, throttle = throttle, sysop = sysop)
  File "/home/milos/wiki/pywikipediabot-new/wikipedia.py", line 799, in _getEditPage
    raise NoPage(self.site(), self.aslink(forceInterwiki = True))
wikipedia.NoPage: (wikisource:cs, u'[[cs:Jaroslav Vrchlick\xc3\xbd]]')
I have tried this example on a UTF-8 enabled Solaris (with the pl_PL.UTF-8 locale) and it worked fine for me, for both stdout and file output.
Can you check what you get from the following commands?
python -c 'import sys; print sys.stdout.encoding'
python -c 'import config; print config.console_encoding'
In case you get something different from UTF-8 and utf-8, you may try adding
console_encoding=UTF-8
at the end of the pywikipedia config.py file.
--Marcin
Marcin Cieslak wrote:
In case you get something different from UTF-8 and utf-8, you may try adding
console_encoding=UTF-8
at the end of the pywikipedia config.py file.
This line should be
console_encoding="UTF-8"
Sorry for the mistake.
--Marcin
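[Editorial note: the underlying cause is that Python chooses the locale's encoding for a terminal but may pick nothing (or the wrong codec) for a redirected file or pipe, which is what console_encoding overrides in pywikipedia. A generic workaround in modern Python 3 syntax, not specific to pywikipedia, is to wrap stdout in an explicit UTF-8 writer:]

```python
import io
import sys

# Force UTF-8 on the output stream, so redirected output is written
# as UTF-8 regardless of what encoding Python guessed from the locale.
utf8_out = io.TextIOWrapper(sys.stdout.buffer, encoding="utf-8")
utf8_out.write("Jaroslav Vrchlický\n")
utf8_out.flush()
```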
Marcin Cieslak wrote:
at the end of the pywikipedia config.py file.
console_encoding="UTF-8"
It works fine now, this setting also solved the same problem with pagegenerators.py. Thx!
Bryan Tong Minh wrote:
wget http://..../w/index.php?title=....&action=raw
Good workaround, I used it before Marcin's solution.
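[Editorial note: for the wget workaround to also cope with non-ASCII titles such as "Jaroslav Vrchlický", the title must be percent-encoded as UTF-8 in the URL. A small sketch that only builds the URL; the cs.wikisource.org host is assumed here since the thread does not spell out the full address:]

```python
from urllib.parse import quote

title = "Jaroslav Vrchlický"
# quote() percent-encodes the title as UTF-8, e.g. "ý" -> "%C3%BD".
url = ("https://cs.wikisource.org/w/index.php?title="
       + quote(title) + "&action=raw")
print(url)
# https://cs.wikisource.org/w/index.php?title=Jaroslav%20Vrchlick%C3%BD&action=raw
```

Remember to quote the whole URL on the shell command line, otherwise the & would send wget to the background.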
Best regards
-- Milda
2008/12/9 wikipedista Milda wikipedista.milda@gmail.com:
Do you have any idea how to simply download the source wikicode of a specified page to a file, without needing to subsequently parse XML output (as is necessary when using api.php)? Thx.