Bugs item #2105373, was opened at 2008-09-11 10:01
Message generated for change (Comment added) made by nobody
You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=603138&aid=2105373...
Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Nobody/Anonymous (nobody)
Assigned to: Nobody/Anonymous (nobody)
Summary: replace.py crashes on redirect pages
Initial Comment: replace.py (r5884 Python 2.5.1) crashes on redirect pages?
Getting 60 pages from wikipedia:ru...
Sleeping for 18.4 seconds, 2008-09-11 13:40:55
No changes were necessary in [[Esuvee]]
No changes were necessary in [[Et Cetera (театр)]]
Traceback (most recent call last):
  File "D:\pywikipedia\pagegenerators.py", line 763, in __iter__
    yield loaded_page
GeneratorExit

Traceback (most recent call last):
  File "D:\pywikipedia\replace.py", line 708, in <module>
    main()
  File "D:\pywikipedia\replace.py", line 704, in main
    bot.run()
  File "D:\pywikipedia\replace.py", line 373, in run
    new_text = self.doReplacements(new_text)
  File "D:\pywikipedia\replace.py", line 341, in doReplacements
    allowoverlap=self.allowoverlap)
  File "D:\pywikipedia\wikipedia.py", line 3315, in replaceExcept
    text = text[:match.start()] + replacement + text[match.end():]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 1: ordinal not in range(128)
There are a number of redirect pages following [[Et Cetera (театр)]] - http://ru.wikipedia.org/w/index.php?title=%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%... - could this be the source of the problem?
----------------------------------------------------------------------
Comment By: Nobody/Anonymous (nobody) Date: 2009-01-10 17:27
Message: It crashes either on redirect pages or on pages containing parentheses. What UTF-8 encoding problem could there be in redirects? I guess the bot should skip them.
If you look e.g. at the Russian wiki, [[2055 год]] is a normal page, and the following page [[2056 год]] is a redirect ([http://ru.wikipedia.org/w/index.php?title=%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%...]). There are three more redirects, then a "normal" page with parentheses, [[205 (число)]].
replace.py crashes right after the first page:
D:\pywikipedia>replace.py -lang:ru -fix:ru_fix -namespace:0 "-start:2055"
Getting 60 pages from wikipedia:ru...
No changes were necessary in [[2055 год]]
Traceback (most recent call last):
  File "D:\pywikipedia\pagegenerators.py", line 759, in __iter__
    yield loaded_page
GeneratorExit

Traceback (most recent call last):
  File "D:\pywikipedia\replace.py", line 733, in <module>
    main()
  File "D:\pywikipedia\replace.py", line 729, in main
    bot.run()
  File "D:\pywikipedia\replace.py", line 383, in run
    new_text = self.doReplacements(new_text)
  File "D:\pywikipedia\replace.py", line 351, in doReplacements
    allowoverlap=self.allowoverlap)
  File "D:\pywikipedia\wikipedia.py", line 3413, in replaceExcept
    text = text[:match.start()] + replacement + text[match.end():]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 1: ordinal not in range(128)
The bot behaves the same way when started with "-start:205 (", so I think it is most likely the parentheses in the title that cause the crash. Can you please double-check and fix this issue?
Pywikipedia [http] trunk/pywikipedia (r6242, Jan 09 2009, 20:23:10) Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)]
----------------------------------------------------------------------
Comment By: André Malafaya Baptista (malafaya) Date: 2008-09-16 13:52
Message: I believe it has something to do with some UTF-8 encoded string not being considered as such. But that's as far as my "knowledge" goes.
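
To illustrate that suspicion, here is a minimal sketch of how a UTF-8 byte string that is not treated as text can produce exactly this error message. It is written for Python 3, where the ASCII decode has to be spelled out; Python 2.5 performed it implicitly whenever a byte str was concatenated with a unicode object. The helper names splice and to_text, the sample page text and the sample replacement are all hypothetical, and the thread does not confirm which operand in replaceExcept is actually the byte string.

```python
# Hypothetical reproduction sketch, not the pywikipedia code itself.

def splice(text, start, end, replacement):
    """Mimic `text[:start] + replacement + text[end:]` under Python 2 rules."""
    if isinstance(replacement, bytes):
        # Python 2 did this decode implicitly, always with the 'ascii' codec.
        replacement = replacement.decode("ascii")
    return text[:start] + replacement + text[end:]


def to_text(value, encoding="utf-8"):
    """Decode byte strings explicitly so that only text gets concatenated."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


text = "2055 год"                        # page text, already decoded
bad_replacement = " г.".encode("utf-8")  # UTF-8 bytes: b' \xd0\xb3.'

error_message = None
try:
    splice(text, 4, 8, bad_replacement)
except UnicodeDecodeError as exc:
    # Byte 0xd0 is the first byte of a Cyrillic letter encoded as UTF-8.
    error_message = str(exc)

# Decoding the replacement first avoids the implicit ASCII step.
fixed = splice(text, 4, 8, to_text(bad_replacement))

print(error_message)
print(fixed)
```

If this is what happens, the redirects and the parentheses would be incidental, and the place to look would be wherever a replacement string (for example one defined in a fix) enters the bot without being decoded.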
----------------------------------------------------------------------