[Pywikipedia-l] [ pywikipediabot-Bugs-2105373 ] replace.py crashes
SourceForge.net
noreply at sourceforge.net
Sat Jan 10 23:18:36 UTC 2009
Bugs item #2105373, was opened at 2008-09-11 06:01
Message generated for change (Comment added) made by russblau
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603138&aid=2105373&group_id=93107
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
>Category: General
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Nobody/Anonymous (nobody)
Assigned to: Nobody/Anonymous (nobody)
>Summary: replace.py crashes
Initial Comment:
replace.py (r5884 Python 2.5.1) crashes on redirect pages?
Getting 60 pages from wikipedia:ru...
Sleeping for 18.4 seconds, 2008-09-11 13:40:55
No changes were necessary in [[Esuvee]]
No changes were necessary in [[Et Cetera (театр)]]
Traceback (most recent call last):
  File "D:\pywikipedia\pagegenerators.py", line 763, in __iter__
    yield loaded_page
GeneratorExit
Traceback (most recent call last):
  File "D:\pywikipedia\replace.py", line 708, in <module>
    main()
  File "D:\pywikipedia\replace.py", line 704, in main
    bot.run()
  File "D:\pywikipedia\replace.py", line 373, in run
    new_text = self.doReplacements(new_text)
  File "D:\pywikipedia\replace.py", line 341, in doReplacements
    allowoverlap=self.allowoverlap)
  File "D:\pywikipedia\wikipedia.py", line 3315, in replaceExcept
    text = text[:match.start()] + replacement + text[match.end():]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 1: ordinal not in range(128)
There are a number of redirect pages following [[Et Cetera (театр)]] (http://ru.wikipedia.org/w/index.php?title=%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F%3AAllPages&from=Esuvee&to=Ethernet&namespace=0); could this be the source of the problem?
----------------------------------------------------------------------
>Comment By: Russell Blau (russblau)
Date: 2009-01-10 18:18
Message:
Please try upgrading to SVN version (r6248) and see if the problem occurs
again. If so, please post the command line you were running when the
script crashed.
The Python message indicates that this is a Unicode error, so I do not
think it has anything to do with whether a page is a redirect; it probably
has to do with some of the text in the page body.
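For illustration, a minimal sketch of the mechanism behind this kind of error. It is not the pywikipedia code: concat_like_python2 and to_text are hypothetical helpers, the sample strings are invented, and the sketch is written for Python 3, where the implicit ASCII decode that Python 2 performs when a byte string is concatenated with a unicode string has to be spelled out explicitly.

```python
def concat_like_python2(text, replacement):
    """Emulate Python 2's `unicode + str`: the byte string is decoded as ASCII first."""
    if isinstance(replacement, bytes):
        replacement = replacement.decode("ascii")  # raises on any byte >= 0x80
    return text + replacement


def to_text(value, encoding="utf-8"):
    """Hypothetical guard: decode byte strings explicitly before any concatenation."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Invented sample data: one ASCII character followed by a Cyrillic letter, UTF-8 encoded.
raw_replacement = "(а".encode("utf-8")

try:
    concat_like_python2("prefix ", raw_replacement)
except UnicodeDecodeError as err:
    # Same kind of failure as in the traceback: a non-ASCII byte hits the ASCII codec.
    print(err)

# Decoding explicitly with the right codec avoids the implicit ASCII step.
print(concat_like_python2("prefix ", to_text(raw_replacement)))
```

If this is what happens in replaceExcept, either the page text or the replacement string would be reaching the concatenation as undecoded bytes; the command line and the fix definition would help tell which.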
----------------------------------------------------------------------
Comment By: Nobody/Anonymous (nobody)
Date: 2009-01-10 13:46
Message:
I have investigated the issue further. It seems the bot always crashes
on pages containing "(число)" in the title (that is "number" in
Russian), e.g. http://ru.wikipedia.org/wiki/221_(число). Can someone
figure out what is wrong with this byte sequence? Or is the problem not in
the title but in the page body? I see nothing extraordinary there.
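One way to inspect the byte sequence is to encode the title fragment as UTF-8 and find the first byte the ASCII codec would reject, then compare it with the byte and position named in the traceback. A small sketch (Python 3; first_non_ascii_byte is a hypothetical helper, not part of pywikipedia):

```python
def first_non_ascii_byte(text, encoding="utf-8"):
    """Return (position, byte) of the first byte >= 0x80 in the encoded text, or None."""
    data = text.encode(encoding)
    for position, byte in enumerate(data):
        if byte >= 0x80:
            return position, byte
    return None


for fragment in ("(число)", "221_(число)"):
    hit = first_non_ascii_byte(fragment)
    if hit is None:
        print(fragment, "is pure ASCII")
    else:
        position, byte = hit
        print(fragment, "first non-ASCII byte", hex(byte), "at position", position)
```

For "(число)" the first non-ASCII byte is 0xd1 at position 1, whereas the traceback names 0xd0. That suggests, but does not prove, that the string being decoded is something other than this title fragment.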
----------------------------------------------------------------------
Comment By: Nobody/Anonymous (nobody)
Date: 2009-01-10 12:40
Message:
Actually, it seems the bot does not crash on every page with
parentheses in the title, but on the first page with parentheses after
one or more redirect pages.
In this range of pages
[http://ru.wikipedia.org/w/index.php?title=%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F%3AAllPages&from=208&to=209&namespace=0]
it crashes after "2085 (альбом)". I guess the first "normal" page
with parentheses in the title after a number of redirects, "208
(число)", is causing the bot's crash.
----------------------------------------------------------------------
Comment By: Nobody/Anonymous (nobody)
Date: 2009-01-10 12:27
Message:
It crashes either on redirect pages or on pages containing parentheses. What
UTF-8 encoding problem might there be in redirects? I guess the bot should
skip them.
If you look at the Russian wiki, for example, [[2055 год]] is a normal page,
and the following page [[2056 год]] is a redirect
([http://ru.wikipedia.org/w/index.php?title=%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F%3AAllPages&from=2055+%D0%B3%D0%BE%D0%B4&to=205+%D0%B3%D0%BE%D0%B4&namespace=0]).
There are another three redirects, then a "normal" page with parentheses,
[[205 (число)]].
replace.py crashes right after the first page:
D:\pywikipedia>replace.py -lang:ru -fix:ru_fix -namespace:0 "-start:2055"
Getting 60 pages from wikipedia:ru...
No changes were necessary in [[2055 год]]
Traceback (most recent call last):
  File "D:\pywikipedia\pagegenerators.py", line 759, in __iter__
    yield loaded_page
GeneratorExit
Traceback (most recent call last):
  File "D:\pywikipedia\replace.py", line 733, in <module>
    main()
  File "D:\pywikipedia\replace.py", line 729, in main
    bot.run()
  File "D:\pywikipedia\replace.py", line 383, in run
    new_text = self.doReplacements(new_text)
  File "D:\pywikipedia\replace.py", line 351, in doReplacements
    allowoverlap=self.allowoverlap)
  File "D:\pywikipedia\wikipedia.py", line 3413, in replaceExcept
    text = text[:match.start()] + replacement + text[match.end():]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 1: ordinal not in range(128)
The bot behaves the same way when started with "-start:205 (", so I
think it is most likely the parenthesis in the title that causes the crash.
Can you please double-check and fix this issue?
Pywikipedia [http] trunk/pywikipedia (r6242, Jan 09 2009, 20:23:10)
Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)]
----------------------------------------------------------------------
Comment By: André Malafaya Baptista (malafaya)
Date: 2008-09-16 09:52
Message:
I believe it has something to do with some UTF-8 encoded string not being
considered as such. But that's as far as my "knowledge" goes.
----------------------------------------------------------------------