[Pywikipedia-l] [ pywikipediabot-Bugs-2105373 ] replace.py crashes on redirect pages

SourceForge.net noreply at sourceforge.net
Sat Jan 10 18:46:17 UTC 2009


Bugs item #2105373, was opened at 2008-09-11 10:01
Message generated for change (Comment added) made by nobody
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=603138&aid=2105373&group_id=93107

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Nobody/Anonymous (nobody)
Assigned to: Nobody/Anonymous (nobody)
Summary: replace.py crashes on redirect pages

Initial Comment:
replace.py (r5884 Python 2.5.1) crashes on redirect pages?

Getting 60 pages from wikipedia:ru...
Sleeping for 18.4 seconds, 2008-09-11 13:40:55
No changes were necessary in [[Esuvee]]
No changes were necessary in [[Et Cetera (театр)]]
Traceback (most recent call last):
  File "D:\pywikipedia\pagegenerators.py", line 763, in __iter__
    yield loaded_page
GeneratorExit

Traceback (most recent call last):
  File "D:\pywikipedia\replace.py", line 708, in <module>
    main()
  File "D:\pywikipedia\replace.py", line 704, in main
    bot.run()
  File "D:\pywikipedia\replace.py", line 373, in run
    new_text = self.doReplacements(new_text)
  File "D:\pywikipedia\replace.py", line 341, in doReplacements
    allowoverlap=self.allowoverlap)
  File "D:\pywikipedia\wikipedia.py", line 3315, in replaceExcept
    text = text[:match.start()] + replacement + text[match.end():]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 1: ordinal not in range(128)

There are a number of redirect pages following [[Et Cetera (театр)]] - http://ru.wikipedia.org/w/index.php?title=%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F%3AAllPages&from=Esuvee&to=Ethernet&namespace=0 - could this be the source of the problem?

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2009-01-10 18:46

Message:
I have investigated the issue further. It seems the bot always crashes
on pages containing "(число)" in the title (that is "number" in
Russian), e.g. http://ru.wikipedia.org/wiki/221_(число). Could someone
figure out what's wrong with this byte sequence? Or is the problem not in
the title but in the page body? I see nothing extraordinary there.
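
One way to inspect the byte sequence in question is sketched below. It is
written for Python 3 (the report uses Python 2.5.1, where the same bytes
result from encoding a unicode object), and the variable names are
illustrative only.

```python
# UTF-8 bytes of the title fragment mentioned in the comment.
title_fragment = "(число)"
encoded = title_fragment.encode("utf-8")

# Byte at position 0 is the ASCII parenthesis; position 1 starts "ч".
byte_at_0 = encoded[0]
byte_at_1 = encoded[1]
print(hex(byte_at_0), hex(byte_at_1))
```

The byte at position 1 here is 0xd1, while the traceback names 0xd0 at
position 1, so the string that fails to decode may be something other than
this fragment. This is an observation, not a confirmed diagnosis.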

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2009-01-10 17:40

Message:
Actually, it seems the bot crashes not on every page containing
parentheses in the title, but on the first page with parentheses after
one or more redirect pages.

In this range of pages
[http://ru.wikipedia.org/w/index.php?title=%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F%3AAllPages&from=208&to=209&namespace=0]
it crashes after "2085 (альбом)". I guess the first "normal" page
with parentheses in the title after a number of redirects, "208
(число)", is causing the bot's crash.

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2009-01-10 17:27

Message:
It crashes either on redirect pages or on pages containing parentheses.
What UTF-8 encoding problem could there be in redirects? I guess the bot
should skip them.

If you look e.g. at the Russian wiki, [[2055 год]] is a normal page, and
the following page [[2056 год]] is a redirect
([http://ru.wikipedia.org/w/index.php?title=%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F%3AAllPages&from=2055+%D0%B3%D0%BE%D0%B4&to=205+%D0%B3%D0%BE%D0%B4&namespace=0]).
There are three more redirects, then a "normal" page with parentheses,
[[205 (число)]].

replace.py crashes right after the first page:

D:\pywikipedia>replace.py -lang:ru -fix:ru_fix -namespace:0 "-start:2055"
Getting 60 pages from wikipedia:ru...
No changes were necessary in [[2055 год]]
Traceback (most recent call last):
  File "D:\pywikipedia\pagegenerators.py", line 759, in __iter__
    yield loaded_page
GeneratorExit

Traceback (most recent call last):
  File "D:\pywikipedia\replace.py", line 733, in <module>
    main()
  File "D:\pywikipedia\replace.py", line 729, in main
    bot.run()
  File "D:\pywikipedia\replace.py", line 383, in run
    new_text = self.doReplacements(new_text)
  File "D:\pywikipedia\replace.py", line 351, in doReplacements
    allowoverlap=self.allowoverlap)
  File "D:\pywikipedia\wikipedia.py", line 3413, in replaceExcept
    text = text[:match.start()] + replacement + text[match.end():]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 1: ordinal not in range(128)

The bot behaves the same way when started with "-start:205 (", so I
think it is most likely the parenthesis in the title that causes the
crash. Can you please double-check and fix this issue?

Pywikipedia [http] trunk/pywikipedia (r6242, Jan 09 2009, 20:23:10)
Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)]
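
As a possible direction for a fix, the concatenation shown in the
traceback could be guarded by normalising the replacement to text first.
The sketch below is written for Python 3, where the two types are bytes and
str; in Python 2.5 the analogous types are str and unicode. The helper
names to_text and splice are hypothetical and are not part of pywikipedia,
and UTF-8 is assumed as the encoding of the byte string.

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding it first if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def splice(text, start, end, replacement):
    # Same slicing as the failing line in the traceback, but the
    # replacement is converted to text before concatenation.
    return text[:start] + to_text(replacement) + text[end:]


# Replace "(число)" in a title with the same fragment given as UTF-8 bytes.
result = splice("208 (число)", 4, 11, "(число)".encode("utf-8"))
print(result)
```

Whether the replacement string is actually the value that arrives as a
byte string in replace.py is an assumption here; the traceback only shows
that one of the concatenated operands could not be decoded as ASCII.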

----------------------------------------------------------------------

Comment By: André Malafaya Baptista (malafaya)
Date: 2008-09-16 13:52

Message:
I believe it has something to do with some UTF-8 encoded string not being
considered as such. But that's as far as my "knowledge" goes.
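
The hypothesis above can be illustrated with a minimal sketch. It is
written for Python 3, where the ASCII decode has to be requested
explicitly; in Python 2 (the report uses 2.5.1) the same decode happens
implicitly when a unicode object is concatenated with a byte string. The
replacement string " в" is a made-up example, chosen only because its UTF-8
form has an ASCII byte at position 0 and 0xd0 at position 1; it is not
taken from the ru_fix configuration.

```python
# Hypothetical replacement text: a space followed by a Cyrillic letter.
replacement_bytes = " в".encode("utf-8")  # b' \xd0\xb2'

try:
    # Python 2 performs this decode implicitly for unicode + str.
    replacement_bytes.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Decoding the same bytes as UTF-8 succeeds.
decoded = replacement_bytes.decode("utf-8")
print(error)
```

If this is what happens in replaceExcept, then one of the operands in the
failing concatenation is a UTF-8 byte string rather than a unicode object.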

----------------------------------------------------------------------
