Bugs item #3182761, was opened at 2011-02-15 21:40
Message generated for change (Comment added) made by valhallasw
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603138&aid=318276…
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Open
Resolution: None
Priority: 7
Private: No
Submitted By: GanZ (ganz-ru)
Assigned to: Nobody/Anonymous (nobody)
Summary: Problem with Tibetan script
Initial Comment:
There is a heavy edit war here:
http://en.wikipedia.org/w/index.php?title=Podolsk&action=history . Bots running
older Python versions add an incorrect Tibetan interwiki link, while a bot running
version 2.7.1 adds it correctly.
----------------------------------------------------------------------
Comment By: Merlijn S. van Deen (valhallasw)
Date: 2011-02-16 11:19
Message:
Wikimedia bugzilla bug entry:
https://bugzilla.wikimedia.org/show_bug.cgi?id=27446
----------------------------------------------------------------------
Comment By: Merlijn S. van Deen (valhallasw)
Date: 2011-02-16 09:28
Message:
Aaand
http://bugs.python.org/issue10567 is related to that.
In essence: bots running < 2.7 were technically doing the wrong thing, but
this went unnoticed because no one used the interwiki link to the Tibetan
Wikipedia, and all bots made the same mistake. Now there are bots
running 2.7+ from the toolserver, and the bug has surfaced.
----------------------------------------------------------------------
Comment By: Merlijn S. van Deen (valhallasw)
Date: 2011-02-16 09:21
Message:
JAn Dudik moved the page, so the problem should be fixed for now. Keeping
this open (it's a bug in pywikipedia, after all).
Related:
Python 2.6.5 (r265:79063, Apr 16 2010, 13:09:56)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> u'\u200b'.strip()
u''
Python 2.7.1 (r271:86832, Jan 4 2011, 13:57:14)
[GCC 4.5.2] on sunos5
Type "help", "copyright", "credits" or "license" for more information.
>>> u'\u200b'.strip()
u'\u200b'
\u200b is technically not whitespace, so strip() probably should not
delete it.
Of course, pwb should not be stripping page titles in the first place.
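The difference between the two interpreters can be checked without a wiki. What follows is a minimal sketch, written for Python 3 (where every str is Unicode) rather than for the Python 2 sessions quoted above. classify_char is a hypothetical helper name, not part of pywikipedia.

```python
import unicodedata

ZWSP = "\u200b"  # ZERO WIDTH SPACE


def classify_char(ch):
    """Return (Unicode name, general category, isspace()) for one character."""
    return unicodedata.name(ch, "<unnamed>"), unicodedata.category(ch), ch.isspace()


# U+200B is a format character (category Cf), not a space separator (Zs).
print(classify_char(ZWSP))
print(classify_char(" "))

# On Python 3, strip() leaves a trailing U+200B in place.
print(repr(("Title" + ZWSP).strip()))
```

A bot that compares titles after strip() would therefore see different strings depending on whether its interpreter treats U+200B as whitespace.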
----------------------------------------------------------------------
Comment By: GanZ (ganz-ru)
Date: 2011-02-16 01:46
Message:
I agree that this symbol in titles is absolutely useless, not only for
bots but for ordinary users too, since it can break their copy-paste
operations.
If you can start a discussion on mediawiki-tech, please do.
Thank you.
----------------------------------------------------------------------
Comment By: Merlijn S. van Deen (valhallasw)
Date: 2011-02-16 01:34
Message:
Stripping is done in xmlreader.py:194. Calling strip() does indeed seem to
remove the U+200B character.
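To illustrate where the character can get lost, here is a sketch under stated assumptions. It does not reproduce the code in xmlreader.py: it uses the standard library's xml.etree.ElementTree on a made-up <title> element, and legacy_strip is a hypothetical stand-in that imitates the older strip() behaviour by also treating U+200B as whitespace. It is written for Python 3.

```python
import string
import xml.etree.ElementTree as ET

ZWSP = "\u200b"  # ZERO WIDTH SPACE

# Made-up XML fragment; the title ends in a ZERO WIDTH SPACE.
xml_text = "<page><title>Example" + ZWSP + "</title></page>"

# Assumption: the old behaviour is modelled as ASCII whitespace plus U+200B.
LEGACY_WHITESPACE = string.whitespace + ZWSP


def legacy_strip(text):
    """Imitate a strip() that also removes U+200B from both ends."""
    return text.strip(LEGACY_WHITESPACE)


title = ET.fromstring(xml_text).findtext("title")
print(repr(title))                # the parser itself keeps the trailing U+200B
print(repr(legacy_strip(title)))  # the imitated old strip() drops it
print(repr(title.strip()))        # Python 3 strip() keeps it
```

In this sketch the XML parser preserves the title as-is, and the character only disappears in the explicit stripping step.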
----------------------------------------------------------------------
Comment By: Merlijn S. van Deen (valhallasw)
Date: 2011-02-16 01:31
Message:
Except I don't have move privileges on that wiki. Added a comment on the
user talk page of the guy who created the page.
----------------------------------------------------------------------
Comment By: Merlijn S. van Deen (valhallasw)
Date: 2011-02-16 01:10
Message:
Ok. There are three things playing a role here.
1) the 'correct' page name ends in a U+200B ZERO WIDTH SPACE. This makes
no sense, other than to annoy people.
2) the XML parser strips spaces around titles, including the U+200B ZERO
WIDTH SPACE.
3) MediaWiki does *not* do this
So, first of all, I will rename the article, so it no longer has the
U+200B ZERO WIDTH SPACE in the title. I will see if I can pinpoint the XML
bug, so someone else can fix it. However, since bots are fighting
each other over it, I suspect this is a small change somewhere in the
Python APIs, or a default setting that changed.
Lastly, maybe we should broaden the discussion into mediawiki-tech --
should page titles be allowed to have unicode whitespace characters
embedded, especially if they are invisible?
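As a sketch of what a check for such characters could look like, the hypothetical helper below lists characters in a title that are invisible or whitespace-like, judged by Unicode general category. The set of categories is an assumption made for illustration and does not reflect MediaWiki's actual title rules. It is written for Python 3.

```python
import unicodedata

# Assumption: these general categories count as "suspect" in a title.
# Cf = format, Zs/Zl/Zp = separators, Cc = control characters.
SUSPECT_CATEGORIES = {"Cf", "Zs", "Zl", "Zp", "Cc"}


def invisible_chars(title):
    """Return (index, code point, name) for each suspect character.

    The ordinary ASCII space is allowed, since titles legitimately use it.
    """
    return [
        (i, "U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(title)
        if ch != " " and unicodedata.category(ch) in SUSPECT_CATEGORIES
    ]


print(invisible_chars("Podolsk"))
print(invisible_chars("A B\u200b"))
```

Some format characters, such as ZERO WIDTH NON-JOINER, are meaningful in certain scripts, so a real policy would need exceptions rather than a blanket ban.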
----------------------------------------------------------------------
Comment By: GanZ (ganz-ru)
Date: 2011-02-16 00:28
Message:
Version.py:
Pywikipedia [http] trunk/pywikipedia (r8948, 2011/02/13, 09:19:56)
Python 2.6.4 (r264:75708, Oct 26 2009, 08:23:19) [MSC v.1500 32 bit
(Intel)]
config-settings:
use_api = True
use_api_login = True
unicode test: ok
I'll be glad to help if you write the test code.
----------------------------------------------------------------------
Comment By: Merlijn S. van Deen (valhallasw)
Date: 2011-02-16 00:17
Message:
I meant the output if you type those lines into the python interpreter -
but I've been poking around some more. It does not seem to be JSON related,
although I cannot rule that out yet. I think it has to do with some very
old code called 'getall', which gets batches of pages.
Sigh. I would very much like to say: "bad luck, try the rewrite" - I'm
almost afraid to touch that piece of code. I'll see if I can whip up a test
you can run, though, to confirm my suspicions.
In the meanwhile, could you post the output of version.py?
Thanks.
----------------------------------------------------------------------
Comment By: GanZ (ganz-ru)
Date: 2011-02-16 00:13
Message:
If I did it right, the output is 4 files:
__init__.pyc
decoder.pyc
encoder.pyc
scanner.pyc
Do you need them?
I'm sorry, I'm not a Python programmer.
----------------------------------------------------------------------
Comment By: Merlijn S. van Deen (valhallasw)
Date: 2011-02-15 23:03
Message:
I see, indeed.
Could you post the output of
import query
print query.json.__file__
for your bot? I cannot reproduce the bug, so I suspect it might be due to
a buggy json package. I'll do some package sniffing to check this further.
----------------------------------------------------------------------
Comment By: GanZ (ganz-ru)
Date: 2011-02-15 22:22
Message:
I'm sorry. All of them have versions older than 2.6.5.
----------------------------------------------------------------------
Comment By: GanZ (ganz-ru)
Date: 2011-02-15 22:20
Message:
Many other bots have the same problem: my bot, LucienBOT, VolkovBot. All of
them have versions newer than 2.6.5.
----------------------------------------------------------------------
Comment By: Merlijn S. van Deen (valhallasw)
Date: 2011-02-15 22:03
Message:
My crystal ball suggests:
* Wikitanvirbot is running from the toolserver, which has a patched 2.7.1
without the Unicode bug
* TXiKiBoT is running an old version of pywikipediabot (there is no
python version in the edit summary) on python 2.6.5+
Conclusion: TXiKiBoT should be blocked until its owner fixes his/her
setup.
----------------------------------------------------------------------