Bugs item #3182761, was opened at 2011-02-15 21:40
Message generated for change (Comment added) made by valhallasw
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603138&aid=3182761...
Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.
Category: None
Group: None
Status: Open
Resolution: None
Priority: 7
Private: No
Submitted By: GanZ (ganz-ru)
Assigned to: Nobody/Anonymous (nobody)
Summary: Problem with Tibetan script
Initial Comment:
There is a heavy edit war here: http://en.wikipedia.org/w/index.php?title=Podolsk&action=history . Bots running older Python versions add an incorrect Tibetan interwiki link, while a bot running version 2.7.1 does it correctly.
----------------------------------------------------------------------
Comment By: Merlijn S. van Deen (valhallasw)
Date: 2011-02-16 09:28
Message: Aaand http://bugs.python.org/issue10567 is related to that.
In essence: bots running < 2.7 were technically doing the wrong thing, but this went unnoticed because no one used the interwiki link to the Tibetan Wikipedia, and all bots did the same wrong thing. Now there are bots running 2.7+ from the toolserver, and the bug has surfaced.
----------------------------------------------------------------------
Comment By: Merlijn S. van Deen (valhallasw) Date: 2011-02-16 09:21
Message: JAn Dudik moved the page, so the problem should be fixed for now. Keeping this open (it's a bug in pywikipedia, after all).
Related:
Python 2.6.5 (r265:79063, Apr 16 2010, 13:09:56)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> u'\u200b'.strip()
u''
Python 2.7.1 (r271:86832, Jan 4 2011, 13:57:14)
[GCC 4.5.2] on sunos5
Type "help", "copyright", "credits" or "license" for more information.
>>> u'\u200b'.strip()
u'\u200b'
\u200b is technically not whitespace, so strip() probably should not delete it.
Of course, pwb should not be stripping page titles in the first place.
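The difference between the two sessions above can be checked without relying on strip() alone. This is a minimal sketch, assuming an interpreter that does not treat U+200B as whitespace (as the 2.7.1 session shows); the variable names are illustrative and not part of pywikipedia.

```python
import unicodedata

ZWSP = u'\u200b'  # ZERO WIDTH SPACE

# General category 'Cf' is "format character", not a space separator ('Zs').
category = unicodedata.category(ZWSP)

# On interpreters that do not treat U+200B as whitespace,
# isspace() is False and strip() leaves the character in place.
is_space = ZWSP.isspace()
stripped = (u'  Title' + ZWSP + u'  ').strip()

print(category, is_space, repr(stripped))
```

On an affected older interpreter the last two values may differ, which is exactly the discrepancy between the bots.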
----------------------------------------------------------------------
Comment By: GanZ (ganz-ru) Date: 2011-02-16 01:46
Message: I agree that this symbol in titles is absolutely useless. Not only for bots, but for ordinary users too, since it can break their copy-paste operations.
If you can start a discussion on mediawiki-tech, please do.
Thank you.
----------------------------------------------------------------------
Comment By: Merlijn S. van Deen (valhallasw) Date: 2011-02-16 01:34
Message: Stripping is done in xmlreader.py:194. Calling strip() seems to remove the U+200B character indeed.
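One way to make the stripping independent of the Python version is to name the characters to strip explicitly. This is only a sketch under that assumption: strip_title and ASCII_WHITESPACE are hypothetical names, and this is not the actual code in xmlreader.py.

```python
# Only these characters are trimmed; U+200B is deliberately not listed.
ASCII_WHITESPACE = u' \t\r\n'


def strip_title(raw_title):
    """Hypothetical helper: trim only ASCII whitespace around a title."""
    return raw_title.strip(ASCII_WHITESPACE)


# Made-up title ending in a ZERO WIDTH SPACE, padded as XML text often is.
title = strip_title(u'\n  Podolsk\u200b \n')
print(repr(title))
```

Because the character set is passed explicitly, the result does not depend on what a given interpreter considers whitespace.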
----------------------------------------------------------------------
Comment By: Merlijn S. van Deen (valhallasw) Date: 2011-02-16 01:31
Message: Except I don't have move privileges on that wiki. Added a comment on the user talk page of the guy who created the page.
----------------------------------------------------------------------
Comment By: Merlijn S. van Deen (valhallasw) Date: 2011-02-16 01:10
Message: Ok. There are three things playing a role here.
1) The 'correct' page name ends in a 0x200B ZERO WIDTH SPACE. This makes no sense, other than to annoy people.
2) The XML parser strips spaces around titles, including the 0x200B ZERO WIDTH SPACE.
3) MediaWiki does *not* do this.
So, first of all, I will rename the article so that it no longer has the 0x200B ZERO WIDTH SPACE in the title. I will see if I can pinpoint the XML bug so that someone else can fix it. However, since bots are fighting each other over it, I suspect this is a small change somewhere in the Python APIs, or a default setting that changed.
Lastly, maybe we should broaden the discussion into mediawiki-tech -- should page titles be allowed to have unicode whitespace characters embedded, especially if they are invisible?
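For that discussion, invisible characters in a title can be listed mechanically. A rough sketch, assuming that "invisible" means Unicode general category 'Cf' (format characters); find_invisible_chars is a hypothetical helper and the sample title is made up.

```python
import unicodedata


def find_invisible_chars(title):
    """Return (index, name) pairs for format ('Cf') characters in title."""
    return [(i, unicodedata.name(ch, u'UNKNOWN'))
            for i, ch in enumerate(title)
            if unicodedata.category(ch) == 'Cf']


hits = find_invisible_chars(u'Podolsk\u200b')
print(hits)
```

A broader check could also include unusual space separators (category 'Zs'), but that is a policy choice rather than a technical one.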
----------------------------------------------------------------------
Comment By: GanZ (ganz-ru) Date: 2011-02-16 00:28
Message: Version.py:
Pywikipedia [http] trunk/pywikipedia (r8948, 2011/02/13, 09:19:56)
Python 2.6.4 (r264:75708, Oct 26 2009, 08:23:19) [MSC v.1500 32 bit (Intel)]
config-settings:
use_api = True
use_api_login = True
unicode test: ok
I'll be glad to help if you write the test code.
----------------------------------------------------------------------
Comment By: Merlijn S. van Deen (valhallasw) Date: 2011-02-16 00:17
Message: I meant the output if you type those lines into the python interpreter - but I've been poking around some more. It does not seem to be JSON related, though I am not sure either way. I think it has to do with some very old code called 'getall', which gets batches of pages.
Sigh. I would very much like to say: "bad luck, try the rewrite" - I'm almost afraid to touch that piece of code. I'll see if I can whip up a test you can run, though, to confirm my suspicions.
In the meantime, could you post the output of version.py?
Thanks.
----------------------------------------------------------------------
Comment By: GanZ (ganz-ru) Date: 2011-02-16 00:13
Message: If I did it right, the output is 4 files: __init__.pyc, decoder.pyc, encoder.pyc, scanner.pyc
Do you need them? I'm sorry, I'm not a Python programmer.
----------------------------------------------------------------------
Comment By: Merlijn S. van Deen (valhallasw) Date: 2011-02-15 23:03
Message: I see, indeed.
Could you post the output of
import query
print query.json.__file__
for your bot? I cannot reproduce the bug, so I suspect it might be due to a buggy json package. I'll do some package sniffing to check this further.
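For reference, the same kind of check can be tried outside the bot. This sketch uses the standard library json module rather than pywikipedia's query module, so it only shows the technique; json_path is an illustrative name.

```python
import json

# Path of the file the json package was loaded from. A ".pyc" suffix only
# means the compiled file was used; the directory is the interesting part.
json_path = getattr(json, '__file__', None)
print(json_path)
```

The result is a single path printed by the interpreter, not a set of files to collect.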
----------------------------------------------------------------------
Comment By: GanZ (ganz-ru) Date: 2011-02-15 22:22
Message: I'm sorry. All of them have versions older than 2.6.5.
----------------------------------------------------------------------
Comment By: GanZ (ganz-ru) Date: 2011-02-15 22:20
Message: Many other bots have the same problem: my bot, LucienBOT, VolkovBot. All of them have versions newer than 2.6.5.
----------------------------------------------------------------------
Comment By: Merlijn S. van Deen (valhallasw) Date: 2011-02-15 22:03
Message: My crystal ball suggests:
* Wikitanvirbot is running from the toolserver, which has a patched 2.7.1 without the unicode bug
* TXiKiBoT is running an old version of pywikipediabot (there is no python version in the edit summary) on python 2.6.5+
Conclusion: TXiKiBoT should be blocked until its owner fixes his/her setup.
----------------------------------------------------------------------
pywikipedia-bugs@lists.wikimedia.org