Bugs item #3182761, was opened at 2011-02-15 23:40
Message generated for change (Comment added) made by ganz-ru
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603138&aid=3182761...

Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.

Category: None
Group: None
Status: Open
Resolution: None
Priority: 7
Private: No
Submitted By: GanZ (ganz-ru)
Assigned to: Nobody/Anonymous (nobody)
Summary: Problem with Tibetan script
Initial Comment: There is a heavy edit war here: http://en.wikipedia.org/w/index.php?title=Podolsk&action=history . Bots running older Python versions add an incorrect Tibetan interwiki link, while bots running version 2.7.1 add it correctly.
----------------------------------------------------------------------
Comment By: GanZ (ganz-ru)
Date: 2011-02-16 03:46
Message: I agree that this symbol in titles is absolutely useless. Not only for bots, but for ordinary users too, since it can break their copy-paste operations.
If you can start a discussion on mediawiki-tech, please do.
Thank you.
----------------------------------------------------------------------
Comment By: Merlijn S. van Deen (valhallasw) Date: 2011-02-16 03:34
Message: Stripping is done in xmlreader.py:194. Calling strip() seems to remove the U+200B character indeed.
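The strip() behaviour mentioned here can be checked on any interpreter with a short sketch like the one below. The title is an illustrative stand-in, not the real page name, and the helper name strip_removes_zwsp is hypothetical. Whether strip() drops U+200B depends on the Unicode tables of the Python build in use, so the result may differ between the old interpreters discussed in this thread and newer ones.

```python
# -*- coding: utf-8 -*-
import unicodedata

ZWSP = u"\u200b"  # ZERO WIDTH SPACE


def strip_removes_zwsp():
    """Return True if this interpreter's strip() drops a trailing U+200B."""
    # Illustrative title only (an assumption, not the real page name).
    title = u"Podolsk" + ZWSP
    return title.strip() != title


if __name__ == "__main__":
    # Report how this interpreter classifies and handles the character.
    print("category of U+200B: %s" % unicodedata.category(ZWSP))
    print("strip() removes U+200B: %s" % strip_removes_zwsp())
```

A bot whose interpreter reports True here would request or write the title without the trailing character, which matches the mismatch described in this thread.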
----------------------------------------------------------------------
Comment By: Merlijn S. van Deen (valhallasw) Date: 2011-02-16 03:31
Message: Except I don't have move privileges on that wiki. Added a comment on the user talk page of the guy who created the page.
----------------------------------------------------------------------
Comment By: Merlijn S. van Deen (valhallasw) Date: 2011-02-16 03:10
Message: Ok. There are three points playing a role here.
1) The 'correct' page name ends in a 0x200B ZERO WIDTH SPACE. This makes no sense, other than to annoy people.
2) The XML parser strips spaces around titles, including the 0x200B ZERO WIDTH SPACE.
3) MediaWiki does *not* do this.
So, first of all, I will rename the article, so it no longer has the 0x200B ZERO WIDTH SPACE in the title. I will see if I can pinpoint the XML bug, so someone else may fix it. However, since bots are fighting each other over it, I suspect this is a small change somewhere in the Python APIs, or a default setting that changed.
Lastly, maybe we should broaden the discussion into mediawiki-tech -- should page titles be allowed to have unicode whitespace characters embedded, especially if they are invisible?
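As a rough illustration of the question above, a bot or a title check could flag titles that begin or end with an invisible character. This is only a sketch: the function name has_invisible_edge and the choice of Unicode categories are assumptions made for the example, not anything MediaWiki or pywikipedia does.

```python
# -*- coding: utf-8 -*-
import unicodedata

# Categories treated as "invisible" in this sketch (an assumption):
# Cf = format characters (includes U+200B), Zs = space separators,
# Cc = control characters.
INVISIBLE_CATEGORIES = frozenset(["Cf", "Zs", "Cc"])


def has_invisible_edge(title):
    """Return True if the first or last character of title is invisible."""
    # Slicing keeps this safe for empty strings.
    edges = title[:1] + title[-1:]
    return any(unicodedata.category(ch) in INVISIBLE_CATEGORIES
               for ch in edges)


if __name__ == "__main__":
    for candidate in (u"Podolsk", u"Podolsk\u200b", u" Podolsk"):
        print("%r -> %s" % (candidate, has_invisible_edge(candidate)))
```

Checking only the edges keeps legitimate uses of such characters inside a title untouched. Whether embedded ones should be allowed is the broader question raised above.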
----------------------------------------------------------------------
Comment By: GanZ (ganz-ru) Date: 2011-02-16 02:28
Message: Version.py:
Pywikipedia [http] trunk/pywikipedia (r8948, 2011/02/13, 09:19:56)
Python 2.6.4 (r264:75708, Oct 26 2009, 08:23:19) [MSC v.1500 32 bit (Intel)]
config-settings:
use_api = True
use_api_login = True
unicode test: ok
I'll be glad to help if you write the test code.
----------------------------------------------------------------------
Comment By: Merlijn S. van Deen (valhallasw) Date: 2011-02-16 02:17
Message: I meant the output you get when you type those lines into the Python interpreter, but I've been poking around some more. It does not seem to be JSON related, although I can't rule that out yet. I think it has to do with some very old code called 'getall', which gets batches of pages.
Sigh. I would very much like to say: "bad luck, try the rewrite" - I'm almost afraid to touch that piece of code. I'll see if I can whip up a test you can run, though, to confirm my suspicions.
In the meantime, could you post the output of version.py?
Thanks.
----------------------------------------------------------------------
Comment By: GanZ (ganz-ru) Date: 2011-02-16 02:13
Message: If I did it right, the output is 4 files: __init__.pyc decoder.pyc encoder.pyc scanner.pyc
Do you need them? I'm sorry, I'm not a Python programmer.
----------------------------------------------------------------------
Comment By: Merlijn S. van Deen (valhallasw) Date: 2011-02-16 01:03
Message: I see, indeed.
Could you post the output of
    import query
    print query.json.__file__
for your bot? I cannot reproduce the bug, so I suspect it might be due to a buggy json package. I'll do some package sniffing to check this further.
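The check requested here prints the path a module was loaded from, not the files in its directory. A minimal standalone sketch of the same idea follows. It uses the standard-library json module as a stand-in, since query.json belongs to the bot's own code, and the helper name module_path is hypothetical.

```python
import importlib


def module_path(name):
    """Import the named module and return the file it was loaded from.

    Returns None for modules that have no __file__ attribute
    (for example, modules compiled into the interpreter).
    """
    module = importlib.import_module(name)
    return getattr(module, "__file__", None)


if __name__ == "__main__":
    # "json" is a stand-in; the thread asks about query.json instead.
    print(module_path("json"))
```

The printed path shows whether the interpreter picked up its bundled package or a separately installed copy.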
----------------------------------------------------------------------
Comment By: GanZ (ganz-ru) Date: 2011-02-16 00:22
Message: I'm sorry. All of them have versions older than 2.6.5.
----------------------------------------------------------------------
Comment By: GanZ (ganz-ru) Date: 2011-02-16 00:20
Message: Many other bots have the same problem: my bot, LucienBOT, VolkovBot. All of them have versions newer than 2.6.5.
----------------------------------------------------------------------
Comment By: Merlijn S. van Deen (valhallasw) Date: 2011-02-16 00:03
Message: My crystal ball suggests:
* Wikitanvirbot is running from the toolserver, which has a patched 2.7.1 without unicode bug
* TXiKiBoT is running an old version of pywikipediabot (there is no python version in the edit summary) on python 2.6.5+
Conclusion: TXiKiBoT should be blocked until its owner fixes his/her setup.
----------------------------------------------------------------------
pywikipedia-bugs@lists.wikimedia.org