https://bugzilla.wikimedia.org/show_bug.cgi?id=55246
Web browser: ---
Bug ID: 55246
Summary: Problem with Tibetan script
Product: Pywikibot
Version: unspecified
Hardware: All
OS: All
Status: NEW
Severity: normal
Priority: Unprioritized
Component: General
Assignee: Pywikipedia-bugs@lists.wikimedia.org
Reporter: legoktm.wikipedia@gmail.com
Classification: Unclassified
Mobile Platform: ---
Originally from: http://sourceforge.net/p/pywikipediabot/bugs/1295/
Reported by: ganz-ru
Created on: 2011-02-15 20:40:15
Subject: Problem with Tibetan script
Original description:
There is a heavy edit war here: http://en.wikipedia.org/w/index.php?title=Podolsk&action=history . Bots running an old Python version add an incorrect Tibetan interwiki link, while bots running version 2.7.1 add it correctly.
--- Comment #1 from Kunal Mehta (Legoktm) legoktm.wikipedia@gmail.com --- My crystal ball suggests:
* Wikitanvirbot is running from the toolserver, which has a patched 2.7.1 without the unicode bug
* TXiKiBoT is running an old version of pywikipediabot (there is no python version in the edit summary) on python 2.6.5+
Conclusion: TXiKiBoT should be blocked until its owner fixes his/her setup.
--- Comment #2 from Kunal Mehta (Legoktm) legoktm.wikipedia@gmail.com --- Many other bots have the same problem: my bot, LucienBOT, VolkovBot. All of them have versions newer than 2.6.5.
--- Comment #3 from Kunal Mehta (Legoktm) legoktm.wikipedia@gmail.com --- I'm sorry. All of them have versions older than 2.6.5.
--- Comment #4 from Kunal Mehta (Legoktm) legoktm.wikipedia@gmail.com --- I see, indeed.
Could you post the output of
    import query
    print query.json.__file__
for your bot? I cannot reproduce the bug, so I suspect it might be due to a buggy json package. I'll do some package sniffing to check this further.
--- Comment #5 from Kunal Mehta (Legoktm) legoktm.wikipedia@gmail.com --- If I did it right, the output is 4 files: __init__.pyc decoder.pyc encoder.pyc scanner.pyc
Do you need them? I'm sorry, I'm not a Python programmer.
--- Comment #6 from Kunal Mehta (Legoktm) legoktm.wikipedia@gmail.com --- I meant the output if you type those lines into the python interpreter - but I've been poking around some more. It does not seem to be JSON related, although I am not certain yet. I think it has to do with some very old code called 'getall', which gets batches of pages.
Sigh. I would very much like to say: "bad luck, try the rewrite" - I'm almost afraid to touch that piece of code. I'll see if I can whip up a test you can run, though, to confirm my suspicions.
In the meantime, could you post the output of version.py?
Thanks.
--- Comment #7 from Kunal Mehta (Legoktm) legoktm.wikipedia@gmail.com --- - **priority**: 5 --> 7
--- Comment #8 from Kunal Mehta (Legoktm) legoktm.wikipedia@gmail.com --- Version.py:
Pywikipedia [http] trunk/pywikipedia (r8948, 2011/02/13, 09:19:56)
Python 2.6.4 (r264:75708, Oct 26 2009, 08:23:19) [MSC v.1500 32 bit (Intel)]
config-settings:
use_api = True
use_api_login = True
unicode test: ok
I'll be glad to help if you write the test code.
--- Comment #9 from Kunal Mehta (Legoktm) legoktm.wikipedia@gmail.com --- Ok. There are three things playing a role here.
1) the 'correct' page name ends in a 0x200B ZERO WIDTH SPACE. This makes no sense, other than to annoy people.
2) the XML parser strips spaces around titles, including the 0x200B ZERO WIDTH SPACE.
3) Mediawiki does *not* do this
So, first of all, I will rename the article, so it no longer has the 0x200B ZERO WIDTH SPACE in the title. I will see if I can pinpoint the XML bug, so someone else may fix it. However, since bots are fighting each other over it, I suspect the cause is a small change somewhere in the Python APIs, or a default setting that changed.
Lastly, maybe we should broaden the discussion into mediawiki-tech -- should page titles be allowed to have unicode whitespace characters embedded, especially if they are invisible?
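To make that question concrete, here is a minimal Python 3 sketch of how a title could be checked for invisible characters. The function name find_invisible_chars and the sample title are hypothetical, for illustration only; nothing here is part of pywikipedia or MediaWiki. It flags Unicode format characters (general category Cf, which includes U+200B) and separator characters other than the ordinary space.

```python
import unicodedata


def find_invisible_chars(title):
    """Return (index, code point, name) for characters a reader cannot see.

    Hypothetical helper: not an existing pywikipedia or MediaWiki function.
    """
    hits = []
    for i, ch in enumerate(title):
        cat = unicodedata.category(ch)
        # Cf = format characters (e.g. U+200B); Zs/Zl/Zp = separators.
        # The ordinary space U+0020 is Zs but legitimate inside titles.
        if cat == "Cf" or (cat in ("Zs", "Zl", "Zp") and ch != " "):
            name = unicodedata.name(ch, "<unnamed>")
            hits.append((i, "U+%04X" % ord(ch), name))
    return hits


# Illustrative title with a trailing zero width space.
print(find_invisible_chars("Podolsk\u200b"))
```

A check like this could run before a title is saved or before an interwiki link is written, so that invisible trailing characters are at least reported instead of silently kept or silently dropped.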
--- Comment #10 from Kunal Mehta (Legoktm) legoktm.wikipedia@gmail.com --- Except I don't have move privileges on that wiki. Added a comment on the user talk page of the guy who created the page.
--- Comment #11 from Kunal Mehta (Legoktm) legoktm.wikipedia@gmail.com --- Stripping is done in xmlreader.py:194. Calling strip() seems to remove the U+200B character indeed.
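One possible workaround, sketched here in Python 3 syntax: pass an explicit character set to strip() so that only ASCII whitespace is trimmed, whatever the interpreter considers to be whitespace. The helper name strip_ascii_whitespace is an assumption for illustration; this is not the code at xmlreader.py:194, and whether titles should be trimmed at all is a separate question.

```python
# Characters to trim: space, tab, CR, LF, vertical tab, form feed.
ASCII_WHITESPACE = " \t\r\n\x0b\x0c"


def strip_ascii_whitespace(text):
    """Trim only ASCII whitespace from both ends (hypothetical helper)."""
    return text.strip(ASCII_WHITESPACE)


title = "  Example\u200b \n"
cleaned = strip_ascii_whitespace(title)
# The zero width space survives because it is not in ASCII_WHITESPACE.
print(repr(cleaned))
```

Because the set of stripped characters is spelled out, the result does not depend on which Python version the bot runs on.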
--- Comment #12 from Kunal Mehta (Legoktm) legoktm.wikipedia@gmail.com --- I agree that this symbol in titles is absolutely useless, not only for bots but for ordinary users too, since it can break their copy-paste operations.
If you can start a discussion on mediawiki-tech, please do.
Thank you.
--- Comment #13 from Kunal Mehta (Legoktm) legoktm.wikipedia@gmail.com --- JAn Dudik moved the page, so the problem should be fixed for now. Keeping this open (it's a bug in pywikipedia, after all).
Related:
Python 2.6.5 (r265:79063, Apr 16 2010, 13:09:56)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> u'\u200b'.strip()
u''

Python 2.7.1 (r271:86832, Jan 4 2011, 13:57:14)
[GCC 4.5.2] on sunos5
Type "help", "copyright", "credits" or "license" for more information.
>>> u'\u200b'.strip()
u'\u200b'
\u200b is technically not whitespace, so strip() probably should not delete it.
Of course, pwb should not be stripping page titles in the first place.
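The difference between the two interpreters can also be checked programmatically. The sketch below uses Python 3 syntax and a hypothetical function name, strip_removes_zwsp. It reports whether the running interpreter treats U+200B as strippable whitespace and prints the character's Unicode general category, which is Cf (format) and not one of the separator categories.

```python
import unicodedata

ZWSP = "\u200b"  # ZERO WIDTH SPACE


def strip_removes_zwsp():
    """Return True if this interpreter's strip() deletes U+200B."""
    return ZWSP.strip() == ""


print("category:", unicodedata.category(ZWSP))
print("isspace():", ZWSP.isspace())
print("strip() removes it:", strip_removes_zwsp())
```

A bot could run a check like this at startup and refuse to edit, or warn, when the interpreter shows the old behaviour.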
--- Comment #14 from Kunal Mehta (Legoktm) legoktm.wikipedia@gmail.com --- Aaand http://bugs.python.org/issue10567 is related to that.
In essence: bots running Python < 2.7 were technically doing the wrong thing, but this went unnoticed because no one used the interwiki link to the Tibetan Wikipedia and all bots did the same wrong thing. Now that some bots run 2.7+ from the toolserver, the bug has surfaced.
--- Comment #15 from Kunal Mehta (Legoktm) legoktm.wikipedia@gmail.com --- Wikimedia bugzilla bug entry: https://bugzilla.wikimedia.org/show_bug.cgi?id=27446
--- Comment #16 from Kunal Mehta (Legoktm) legoktm.wikipedia@gmail.com ---
- **Group**: --> confirmed
- **Priority**: 7 --> 2
Kunal Mehta (Legoktm) legoktm.wikipedia@gmail.com changed:
What    |Removed                     |Added
----------------------------------------------------------------------------
See Also|                            |https://sourceforge.net/p/pywikipediabot/bugs/1295
--- Comment #17 from Merlijn van Deen valhallasw@arctus.nl --- *** Bug 55227 has been marked as a duplicate of this bug. ***
Merlijn van Deen valhallasw@arctus.nl changed:
What    |Removed                     |Added
----------------------------------------------------------------------------
Priority|Unprioritized               |Low
CC      |                            |valhallasw@arctus.nl
Andre Klapper aklapper@wikimedia.org changed:
What    |Removed                     |Added
----------------------------------------------------------------------------
Keywords|                            |i18n
Summary |Problem with Tibetan script |Problem with Tibetan / Khmer script
Merlijn van Deen valhallasw@arctus.nl changed:
What    |Removed                             |Added
----------------------------------------------------------------------------
Keywords|i18n                                |
Summary |Problem with Tibetan / Khmer script |Problem with 0x200B ZERO WIDTH SPACE in page titles
John Mark Vandenberg jayvdb@gmail.com changed:
What    |Removed                     |Added
----------------------------------------------------------------------------
CC      |                            |jayvdb@gmail.com
See Also|                            |https://bugzilla.wikimedia.org/show_bug.cgi?id=27446
pywikipedia-bugs@lists.wikimedia.org