Bugs item #3081100, was opened at 2010-10-04 19:53
Message generated for change (Comment added) made by sf-robot
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603138&aid=3081100...
Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.
Category: interwiki
Group: None
Status: Closed
Resolution: Wont Fix
Priority: 7
Private: No
Submitted By: Grimlock (grimlockfr)
Assigned to: xqt (xqt)
Summary: Unicode bug: some page titles are mangled
Initial Comment:
Pywikipedia [http] trunk/pywikipedia (r8602, 2010/10/04, 19:33:48)
Python 2.7 (r27:82525, Jul 4 2010, 09:01:59) [MSC v.1500 32 bit (Intel)]
config-settings:
use_api = True
use_api_login = Tru
My interwiki bot on Wikipedia (using interwiki.py) cannot correctly identify the interwiki link to hi. As a consequence, the link is treated as a bad one and is removed when I use the -cleanup option (see http://fr.wikipedia.org/w/index.php?title=Mark_Zuckerberg&action=history... for an example). It appears that one or more characters are misinterpreted.
----------------------------------------------------------------------
Comment By: SourceForge Robot (sf-robot)
Date: 2011-07-17 14:20
Message: This Tracker item was closed automatically by the system. It was previously set to a Pending status, and the original submitter did not respond within 14 days (the time period specified by the administrator of this Tracker).
----------------------------------------------------------------------
Comment By: xqt (xqt)
Date: 2011-07-03 13:26
Message: Python 2.7.2 was released on Sunday, 12 June 2011. This release no longer triggers Unicode bug 3081100, which occurred for characters with multiple accents (for example on hak-, hi-, cdo- and sa-wiki). I would highly recommend migrating to this new release if the local version has this bug.
Could we close this tracker?
----------------------------------------------------------------------
Comment By: Merlijn S. van Deen (valhallasw)
Date: 2011-03-16 09:39
Message: I cannot edit details, but I have edited the summary to be a bit more descriptive.
----------------------------------------------------------------------
Comment By: Nemo (nemobis)
Date: 2011-03-16 08:35
Message: Thank you. Could you please make the bug subject more descriptive? Even after reading all the comments I wasn't able to understand it completely, and it would be better if bot runners, who are sent to this bug by interwiki.py, could understand what the problem is and take the necessary measures (e.g. not using -force or -cleanup, I suppose). Thank you very much!
----------------------------------------------------------------------
Comment By: Merlijn S. van Deen (valhallasw)
Date: 2011-03-16 08:27
Message: It happens for any page title where the (correct) MediaWiki Unicode normalization does not equal the (incorrect) Python normalization. As a general guideline, this only happens for characters with multiple accents (say, 3 or so). It is not limited to hi:, though!
I think most Latin and Cyrillic character sets are generally safe. For others, I have no idea; we have had reports for several languages.
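The mismatch described above can be probed on a local interpreter. The following is a minimal sketch, assuming Python 3 and only the standard unicodedata module; the helper names combining_profile and changed_by_nfc are hypothetical and not part of pywikipedia. It only reports what the local Python does and says nothing about what MediaWiki's own normalizer returns.

```python
import unicodedata


def combining_profile(title):
    """Return (code point, canonical combining class) pairs for a title."""
    return [("U+%04X" % ord(ch), unicodedata.combining(ch)) for ch in title]


def changed_by_nfc(title):
    """Return True if the local Python's NFC normalization alters the title."""
    return unicodedata.normalize("NFC", title) != title


if __name__ == "__main__":
    # Title taken from the comparison posted in this thread (hi: link).
    title = ("\u092e\u093e\u0930\u094d\u0915 "
             "\u091c\u093c\u0941\u0915\u0947\u0930\u092c\u0930\u094d\u0917")
    for code_point, ccc in combining_profile(title):
        print(code_point, ccc)
    print("changed by NFC:", changed_by_nfc(title))
```

A title that the wiki already serves in normalized form, but for which changed_by_nfc returns True locally, would be a candidate for this bug.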
----------------------------------------------------------------------
Comment By: Nemo (nemobis)
Date: 2011-03-16 08:10
Message: Does this bug affect other languages as well, or is it safe to use pywikipedia despite this problem as long as you don't touch hi links?
----------------------------------------------------------------------
Comment By: Grimlock (grimlockfr)
Date: 2010-11-02 16:03
Message: I used Python 2.7 when I discovered this bug. The bug is not fixed in 2.7 (or at least not in all 2.7 distributions).
----------------------------------------------------------------------
Comment By: Merlijn S. van Deen (valhallasw)
Date: 2010-11-02 14:47
Message: Just a quick update: upstream has confirmed this is a bug in the Python library. It should get fixed in 2.7 and 3.2, but it is not yet clear whether 2.6.6 will include the fix.
----------------------------------------------------------------------
Comment By: Merlijn S. van Deen (valhallasw)
Date: 2010-10-30 15:43
Message: Reported to the Python developers: http://bugs.python.org/issue10254
----------------------------------------------------------------------
Comment By: Merlijn S. van Deen (valhallasw)
Date: 2010-10-30 14:52
Message: C# test code: http://pastebin.ca/1977261
This does not show the regression. The C# library does not show PR29 issues.
I will file a bug with the Python developers about this shortly.
----------------------------------------------------------------------
Comment By: Merlijn S. van Deen (valhallasw)
Date: 2010-10-27 21:16
Message: One last comment: the problem does not appear in Python 2.6.5 and earlier. Consider using an older Python version if you work on Wikimedia sites.
Added warning in r8687.
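A startup check of this kind might look like the sketch below. It is an illustration only and not necessarily what r8687 does; the names SAMPLE_TITLE, normalization_is_broken and warn_if_broken are assumptions, and the sample is the title from the comparison posted in this thread. It is written for Python 3.

```python
import unicodedata
import warnings

# Title from this thread; it is already in NFC, so a correct
# normalizer must return it unchanged.
SAMPLE_TITLE = ("\u092e\u093e\u0930\u094d\u0915 "
                "\u091c\u093c\u0941\u0915\u0947\u0930\u092c\u0930\u094d\u0917")


def normalization_is_broken(sample=SAMPLE_TITLE):
    """Return True if the local unicodedata alters an already-NFC string."""
    return unicodedata.normalize("NFC", sample) != sample


def warn_if_broken():
    """Emit a warning on affected interpreters; return whether one was emitted."""
    if normalization_is_broken():
        warnings.warn(
            "This Python build mangles some Unicode page titles "
            "(tracker item 3081100); avoid -force and -cleanup."
        )
        return True
    return False


if __name__ == "__main__":
    warn_if_broken()
```

Checking the behaviour directly, rather than comparing version numbers, also covers distributions that backported or omitted the fix.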
----------------------------------------------------------------------
Comment By: Merlijn S. van Deen (valhallasw)
Date: 2010-10-27 20:54
Message: The last comments were also mine.
MediaWiki does not show problems related to PR29:

<?php
include_once('UtfNormal.php');

print bin2hex("\xe0\xad\x87\xcc\x80\xe0\xac\xbe") . "\n";
print bin2hex(UtfNormal::cleanUp("\xe0\xad\x87\xcc\x80\xe0\xac\xbe")) . "\n";

returns the expected

e0ad87cc80e0acbe
e0ad87cc80e0acbe

where no information loss is happening. This means it might be a bug introduced in the fix for PR29 in unicodedata.c.
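For comparison, the same byte sequence can be run through Python's own normalizer. This is a minimal sketch assuming Python 3 (bytes.hex() needs 3.5 or later); the variable names are illustrative. On an interpreter without the bug, both printed lines should be identical, as in the MediaWiki output above.

```python
import unicodedata

# The UTF-8 bytes from the PHP test: U+0B47, U+0300, U+0B3E.
raw = b"\xe0\xad\x87\xcc\x80\xe0\xac\xbe"
text = raw.decode("utf-8")

# NFC must leave this sequence alone: U+0300 sits between U+0B47 and
# U+0B3E and blocks their composition (the case covered by PR-29).
normalized = unicodedata.normalize("NFC", text)

print(raw.hex())
print(normalized.encode("utf-8").hex())
```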
----------------------------------------------------------------------
Comment By: Nobody/Anonymous (nobody)
Date: 2010-10-27 20:36
Message: Probably related to http://svn.python.org/view/python/branches/release26-maint/Modules/unicodeda... , and hence http://bugs.python.org/issue1054943# and http://www.unicode.org/review/pr-29.html
----------------------------------------------------------------------
Comment By: Nobody/Anonymous (nobody)
Date: 2010-10-27 20:22
Message: Okay, this seems to be a python2.6/2.7 or mediawiki bug. It is related to normalizing UTF-8 strings.
Check out the following:

(on py27)
Python 2.7 (r27:82500, Aug 5 2010, 04:28:45) [C] on sunos5
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodedata
>>> unicodedata.normalize('NFC', u'\u092e\u093e\u0930\u094d\u0915 \u091c\u093c\u0941\u0915\u0947\u0930\u092c\u0930\u094d\u0917') == u'\u092e\u093e\u0930\u094d\u0915 \u091c\u093c\u0941\u0915\u0947\u0930\u092c\u0930\u094d\u0917'
False

(on py26):
valhallasw@willow:~/src/pywikipedia-svn$ python2.6
Python 2.6.5 (r265:79063, Jul 10 2010, 17:50:38) [C] on sunos5
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodedata
>>> unicodedata.normalize('NFC', u'\u092e\u093e\u0930\u094d\u0915 \u091c\u093c\u0941\u0915\u0947\u0930\u092c\u0930\u094d\u0917') == u'\u092e\u093e\u0930\u094d\u0915 \u091c\u093c\u0941\u0915\u0947\u0930\u092c\u0930\u094d\u0917'
True
----------------------------------------------------------------------
Comment By: tjmoel (tjmoel)
Date: 2010-10-22 21:34
Message: Hi, my bot still makes the mistake: http://id.wikipedia.org/w/index.php?title=Archimedes&action=historysubmi...
Any idea how to solve this? Thanks
----------------------------------------------------------------------
Comment By: xqt (xqt)
Date: 2010-10-12 07:10
Message: Some bots are still affected by this bug: http://de.wikipedia.org/wiki/Spezial:Missbrauchsfilter-Logbuch?title=Spezial...
----------------------------------------------------------------------
Comment By: DJSasso (djsasso)
Date: 2010-10-07 19:02
Message: Never mind... I just noticed that you made a change to not remove hi links in autonomous mode.
----------------------------------------------------------------------
Comment By: DJSasso (djsasso)
Date: 2010-10-07 18:38
Message: I should note that this morning I updated to the most recent build and have not seen it since, and it's been about 6 hours now. So it may have been fixed in the most recent build. Or I may have just been lucky and not had any hi links get mistaken in that time.
----------------------------------------------------------------------
Comment By: DJSasso (djsasso)
Date: 2010-10-07 18:21
Message: Yeah, look at my edits on de. I reverted a bunch of my bot's changes.
http://de.wikipedia.org/wiki/Spezial:Beitr%C3%A4ge/Djsasso
----------------------------------------------------------------------
Comment By: xqt (xqt)
Date: 2010-10-07 16:35
Message: Most problems came from SassoBot, MastiBot, User:ChuispastonBot, VolkowBot, see http://de.wikipedia.org/wiki/Wikipedia:Bots/Notizen#Interwiki-Probleme_mit_h...
With the current py version, deletion of hi links is stopped. Well, I'll investigate your hint. Do you have some examples for me?
----------------------------------------------------------------------
Comment By: DJSasso (djsasso)
Date: 2010-10-07 12:26
Message: While doing some cleanup of my bot's edits on one wiki, I have seen at least 4 other bots doing this recently. So there is clearly an issue somewhere. I was running the new -cleanup option, so maybe that is what causes it.
----------------------------------------------------------------------
Comment By: DJSasso (djsasso)
Date: 2010-10-07 10:33
Message: It is doing it for me as well. It has been for the last few days, but since other bots seemed to fix it immediately, I didn't think it was a big issue, or thought it was maybe my machine. So I was trying to figure it out on my own. But if it's happening to others, it's clearly not just my machine.
----------------------------------------------------------------------
Comment By: xqt (xqt)
Date: 2010-10-05 13:17
Message: I found this bug this morning but now it works as expected.
----------------------------------------------------------------------
pywikipedia-bugs@lists.wikimedia.org