On Fri, May 13, 2011 at 7:57 PM, Daniel Friesen lists@nadir-seen-fire.com wrote:
Doesn't look that bad...
- Some arcane maintenance scripts.
- Some .js that works with URLs directly because it can't interact with Title.
- The expected User, Title, Parser, file related, etc... core api stuff
that's easy to tweak.
- Some hardcoded stuff for namespaces which could be improved, but
actually isn't all that applicable to what we're trying to fix.
- Some special pages cleaning up inputs where we might want to provide
something inside Title for that.
Except that there are who knows how many other places in the code that make such assumptions but aren't so easily found by searching.
On Fri, May 13, 2011 at 11:33 PM, Andrew Dunbar hippytrail@gmail.com wrote:
I'm almost positive Azeri has the same dotless i issue, and perhaps some of the other Turkic languages of Central Asia do too. One solution is to do accent/diacritic normalization too as part of the canonicalization.
The dotless-i issue affects "Turkic (Turkish/Azerbaijani)" text, according to http://userguide.icu-project.org/transforms/casemappings. This is a well-studied issue with existing standards, and we're not going to do better than what the Unicode Consortium has come up with.
You cannot fix the problem by doing accent/diacritic normalization. "i" and "I" are the same letter in English but different letters in Turkish. You cannot get around that. We'd need to have a separate case-folding algorithm for Turkish wikis, or make them use one that's incorrect for their language.
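To make that concrete, here is a rough sketch in Python (not MediaWiki code, and not a full implementation of the Unicode special-casing rules). The helper names `lower_turkic`, `upper_turkic`, and `strip_diacritics` are made up for illustration. It only handles the four i-like letters, and shows that stripping diacritics leaves dotless ı distinct from i.

```python
import unicodedata

# Hypothetical Turkic-specific mappings for the four i-like letters.
# Default (language-neutral) case mapping pairs I with i; Turkish and
# Azerbaijani pair I with dotless ı (U+0131) and İ (U+0130) with i.
_TURKIC_LOWER = {ord("I"): "\u0131", ord("\u0130"): "i"}
_TURKIC_UPPER = {ord("i"): "\u0130", ord("\u0131"): "I"}


def lower_turkic(text):
    """Lowercase text using the Turkic pairing for I/ı and İ/i."""
    return text.translate(_TURKIC_LOWER).lower()


def upper_turkic(text):
    """Uppercase text using the Turkic pairing for i/İ and ı/I."""
    return text.translate(_TURKIC_UPPER).upper()


def strip_diacritics(text):
    """Decompose to NFD and drop combining marks (accent stripping)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# The default mapping and the Turkic mapping disagree on the same input.
print("ISPARTA".lower())           # default pairing: I -> i
print(lower_turkic("ISPARTA"))     # Turkic pairing:  I -> ı

# Accent stripping does not help: ı has no decomposition, so it never
# becomes i, while İ decomposes to I plus a combining dot and collapses to I.
print(strip_diacritics("\u0131") == "i")
print(strip_diacritics("\u0130") == "I")
```

Whichever table you pick is wrong for somebody: the default one merges I and i, which Turkish treats as different letters, and the Turkic one breaks every other language. So the choice has to be made per wiki language rather than normalized away.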