Case insensitivity shouldn't be a problem for any language, as long as
you do it properly.
Turkish and other languages that use a dotless ı, for example, need a
special rule: lowercase dotted i capitalizes to capital dotted İ, while
lowercase dotless ı capitalizes to the regular dotless I.
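
A minimal sketch of that special rule in Python, for illustration only.
The helper names upper_tr and lower_tr are made up here; Python's
built-in str.upper() and str.lower() do not consult the locale, so the
Turkish pairs are swapped by hand before the default mapping runs.

```python
# Hypothetical helpers, not part of any library: apply the Turkish
# dotted/dotless i pairs first, then the default Unicode case mapping.
DOTTED_CAPITAL_I = "\u0130"   # İ
DOTLESS_SMALL_I = "\u0131"    # ı


def upper_tr(text: str) -> str:
    """Uppercase text using the Turkish i rule (i -> İ, ı -> I)."""
    swapped = text.replace("i", DOTTED_CAPITAL_I).replace(DOTLESS_SMALL_I, "I")
    return swapped.upper()


def lower_tr(text: str) -> str:
    """Lowercase text using the Turkish i rule (İ -> i, I -> ı)."""
    swapped = text.replace(DOTTED_CAPITAL_I, "i").replace("I", DOTLESS_SMALL_I)
    return swapped.lower()


if __name__ == "__main__":
    print(upper_tr("iki"))   # İKİ
    print("iki".upper())     # IKI (default mapping, wrong for Turkish)
    print(lower_tr("ISI"))   # ısı
```

The swap has to happen before the default mapping, because after a plain
upper() the dotted and dotless letters have already collapsed into the
same I.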
On Tue, Jul 28, 2009 at 9:26 AM, Aryeh Gregor
<Simetrical+wikilist(a)gmail.com> wrote:
On Tue, Jul 28, 2009 at 11:53 AM, Paul Houle
<paul(a)ontology2.com> wrote:
I've been looking at the id structure of dbpedia and wikipedia and
finally found an example where case sensitivity issues really bite.
We should keep in mind that case isn't so clear-cut if you move away
from English, though -- is "groß" the same as "GROSS" and thus the
same as "gross"? How about languages that don't even have bijections
between uppercase and lowercase if you stick to the same dialect?
(I'm pretty sure there are some; don't some languages strip diacritics
from uppercase letters?) There's probably some Unicode standard on
normalization with respect to case, but it's not actually so simple in
an international context.
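
The "groß" question can be tried out with Unicode case folding. This is
a rough sketch in Python, not a claim about what MediaWiki does:
caseless_key is a hypothetical name, and normalizing to NFC before
folding is an assumption made for the example.

```python
import unicodedata


def caseless_key(text: str) -> str:
    """Return a key for caseless comparison: NFC-normalize, then case-fold."""
    return unicodedata.normalize("NFC", text).casefold()


if __name__ == "__main__":
    # Full case folding maps ß to "ss", so all three spellings share a key.
    print(caseless_key("groß") == caseless_key("GROSS") == caseless_key("gross"))

    # Upper/lower mapping is not a bijection: the round trip loses the ß.
    print("groß".upper())          # GROSS
    print("groß".upper().lower())  # gross
```

Folding answers "are these equal ignoring case?" without needing a
reversible mapping, which is why it sidesteps the missing bijection. It
does not apply language-specific rules such as the Turkish i.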
That said, I think case-insensitivity would be a good thing to support
in the long run, optionally, and that it would probably be suitable
for all Wikipedias. Or at least almost all, if there are languages
out there where case insensitivity is a real headache -- hopefully
not, since most languages don't have letter case at all. At any rate
it would be good on enwiki.
But it would require a lot of tedious and error-prone conversion of
old code. Everything tends to assume that a)
$title->getPrefixedText() is what should be displayed to the user, but
b) two titles are equal if and only if their
$title->getPrefixedText()s are equal. Likewise for
$title->getPrefixedDbKey(). Those would need to be systematically and
thoroughly fixed. We'd also have to add a field to the page table or
such to store the normalized form of the title, and fiddle with the
indexes appropriately, and update all other tables to use the
normalized form. A lot of work.
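
A toy sketch of the split described above, in Python rather than
MediaWiki's PHP: the text shown to the user is kept separately from a
normalized key, and only the key decides equality. The Title class and
normalize_title here are hypothetical and are not MediaWiki's API; case
folding plus treating underscores as spaces is just one assumed
normalization.

```python
from dataclasses import dataclass, field


def normalize_title(text: str) -> str:
    """Assumed normalization: underscores act as spaces, whitespace is
    collapsed, and case is folded."""
    return " ".join(text.replace("_", " ").split()).casefold()


@dataclass(frozen=True)
class Title:
    # What the user sees; deliberately excluded from equality and hashing.
    display_text: str = field(compare=False)
    # What gets compared and indexed; derived from display_text.
    key: str = field(init=False)

    def __post_init__(self) -> None:
        # frozen=True blocks normal assignment, so set the derived field directly.
        object.__setattr__(self, "key", normalize_title(self.display_text))


if __name__ == "__main__":
    a = Title("Main_Page")
    b = Title("main page")
    print(a == b)          # True: equality uses the normalized key only
    print(a.display_text)  # Main_Page: display form is preserved
```

In database terms, key plays the role of the extra normalized column
that would be indexed, while display_text is what gets rendered.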
(But at least we could get rid of the silly Text/DbKey distinction
while we're doing this. I've heard recent MySQL versions actually
support storage of ASCII space characters in text fields!)
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l