On Tue, Jul 28, 2009 at 11:53 AM, Paul Houle <paul@ontology2.com> wrote:
> I've been looking at the id structure of dbpedia and wikipedia and finally found an example where case sensitivity issues really bite.
We should keep in mind that case isn't so clear-cut if you move away from English, though -- is "groß" the same as "GROSS" and thus the same as "gross"? How about languages that don't even have bijections between uppercase and lowercase if you stick to the same dialect? (I'm pretty sure there are some; don't some languages strip diacritics from uppercase letters?) There's probably some Unicode standard on normalization with respect to case, but it's not actually so simple in an international context.
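To make the "groß" case concrete, here is a minimal sketch in Python (used purely for illustration; this is not MediaWiki code). It relies only on the built-in str methods, and the helper name same_ignoring_case is made up for this example. Unicode does define a case folding operation for caseless comparison, which is what str.casefold() applies; it is locale-independent, so it does not settle language-specific questions.

```python
# Illustration only: Python's built-in Unicode case operations.
word = "groß"

upper = word.upper()          # "ß" uppercases to the two letters "SS"
lower_again = upper.lower()   # lowercasing does not bring the "ß" back

print(upper)                  # GROSS
print(lower_again)            # gross
print(lower_again == word)    # False: upper/lower is not a round trip


def same_ignoring_case(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after Unicode case folding."""
    return a.casefold() == b.casefold()


print(same_ignoring_case("groß", "GROSS"))  # True
```

So under default case folding all three spellings compare equal, but you can't get there by picking "the uppercase form" or "the lowercase form" as canonical and expecting it to round-trip.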
That said, I think case-insensitivity would be a good thing to support in the long run, optionally, and that it would probably be suitable for all Wikipedias. Or at least almost all, if there are languages out there where case insensitivity is a real headache -- hopefully not, since most languages don't have letter case at all. At any rate it would be good on enwiki.
But it would require a lot of tedious and error-prone conversion of old code. Everything tends to assume that a) $title->getPrefixedText() is what should be displayed to the user, but b) two titles are equal if and only if their $title->getPrefixedText()s are equal. Likewise for $title->getPrefixedDbKey(). Those would need to be systematically and thoroughly fixed. We'd also have to add a field to the page table or such to store the normalized form of the title, and fiddle with the indexes appropriately, and update all other tables to use the normalized form. A lot of work.
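Roughly, the separation I have in mind looks like the sketch below. It's in Python rather than PHP and is not MediaWiki's actual Title class; the class, the normalize_title function and the particular normalization rules (underscores to spaces, whitespace collapsed, case folded) are all assumptions made up for illustration. The point is only that the displayed form and the form used for equality and lookups are stored separately.

```python
def normalize_title(text: str) -> str:
    """Hypothetical normalization: underscores become spaces,
    runs of whitespace collapse to one space, then the result is case-folded."""
    return " ".join(text.replace("_", " ").split()).casefold()


class Title:
    """Illustrative stand-in, not MediaWiki's Title class."""

    def __init__(self, display: str):
        self.display = display              # what is shown to the user
        self.key = normalize_title(display) # what equality and lookups use

    def __eq__(self, other):
        if not isinstance(other, Title):
            return NotImplemented
        return self.key == other.key

    def __hash__(self):
        # Hash on the normalized key so equal titles land in the same bucket.
        return hash(self.key)


a = Title("Main_Page")
b = Title("main page")

pages = {a: "page row 1"}   # stands in for an index on the normalized column

print(a == b)               # True
print(a.display, "|", b.display)
print(pages[b])             # page row 1
```

Every place in the existing code that compares the displayed strings directly would have to be switched over to compare something like the key instead, which is where the tedium comes in.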
(But at least we could get rid of the silly Text/DbKey distinction while we're doing this. I've heard recent MySQL versions actually support storage of ASCII space characters in text fields!)