On 4/14/06, Erik Moeller eloquence@gmail.com wrote:
On 4/13/06, Gregory Maxwell gmaxwell@gmail.com wrote:
AFAIR, most string matches in mysql are case insensitive, which would mean that we could have indexed case insensitive matches quickly... but I'm guessing that our use of binary fields for titles (which is required because no version of mysql has complete UTF-8 support) most likely breaks that.
Yes, they are, and yes, it does. Could someone explain what the exact reason is that we're using varchar binary in MediaWiki for page titles? I've been using regular varchars for my WiktionaryZ tables, and so far it seems to work fine with UTF-8. Where exactly does a non-binary varchar break?
[snip]
Might it make sense to use non-binary strings and use binary comparisons instead where case-sensitivity is required?
It is my understanding that MySQL doesn't correctly handle UTF-8 where the characters are outside of the basic multilingual plane (BMP). These are the characters which can not be represented by UCS-16 (i.e. more then two bytes in the UTF-16 representation) and this includes a number of characters which are required for complete support of some asian languages.
I'm not sure how it fails (does it munge, or warn, .. since it's mysql I'm sure it doesn't reject bad data...), since I haven't tried it personally but thats where I understood the current issue was.
PostgreSQL had simmlar issues with non-bmp characters until version 8 (prior to version 8, PG would reject inserts with most non-bmp characters as invalid).
There are other issues as well.. for example, last I checked on enwiki there were some titles (and page/image links) which contained completely invalid characters... (back from when enwiki was latin1 and more permissive about binary input not so long ago). I see that a number of bad pages have been moved to broken/whatever .. so perhaps thats been cleaned up. ... a switch to utf-8 might potentially run into problems with this pages, depending on how strict mysqls input validation is (historically not strict at all... but...), I know it's always caused me troubles importing into postgresql (because in PG I don't use binary fields for mediawiki imported data, and PG's input validation is fairly strict).