Erik Moeller wrote:
On 4/13/06, Gregory Maxwell gmaxwell@gmail.com wrote:
AFAIR, most string matches in mysql are case insensitive, which would mean that we could have indexed case insensitive matches quickly... but I'm guessing that our use of binary fields for titles (which is required because no version of mysql has complete UTF-8 support) most likely breaks that.
Yes, they are, and yes, it does. Could someone explain what the exact reason is that we're using varchar binary in MediaWiki for page titles? I've been using regular varchars for my WiktionaryZ tables, and so far it seems to work fine with UTF-8. Where exactly does a non-binary varchar break?
It could break string matching, but would definitely break sorting. (Sorting by codepoint may suck, but at least it's predictable.)
More generally, deliberately choosing a non-binary collation which applies to a *different character set* from the one really you're using seems pretty silly. You get unpredictable, incorrect sorting and potentially have strings rejected as invalid.
Might it make sense to use non-binary strings and use binary comparisons instead where case-sensitivity is required?
When MySQL finally reaches the mid-1990s and supports full Unicode, we'll be able to do that.
-- brion vibber (brion @ pobox.com)