On Thu, 2002-05-23 at 02:21, Jan.Hidders wrote:
On Wed, May 22, 2002 at 03:34:08PM -0700, Brion L. VIBBER wrote:
If you'll recall, we're switching all the wikipedias to UTF-8 (if we don't do it now, we'll just end up doing it in a few years and it'll be more painful).
Uh oh. I remember we had that discussion and as I already said then: moving to UTF-8 breaks the fulltext search.
Fulltext search is already broken in a million ways! It doesn't know character references (&uuml;, &#265;, &#303; etc), it can't find partial matches or sounds-likes, it can't find "X" when you search for "Xs" or "Xs" when you search for "X", it doesn't return *ANY* results for words it thinks are too common...
UTF-8 is the least of our problems; it just means that case-folding is a little trickier (and if we had a decent $*#@%# database, it would take care of that for us).
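To make the case-folding point concrete, here is a rough sketch in Python (not the PHP/MySQL code under discussion; the function names are made up for illustration). It assumes the folding step treats every byte as a Latin-1 character, and compares that with folding the decoded text:

```python
def latin1_fold(text):
    """Lowercase UTF-8 bytes as if each byte were one Latin-1 character.

    Hypothetical stand-in for a byte-oriented, Latin-1-minded folding step.
    """
    raw = text.encode("utf-8")
    return raw.decode("latin-1").lower().encode("latin-1")


def unicode_fold(text):
    """Fold case on the decoded characters instead of on the raw bytes."""
    return text.casefold()


# Byte-wise folding leaves the two spellings different (and mangles the bytes).
print(latin1_fold("ÜBER") == latin1_fold("über"))

# Character-wise folding makes them compare equal.
print(unicode_fold("ÜBER") == unicode_fold("über"))
```

The first comparison fails and the second succeeds, which is all "a little trickier" amounts to: the folding has to happen on characters, not on bytes.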
This is because the indexing algorithm assumes we use Latin-1, and bases both its collation and its choice of which characters to index on that assumption. For the German characters this will probably more or less work out, but if you go beyond that (why else use UTF-8?) you will get into severe trouble.
Do you realize how serious this situation is?
Yes, that's *exactly* why we have to break out of the "everything in the world should be Latin-1, oh and by the way even though we have limited support for other character sets -- but not the ones YOU need -- you can only select one single character set for the whole database server! MWOOHAAHAAHAAA!" rut and make it work for the rest of us too.
In an ideal world, locale settings would apply (and work correctly and consistently!) and \w would match everything necessary, but... (PHP 4.1.0 has some sort of special UTF-8 mode for regexps that might or might not be useful here.)
It's not so much PHP that is the problem, as it is MySQL.
The particular problem I was discussing was the regexps, which were a PHP problem. MySQL could do all the magic it wanted; if the words don't get through the regexps in the PHP code they'll never get anywhere in the database's fulltext search.
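As a rough illustration of the regexp problem (a sketch in Python rather than the actual PHP code, with hypothetical function names), this is what happens when \w is applied to raw UTF-8 bytes versus decoded text. Python's byte patterns match only ASCII word characters for \w, which stands in here for a regexp engine that doesn't know about UTF-8:

```python
import re


def words_bytewise(text):
    """Extract words with a byte-oriented \\w that only knows ASCII."""
    raw = text.encode("utf-8")
    found = re.findall(rb"\w+", raw)
    return [w.decode("utf-8", errors="replace") for w in found]


def words_unicode(text):
    """Extract words with a Unicode-aware \\w on the decoded text."""
    return re.findall(r"\w+", text)


sample = "ĉevalo kaj ŝafo"
print(words_bytewise(sample))  # the accented letters are dropped from the words
print(words_unicode(sample))   # the words come through whole
```

With the byte-oriented pattern the non-ASCII letters never make it into the extracted words, so whatever the database does afterwards, it is indexing truncated words.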
Perhaps we should consider moving to PostgreSQL, which really supports UTF-8 and is a better database anyway (some special pages could be implemented far more efficiently there).
Oh, it's not like that hasn't been suggested. If anybody knows how to go about switching to Postgres, I sure as heck wouldn't object. I have no emotional attachment to MySQL; as far as I know it's only being used because Magnus was already familiar with it.
-- brion vibber (brion @ pobox.com)