On Wed, May 22, 2002 at 03:34:08PM -0700, Brion L. VIBBER wrote:
If you'll recall, we're switching all the wikipedias to UTF-8 (if we don't
do it now, we'll just end up doing it in a few years and it'll be more
painful).
Uh oh. I remember we had that discussion, and as I already said then: moving
to UTF-8 breaks the fulltext search. This is because the indexing algorithm
assumes we use Latin-1 and bases on that its decisions about how to collate
and which characters to index. For the German characters this will
probably more or less work out, but if you go beyond that (why else use
UTF-8?) you will get into severe trouble.
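
To make the failure mode concrete, here is a rough sketch in Python. It is not MySQL's actual indexing code; latin1_view and toy_index_terms are made-up names, and the toy indexer only mimics the one assumption at issue, namely that every byte is one Latin-1 character.

```python
import re

def latin1_view(text):
    """Return what a Latin-1-only program "sees" when handed UTF-8 bytes."""
    return text.encode("utf-8").decode("latin-1")

def toy_index_terms(raw):
    """Hypothetical indexer: treats each byte as one Latin-1 character and
    keeps runs of word characters. A stand-in, not MySQL's algorithm."""
    return re.findall(r"\w+", raw.decode("latin-1"))

if __name__ == "__main__":
    # One UTF-8 character becomes two unrelated Latin-1 characters.
    print(repr(latin1_view("ö")))
    # Plain ASCII is unaffected, which is why English text looks fine.
    print(toy_index_terms("wiki".encode("utf-8")))
    # Cyrillic: four letters become eight bytes, none of them the original.
    print(toy_index_terms("вики".encode("utf-8")))
```

Which of the resulting bytes a real indexer keeps or drops depends on its own character tables, so the exact breakage will differ, but the terms it stores will not be the words people search for.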
Do you realize how serious this situation is?
In an ideal world, locale settings would apply (and work correctly and
consistently!) and \w would match everything necessary, but... (PHP 4.1.0
has some sort of special UTF-8 mode for regexps that might or might not be
useful here.)
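
For illustration only, the difference between a byte-oriented \w and a Unicode-aware one can be sketched with Python's re module. This is a stand-in and not PHP's regexp engine, so it says nothing about how the PHP 4.1.0 mode actually behaves; words_bytewise and words_unicode are hypothetical helper names.

```python
import re

def words_bytewise(raw):
    """Match \\w against raw bytes: only ASCII letters, digits and _ count."""
    return [m.decode("utf-8", errors="replace")
            for m in re.findall(rb"\w+", raw)]

def words_unicode(raw):
    """Decode UTF-8 first, then match \\w against Unicode characters."""
    return re.findall(r"\w+", raw.decode("utf-8"))

if __name__ == "__main__":
    raw = "Zürich łódź".encode("utf-8")
    print(words_bytewise(raw))  # words are cut at every non-ASCII letter
    print(words_unicode(raw))   # whole words survive
```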
It's not so much PHP that is the problem as it is MySQL. Perhaps we should
consider moving to PostgreSQL, which really supports UTF-8 and is a better
database anyway (some special pages could be implemented far more
efficiently there).
-- Jan Hidders