On Thu, 2002-05-23 at 02:21, Jan.Hidders wrote:
On Wed, May 22, 2002 at 03:34:08PM -0700, Brion L. VIBBER wrote:
If you'll recall, we're switching all the wikipedias to UTF-8 (if we don't
do it now, we'll just end up doing it in a few years and it'll be more
painful).
Uh oh. I remember we had that discussion and as I already said then: moving
to UTF-8 breaks the fulltext search.
Fulltext search is already broken in a million ways! It doesn't know
character references (ü, ĉ, į etc), it can't find
partial matches or sounds-likes, it can't find "X" when you search for
"Xs" or "Xs" when you search for "X", it doesn't return *ANY* results
for words it thinks are too common...
UTF-8 is the least of our problems; it just means that case-folding is a
little trickier (and if we had a decent $*#@%# database, it would take
care of that for us).
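
To make the case-folding point concrete, here is a minimal Python sketch. Python is used only for illustration; this is not the wiki's code, and the function names are made up. It assumes the folding code is handed raw UTF-8 bytes: a byte-wise lowercase only touches A-Z, while decoding first lets Unicode-aware folding handle letters such as Ĉ.

```python
def fold_bytes(data: bytes) -> bytes:
    """Byte-wise folding: bytes.lower() only changes ASCII A-Z."""
    return data.lower()


def fold_utf8(data: bytes) -> bytes:
    """Decode the UTF-8 bytes, fold as Unicode text, then re-encode."""
    return data.decode("utf-8").casefold().encode("utf-8")


word = "Ĉevalo".encode("utf-8")

# The byte-wise fold leaves the two bytes of "Ĉ" untouched,
# so "Ĉevalo" and "ĉevalo" would be indexed as different words.
print(fold_bytes(word).decode("utf-8"))

# The Unicode-aware fold maps "Ĉ" to "ĉ".
print(fold_utf8(word).decode("utf-8"))
```

Plain ASCII words fold the same way under both functions, which is why the problem only shows up once you leave that range.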
This is because the indexing algorithm
assumes we use Latin-1 and bases on that its decisions about how to collate
and which characters to index. For the German characters this will
probably more or less work out, but if you go beyond that (why else use
UTF-8?) you will get into severe trouble.
Do you realize how serious this situation is?
Yes, that's *exactly* why we have to break out of the "everything in the
world should be Latin-1, oh and by the way even though we have limited
support for other character sets -- but not the ones YOU need -- you can
only select one single character set for the whole database server!
MWOOHAAHAAHAAA!" rut and make it work for the rest of us too.
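
As a rough illustration of the Latin-1 assumption discussed above, this Python sketch shows what a component that assumes Latin-1 sees when it is handed UTF-8 bytes. The helper name is hypothetical and this is not the actual indexing code. Each non-ASCII letter turns into two unrelated characters, so collation and the choice of which characters to index both go wrong.

```python
def as_latin1_view(text: str) -> str:
    """Return what a Latin-1-only component would see for UTF-8 encoded text."""
    return text.encode("utf-8").decode("latin-1")


# "ü" is stored as the two bytes C3 BC, which Latin-1 reads as "Ã" and "¼".
print(as_latin1_view("über"))

# Outside the Latin-1 repertoire it gets worse: "ĉ" is stored as C4 89,
# and Latin-1 reads the second byte as a control character.
print(len(as_latin1_view("ĉ")))
```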
In an ideal
world, locale settings would apply (and work correctly and
consistently!) and \w would match everything necessary, but... (PHP 4.1.0
has some sort of special UTF-8 mode for regexps that might or might not be
useful here.)
It's not so much PHP that is the problem, as it is MySQL.
The particular problem I was discussing was the regexps, which were a
PHP problem. MySQL could do all the magic it wanted; if the words don't
get through the regexps in the PHP code they'll never get anywhere in
the database's fulltext search.
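
Here is a sketch of the regexp problem, again in Python rather than the PHP actually in use, so treat it as an analogy and not the real code (the function names are invented). It assumes the word splitter runs \w over raw UTF-8 bytes, in which case only ASCII letters, digits, and underscore count as word characters and everything else splits the word.

```python
import re


def words_bytewise(data: bytes) -> list[bytes]:
    """Split on \\w over raw bytes: only ASCII alphanumerics and _ match."""
    return re.findall(rb"\w+", data)


def words_unicode(text: str) -> list[str]:
    """Split on \\w over decoded text: Unicode letters match too."""
    return re.findall(r"\w+", text)


phrase = "ĉevalo kaj ŝafo"

# The non-ASCII first letters are dropped, so the words that reach
# the search index are already mangled.
print(words_bytewise(phrase.encode("utf-8")))  # [b'evalo', b'kaj', b'afo']

# With Unicode-aware matching the words come through whole.
print(words_unicode(phrase))  # ['ĉevalo', 'kaj', 'ŝafo']
```

Whatever the database does afterwards, it only ever sees the output of the first function.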
Perhaps we should
consider moving to PostgreSQL, which really supports UTF-8 and is a better
database anyway (some special pages could be implemented far more
efficiently there).
Oh, it's not like that hasn't been suggested. If anybody knows how to go
about switching to Postgres, I sure as heck wouldn't object. I have no
emotional attachment to MySQL; as far as I know it's only being used
because Magnus was already familiar with it.
-- brion vibber (brion @ pobox.com)