On Thu, May 23, 2002 at 03:42:59AM -0700, Brion L. VIBBER wrote:
On ĵaÅ, 2002-05-23 at 02:21, Jan.Hidders wrote:
Uh oh. I remember we had that discussion and as I already said then:
moving to UTF-8 breaks the fulltext search.
Fulltext search is already broken in a million ways! It doesn't know
character references (ü, ĉ, į etc), it can't find
partial matches or sounds-likes, it can't find "X" when you search for
"Xs" or "Xs" when you search for "X", it doesn't return
*ANY* results
for words it thinks are too common...
All true, although the last problem will be solved when we move to the new
MySQL and use the boolean search there. But at least there was a
well-defined semantics: if a word shows up in the edit-text the search will
find it. This is now no longer the case and no clever PHP programming can
solve this.
UTF-8 is the least of our problems; it just means that
case-folding is a
little trickier (and if we had a decent $*#@%# database, it would take
care of that for us).
Case folding is not the only problem. The problem is that the fulltext index
does not index certain characters above 128. That means that words that
contain multibyte characters that are represented using such characters will
not be indexed. That's a bit harder to explain to the users than the
previous problems.
It's not
so much PHP that is the problem, as it is MySQL.
The particular problem I was discussing was the regexps, which were a
PHP problem. MySQL could do all the magic it wanted; if the words don't
get through the regexps in the PHP code they'll never get anywhere in
the database's fulltext search.
I know, but what I am saying is that this is something we can solve with PHP
programming. But if MySQL doesn't index it then we can we can let through all
we want but it won't turn up in the search results. Period.
Perhaps we
should considering moving to PostgreSQL which really supports
UTF-8 and is a better database anyway (some special pages could be
implemented far more efficiently there).
Oh, it's not like that hasn't been suggested. If anybody knows how to go
about switching to Postgres, I sure as heck wouldn't object.
Unfortunately, I only have very limited hands-on experience with PostgreSQL
and no experience with its fulltext indexing. From the manuals I gather it
does have such a thing but it takes some configuring. On the up-side it
seems far more configurable and you can in fact change the parser that looks
for which words to index, how to collate, et cetera. It was written by
Russians, I believe, who are generally more aware of character set problems
anyway.
But the problem is that I don't have the time to research this; next month I
have a conference and I have two dead-lines for articles coming up.
-- Jan Hidders