I hacked up the fulltext search/index code a bit to work on UTF-8
despite MySQL's lack of direct support: a Language::stripForSearch()
function is called to do any necessary mangling of character sets before
we store the indexable version of the text.
For Esperanto, Polish, Russian, Czech, and Korean I set it to just fold
the text to lowercase (so search is case-insensitive) and then convert
all UTF-8 sequences into hex strings, which MySQL won't mistreat.
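The idea can be sketched roughly like this (in Python rather than MediaWiki's actual PHP, and with a made-up function name; the real stripForSearch() may differ in details such as prefixing or delimiting the hex):

```python
def strip_for_search(text: str) -> str:
    """Fold case, then hex-encode non-ASCII characters so a
    bytes-oriented fulltext index sees only safe ASCII 'words'.
    (A sketch of the approach, not the actual MediaWiki code.)"""
    out = []
    for ch in text.lower():
        if ord(ch) < 128:
            out.append(ch)
        else:
            # Replace the character with the hex of its UTF-8 bytes,
            # e.g. 'ĉ' (0xC4 0x89) becomes 'c489'.
            out.append(ch.encode("utf-8").hex())
    return "".join(out)
```

Since the hex digits are plain ASCII letters and numbers, MySQL's fulltext tokenizer indexes them like any other word; the same transformation is applied to the query at search time, so matching still works.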
For Chinese and Japanese, things are a bit more complicated, as there is
no word spacing in the original text but the fulltext search works on
words. For Chinese I just set it to put spaces around every character;
it needs a lot of tweaking, but it sort of works: searching a single
character works great, but multi-character sequences don't behave as
expected.
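The per-character spacing amounts to something like this Python sketch (again, the real code is PHP, and the real character test is presumably broader than the single Unicode block checked here):

```python
def space_cjk(text: str) -> str:
    """Put spaces around every CJK ideograph so each character
    becomes its own 'word' for the fulltext indexer.
    Sketch only: covers just the basic CJK Unified Ideographs block."""
    out = []
    for ch in text:
        if 0x4E00 <= ord(ch) <= 0x9FFF:
            out.append(f" {ch} ")
        else:
            out.append(ch)
    # Collapse the doubled spaces between adjacent ideographs.
    return " ".join("".join(out).split())
```

This explains the observed behavior: each character is its own index term, so a single-character query matches directly, while a multi-character query is treated as several independent words rather than one phrase.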
For Japanese, I have it divide up the text at boundaries around chunks
of the same type of character (hiragana, katakana, or kanji), which does
a pretty good first approximation of dividing at the right place. It
could probably use some more work as well: when searching a word or
short phrase that divides across character types (e.g. 'furansugo',
which mixes katakana and kanji), results may not be as expected.
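The boundary-splitting heuristic might look like this Python sketch (the classification ranges are simplified, ignoring half-width katakana and the rarer extension blocks):

```python
from itertools import groupby

def script_class(ch: str) -> str:
    """Classify a character as hiragana, katakana, kanji, or other.
    Simplified ranges; a sketch of the idea, not the actual code."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"
    return "other"

def segment_japanese(text: str) -> str:
    """Insert a space wherever the script class changes, so each run
    of same-type characters becomes one indexable 'word'."""
    return " ".join(
        "".join(run) for _, run in groupby(text, key=script_class)
    )
```

For example, フランス語 ('furansugo', French language) splits into a katakana run フランス and the kanji 語, so the whole word never appears in the index as a single term, which is exactly why such mixed-script queries misbehave.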
-- brion vibber (brion @ pobox.com)