I hacked up the fulltext search/index code a bit to work on UTF-8 despite MySQL's lack of direct support: a Language::stripForSearch() function is called to do any necessary mangling of character sets before we store the indexable version of the text.
For Esperanto, Polish, Russian, Czech and Korean I set it to just fold the text to lowercase (so search is case-insensitive) and then convert all UTF-8 sequences into hex strings, which MySQL won't mistreat.
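A rough sketch of that fold-and-hexify step (the function and the 'u8' prefix here are my own illustration, not the actual MediaWiki code):

```python
import re

def strip_for_search(text: str) -> str:
    """Lowercase the text, then replace each non-ASCII character's
    UTF-8 byte sequence with a plain-ASCII hex token so MySQL's
    fulltext indexer treats it as an ordinary 'word' character."""
    def hexify(match: re.Match) -> str:
        # One non-ASCII character -> 'u8' + its UTF-8 bytes in hex.
        return 'u8' + match.group(0).encode('utf-8').hex()
    return re.sub(r'[^\x00-\x7f]', hexify, text.lower())

print(strip_for_search('Ĉu vi parolas Esperanton?'))
```

Queries get run through the same mangling, so the hex tokens on both sides match up even though MySQL never sees the raw multibyte text.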
For Chinese and Japanese, things are a bit more complicated: there is no word spacing in the original text, but the fulltext search works on words. For Chinese I just set it to put spaces around every character; searching a single character works great, but multi-character sequences don't behave as expected, so it still needs a lot of tweaking.
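The per-character spacing for Chinese amounts to something like this (a minimal sketch; the range check is simplified to the main CJK Unified Ideographs block):

```python
def space_cjk(text: str) -> str:
    """Surround every CJK ideograph with spaces so each character
    indexes as its own 'word', then collapse the doubled spaces."""
    out = []
    for ch in text:
        if '\u4e00' <= ch <= '\u9fff':   # CJK Unified Ideographs
            out.append(' ' + ch + ' ')
        else:
            out.append(ch)
    return ' '.join(''.join(out).split())

print(space_cjk('中文维基百科'))
```

This is why single-character queries behave well: each character is its own index term, while a multi-character query depends on MySQL matching the terms in sequence.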
For Japanese, I have it divide up the text at boundaries between runs of the same type of character (hiragana, katakana, or kanji), which does a pretty good first approximation of dividing at the right places. It could probably use some more work as well: when searching a word or short phrase that divides across character types (e.g., 'furansugo', which mixes katakana and kanji), results may not be as expected.
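The script-boundary segmentation can be sketched like so (names and ranges are my own illustration; note how フランス語 'furansugo' splits into a katakana chunk plus a kanji chunk, which is exactly the failure mode above):

```python
def char_class(ch: str) -> str:
    """Classify a character by Japanese script type."""
    if '\u3040' <= ch <= '\u309f':
        return 'hiragana'
    if '\u30a0' <= ch <= '\u30ff':
        return 'katakana'
    if '\u4e00' <= ch <= '\u9fff':
        return 'kanji'
    return 'other'

def segment_japanese(text: str) -> list:
    """Break text into chunks wherever the script type changes."""
    chunks = []
    for ch in text:
        if chunks and char_class(chunks[-1][-1]) == char_class(ch):
            chunks[-1] += ch      # same script: extend current chunk
        else:
            chunks.append(ch)     # script changed: start a new chunk
    return chunks

print(segment_japanese('フランス語を話します'))
```

Since most Japanese words are a run of a single script, this gets the common case right; mixed-script compounds end up as two index terms that the query then has to match adjacently.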
-- brion vibber (brion @ pobox.com)
wikitech-l@lists.wikimedia.org