I hacked up the fulltext search/index code a bit to work on UTF-8 despite MySQL's lack of direct support: a Language::stripForSearch() function is called to do any necessary mangling of character sets before we store the indexable version of the text.
For Esperanto, Polish, Russian, Czech and Korean I set it to just fold the text to lowercase (so search is case insensitive) and then convert all UTF-8 sequences into hex strings which MySQL won't mistreat.
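A rough sketch of that fold-and-hex step, in Python rather than the wiki's own code. The function name `fold_and_hex` and the `U8` marker prefix are assumptions made for illustration, not necessarily what Language::stripForSearch() does:

```python
import re

# Hypothetical marker put in front of each hex-encoded character so the
# encoded form is unlikely to collide with an ordinary ASCII word.
PREFIX = "U8"

def fold_and_hex(text: str) -> str:
    """Lowercase the text, then replace every non-ASCII character
    with PREFIX plus the hex digits of its UTF-8 bytes."""
    lowered = text.lower()

    def to_hex(match: re.Match) -> str:
        return PREFIX + match.group(0).encode("utf-8").hex()

    # ASCII passes through untouched; everything else becomes plain
    # ASCII letters and digits that a byte-oriented indexer can store.
    return re.sub(r"[^\x00-\x7f]", to_hex, lowered)
```

The same function would have to be applied to the search terms, so that query and index agree on the encoded form.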
For Chinese and Japanese, things are a bit more complicated, as there is no word spacing in the original text but the fulltext search works on words. For Chinese I just set it to put spaces around every character; it needs a lot of tweaking, but it sort of works. If you search for a single character it works great, but multi-character sequences don't behave as expected.
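The Chinese case might look roughly like this. It is only a sketch: `space_out_han` is a made-up name, and the character range covers just the basic CJK Unified Ideographs block, which is an assumption about what counts as "every character":

```python
import re

# Basic CJK Unified Ideographs block only (an assumption; extension
# blocks and compatibility ideographs are not covered).
HAN_CHAR = re.compile(r"([\u4e00-\u9fff])")

def space_out_han(text: str) -> str:
    """Put spaces around every Han character so that each one is
    seen as a separate word by a word-based indexer."""
    spaced = HAN_CHAR.sub(r" \1 ", text)
    # Collapse the doubled spaces the substitution leaves behind.
    return " ".join(spaced.split())
```

A multi-character query run through the same function also comes out as a series of single-character words.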
For Japanese, I have it divide up the text at boundaries around chunks of the same type of character (hiragana, katakana, or kanji), which gives a pretty good first approximation of dividing at the right place. It could probably use some more work as well. When searching for a word or short phrase that divides across character types (e.g., 'furansugo', which mixes katakana and kanji), results may not be as expected.
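Here is a minimal sketch of splitting at script boundaries. The name `split_by_script` and the exact Unicode ranges are assumptions for illustration:

```python
import re

# One alternative per script; each matches a maximal run of that script.
# Ranges are the Hiragana, Katakana and basic CJK Unified Ideographs blocks.
SCRIPT_RUN = re.compile(
    r"[\u3040-\u309f]+"   # hiragana
    r"|[\u30a0-\u30ff]+"  # katakana (includes the long-vowel mark)
    r"|[\u4e00-\u9fff]+"  # kanji
)

def split_by_script(text: str) -> str:
    """Surround each same-script run with spaces, so that a break falls
    wherever the script changes."""
    spaced = SCRIPT_RUN.sub(lambda m: f" {m.group(0)} ", text)
    return " ".join(spaced.split())
```

With this sketch, 'furansugo' (フランス語) comes out as two words, the katakana part and the kanji part, which is the cross-type case mentioned above.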
-- brion vibber (brion @ pobox.com)
On Saturday 23 November 2002 07:22, Brion Vibber wrote:
> For Japanese, I have it divide up the text at boundaries around chunks of the same type of character (hiragana, katakana, or kanji), which gives a pretty good first approximation of dividing at the right place. It could probably use some more work as well. When searching for a word or short phrase that divides across character types (e.g., 'furansugo', which mixes katakana and kanji), results may not be as expected.
AFAIK (which isn't much, I know hardly any Japanese), Japanese desinences (inflectional endings) are written in hiragana and the rest of the word is written in kanji (if the word is of Chinese or Japanese origin) or katakana (if it comes from anywhere else). So how about splitting wherever a hiragana character is followed by a katakana or kanji character? But when a word written in kanji is followed by another word written in kanji, neither algorithm will know where to split it.
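For what it's worth, the rule I'm suggesting could be sketched like this. The function name and the character ranges are my own assumptions, nothing taken from the wiki code:

```python
import re

# Zero-width boundary: the previous character is hiragana and the next
# one is katakana or kanji (basic Unicode blocks only).
BOUNDARY = re.compile(r"(?<=[\u3040-\u309f])(?=[\u30a0-\u30ff\u4e00-\u9fff])")

def split_after_hiragana(text: str) -> str:
    """Insert a space wherever hiragana is followed by katakana or kanji."""
    return BOUNDARY.sub(" ", text)
```

This keeps フランス語 in one piece, since no hiragana comes before the kanji, but a run of kanji such as 東京大学 is left unsplit as well, which is the limitation noted above.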
phma
wikitech-l@lists.wikimedia.org