On Saturday 23 November 2002 07:22, Brion Vibber wrote:
For Japanese, I have it divide up the text at boundaries around chunks of the same type of character (hiragana, katakana, or kanji), which does a pretty good first approximation of dividing at the right place. It could probably use some more work as well. When searching for a word or short phrase that divides across character types (i.e., 'furansugo', which mixes katakana and kanji), results may not be as expected.
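To make the approach described above concrete, here is a minimal sketch of splitting at boundaries between runs of the same character type. It is not the actual search code; the names `script_of` and `split_by_script` are hypothetical, and the character classes are assumed to be the basic Unicode blocks (Hiragana U+3040-U+309F, Katakana U+30A0-U+30FF, CJK Unified Ideographs U+4E00-U+9FFF), ignoring extension blocks and half-width forms.

```python
from itertools import groupby


def script_of(ch):
    """Classify one character by Unicode block (assumed ranges, basic blocks only)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"
    return "other"


def split_by_script(text):
    """Split text into maximal runs of characters sharing the same script."""
    return ["".join(run) for _, run in groupby(text, key=script_of)]


if __name__ == "__main__":
    # 'furansugo' is katakana followed by one kanji, so it is cut in two.
    print(split_by_script("フランス語"))
```

With this sketch, 'furansugo' (フランス語) comes out as two chunks, which is the cross-type case where search results may not be as expected.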
AFAIK (which isn't much; I know hardly any Japanese), Japanese desinences are written in hiragana, and the rest of the word is written in kanji (if the word is of Chinese or Japanese origin) or katakana (if it comes from anywhere else). So how about splitting wherever a hiragana character is followed by a katakana or kanji character? But when a word written in kanji is followed by another word written in kanji, neither algorithm will know where to split them.
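A rough sketch of that rule, to show what it would and would not catch. The names `script_of` and `split_after_hiragana` are made up for illustration, and the character classes are assumed to be the basic Unicode blocks only (Hiragana U+3040-U+309F, Katakana U+30A0-U+30FF, CJK Unified Ideographs U+4E00-U+9FFF).

```python
def script_of(ch):
    """Classify one character by Unicode block (assumed ranges, basic blocks only)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"
    return "other"


def split_after_hiragana(text):
    """Split only where a hiragana character is followed by katakana or kanji."""
    chunks = []
    start = 0
    for i in range(1, len(text)):
        prev_script = script_of(text[i - 1])
        next_script = script_of(text[i])
        if prev_script == "hiragana" and next_script in ("katakana", "kanji"):
            chunks.append(text[start:i])
            start = i
    if text[start:]:
        chunks.append(text[start:])
    return chunks


if __name__ == "__main__":
    # Katakana + kanji stays together; the cut falls after the hiragana particle.
    print(split_after_hiragana("フランス語を話す"))
    # Two kanji words in a row: no hiragana between them, so no cut is made.
    print(split_after_hiragana("日本語学校"))
```

Under this rule 'furansugo' stays in one piece, but a string of consecutive kanji words is still left unsplit, which is the limitation mentioned above.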
phma