On Saturday 23 November 2002 07:22, Brion Vibber wrote:
> For Japanese, I have it divide up the text at boundaries around chunks
> of the same type of character (hiragana, katakana, or kanji), which does
> a pretty good first approximation of dividing at the right place. It
> could probably use some more work as well. When searching a word/short
> phrase that divides across character types (ie, 'furansugo' which mixes
> katakana and kanji) results may not be as expected.

AFAIK (which isn't much; I know hardly any Japanese), Japanese desinences
(inflectional endings) are written in hiragana, and the rest of the word is
written in kanji (if the word is of Chinese or Japanese origin) or katakana
(if it comes from anywhere else). So how about splitting only where a
hiragana character is followed by a katakana or kanji character? But when a
word written in kanji is directly followed by another word written in kanji,
neither algorithm will know where to split it.
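
To make the comparison concrete, here is a rough Python sketch of both
ideas. It is only an illustration, not the code the search actually uses:
the function names are made up, and classifying characters by the basic
Unicode Hiragana, Katakana and CJK Unified Ideographs blocks is a
simplifying assumption (it ignores things like the iteration mark and
half-width katakana).

```python
def script_of(ch):
    """Classify one character by Unicode block (a simplifying assumption)."""
    cp = ord(ch)
    if 0x3041 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:   # includes the long-vowel mark U+30FC
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:   # CJK Unified Ideographs, basic block only
        return "kanji"
    return "other"


def split_by_script(text):
    """Split at every change of character type (hypothetical helper)."""
    chunks = []
    for ch in text:
        if chunks and script_of(chunks[-1][-1]) == script_of(ch):
            chunks[-1] += ch      # same type as previous character: extend
        else:
            chunks.append(ch)     # type changed (or first character): new chunk
    return chunks


def split_after_hiragana(text):
    """Split only where hiragana is followed by katakana or kanji
    (hypothetical helper)."""
    chunks = []
    for ch in text:
        boundary = (
            bool(chunks)
            and script_of(chunks[-1][-1]) == "hiragana"
            and script_of(ch) in ("katakana", "kanji")
        )
        if chunks and not boundary:
            chunks[-1] += ch
        else:
            chunks.append(ch)
    return chunks


phrase = "フランス語を話す"          # "furansugo o hanasu"
print(split_by_script(phrase))       # ['フランス', '語', 'を', '話', 'す']
print(split_after_hiragana(phrase))  # ['フランス語を', '話す']

# Two kanji words back to back: neither rule finds the boundary.
print(split_by_script("日本語学校"))       # ['日本語学校']
print(split_after_hiragana("日本語学校"))  # ['日本語学校']
```

With the second rule 'furansugo' stays in one piece, but the particle that
follows it stays attached as well, so a search would still need to match on
prefixes of a chunk rather than on whole chunks.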
phma