On Sat, 2002-03-09 at 14:21, Lars Aronsson wrote:
Jan Hidders wrote:
We cannot index UTF-8.
We shouldn't. We should strip down to 7-bit U.S. ASCII before indexing. Searching for o should find any occurrence of ö, ó or ô. This works great for English, Swedish, Norwegian, Danish, Finnish, and German. I have successfully tried this on other websites before, but I cannot speak for other languages. Of course, the search expression must be stripped in the same way before the search is performed.
That's only relevant for accented Latin characters, obviously. Hebrew, Arabic, Cyrillic, Greek, Chinese and Japanese characters still need to be retained and searchable. (However we can similarly fold together cases and accents for Greek, perhaps final/medial forms for Greek, Hebrew, and Arabic, and possibly katakana/hiragana for Japanese.)
So yes, we need to index UTF-8 if we're using it.
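As a rough sketch of the folding discussed above (not code from this thread; `fold`, `build_index`, and `search` are hypothetical names), assuming Python and its standard `unicodedata` module: decompose each character, drop the combining marks, and case-fold, while leaving non-Latin letters in place. The same function is applied to the indexed text and to the search expression.

```python
import unicodedata
from collections import defaultdict


def fold(text):
    """Fold accents and case; non-Latin letters are kept, not discarded."""
    # NFD splits e.g. "ö" into "o" plus a combining diaeresis.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category Mn), keep everything else.
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # casefold() lowercases and also maps Greek final sigma to plain sigma.
    return stripped.casefold()


def build_index(docs):
    """Map each folded word to the set of document titles containing it."""
    index = defaultdict(set)
    for title, body in docs.items():
        for word in fold(body).split():
            index[word].add(title)
    return index


def search(index, query):
    """Fold the query the same way as the indexed text, then look it up."""
    return index.get(fold(query), set())


docs = {"a": "Göttingen is a city", "b": "Ελλάδα και Κύπρος"}
index = build_index(docs)
print(search(index, "Gottingen"))
print(search(index, "ελλαδα"))
```

This is only an approximation of the idea. Letters such as ø and æ have no decomposition and would need an explicit mapping table, and dropping every combining mark also merges pairs that may need to stay distinct, such as Cyrillic й/и and voiced/unvoiced kana, so a real index would probably apply the rule per script.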
Also, in the stripping down, any E following a vowel could be removed, to avoid the confusion between spellings like Gottingen, Goettingen, and Göttingen, or the name of the Danish poet Oehlenschläger.
This sort of search will yield a few hits too many, which is good. I'm not advocating soundex matching here, but soundex could be implemented in the same way.
I have no objection to the above. Would match potato/potatoe, too. :)
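The vowel-plus-E rule could be sketched roughly as follows, again assuming Python. The names `fold`, `drop_e_after_vowel`, and `search_key` are made up for illustration, and treating a, e, i, o, u, y as the vowel set is an assumption, not something specified in this thread.

```python
import re
import unicodedata


def fold(text):
    """Strip combining accents and fold case (same idea as accent folding)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return stripped.casefold()


def drop_e_after_vowel(word):
    """Remove every 'e' that directly follows a vowel in the folded word."""
    # The lookbehind checks the preceding letter without consuming it.
    return re.sub(r"(?<=[aeiouy])e", "", word)


def search_key(text):
    """Key used for both the indexed words and the search expression."""
    return drop_e_after_vowel(fold(text))


for spelling in ("Gottingen", "Goettingen", "Göttingen", "potato", "potatoe"):
    print(spelling, "->", search_key(spelling))
```

With this rule "poet" and "pot" end up with the same key, which is the "few hits too many" effect mentioned above.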
-- brion vibber (brion @ pobox.com)