Jan Hidders wrote:
We cannot index UTF-8.
We shouldn't. We should strip down to 7-bit U.S. ASCII before
indexing. Searching for o should find any occurrence of ö, ó or ô.
This works great for English, Swedish, Norwegian, Danish, Finnish, and
German. I have successfully tried this on other websites before, but
I cannot speak for other languages. Of course, the search expression
must be stripped in the same way before the search is performed.
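
A minimal sketch of the stripping in Python, under a few assumptions:
the name fold_ascii and the small EXTRA table are made up for this
example, and the mappings for ø, æ and ß are one possible choice (those
letters have no Unicode decomposition, so they need a hand-written
rule). The same function is applied to the indexed text and to the
search expression.

```python
import unicodedata

# Hypothetical hand-written rules for letters that NFKD cannot decompose.
EXTRA = str.maketrans({"ø": "o", "Ø": "O", "æ": "ae", "Æ": "AE", "ß": "ss"})


def fold_ascii(text: str) -> str:
    """Strip text down to 7-bit ASCII by dropping diacritics."""
    text = text.translate(EXTRA)
    # NFKD splits 'ö' into 'o' plus a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding to ASCII with "ignore" drops the combining marks
    # (and any other character that has no ASCII form).
    return decomposed.encode("ascii", "ignore").decode("ascii")


def matches(query: str, indexed_text: str) -> bool:
    """Compare the stripped query against the stripped text."""
    return fold_ascii(query) in fold_ascii(indexed_text)
```

With this, matches("o", "ö ó ô") and matches("Gottingen", "Universität
Göttingen") are both true.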
Also, in the stripping down, any E following a vowel could be removed,
to avoid the confusion between spellings like Gottingen, Goettingen,
and Göttingen, and that Danish poet Oehlenschläger.
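
The E rule could be sketched like this in Python. The names
strip_accents and search_key are made up for the example, and treating
y as a vowel is an assumption on my part. Note that the rule is
deliberately crude: it also turns "poet" into "pot".

```python
import re
import unicodedata

# An 'e' (either case) that directly follows a vowel.
# Counting 'y' as a vowel is an assumption for this sketch.
E_AFTER_VOWEL = re.compile(r"(?<=[aeiouy])e", re.IGNORECASE)


def strip_accents(text: str) -> str:
    """Drop diacritics: decompose, then discard non-ASCII characters."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def search_key(text: str) -> str:
    """Strip accents, then remove every E that follows a vowel."""
    return E_AFTER_VOWEL.sub("", strip_accents(text))
```

Here search_key gives "Gottingen" for all of Gottingen, Goettingen and
Göttingen, and "Ohlenschlager" for Oehlenschläger.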
This sort of search will yield a few hits too many, which is good.
I'm not advocating soundex matching here, but soundex could be
implemented in the same way.
--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik
Teknikringen 1e, SE-583 30 Linuxköping, Sweden
tel +46-70-7891609
http://aronsson.se/ http://elektrosmog.nu/ http://susning.nu/