Jan Hidders wrote:
We cannot index UTF-8.
We shouldn't. We should strip the text down to 7-bit US-ASCII before indexing, so that a search for o finds any occurrence of ö, ó or ô. This works great for English, Swedish, Norwegian, Danish, Finnish, and German. I have successfully tried this on other websites before, but I cannot speak for other languages. Of course, the search expression must be stripped in the same way before the search is performed.
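To make the idea concrete, here is a minimal sketch in Python of what such stripping could look like. The function name fold_to_ascii, the lowercasing step, and the small table for letters that have no accent decomposition (ø, æ, ß) are my own assumptions for illustration, not anything that exists in the software today.

```python
import unicodedata

# Assumption: letters without a Unicode decomposition get an explicit
# ASCII stand-in, otherwise they would simply be dropped below.
EXTRA = str.maketrans({"ø": "o", "Ø": "O", "æ": "ae", "Æ": "AE", "ß": "ss"})

def fold_to_ascii(text: str) -> str:
    """Hypothetical helper: reduce text to lowercase 7-bit ASCII."""
    text = text.translate(EXTRA)
    # NFKD splits "ö" into "o" plus a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding to ASCII with "ignore" drops the combining marks
    # (and any other character that has no ASCII form).
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Assumption: the search should also be case-insensitive.
    return ascii_only.lower()

# The same function is applied to the indexed text and to the query.
indexed = fold_to_ascii("Göttingen")
query = fold_to_ascii("gottingen")
print(indexed == query)
```

Characters outside the Latin alphabet are silently discarded by this sketch, which fits the caveat that I cannot speak for other languages.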
Also, when stripping, any E following a vowel could be removed, to avoid confusion between spellings like Gottingen, Goettingen, and Göttingen, or the name of the Danish poet Oehlenschläger.
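The E rule could be sketched like this, again in Python. The name search_key is hypothetical, and treating only a, e, i, o and u as vowels (leaving out y) is an assumption on my part.

```python
import re
import unicodedata

def search_key(text: str) -> str:
    """Hypothetical helper: ASCII-fold, then drop any e that follows a vowel."""
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii").lower()
    # The lookbehind matches an "e" only when the preceding character
    # is a vowel; that "e" is then removed.
    return re.sub(r"(?<=[aeiou])e", "", ascii_only)

# All three spellings collapse to the same key.
keys = {search_key(s) for s in ("Gottingen", "Goettingen", "Göttingen")}
print(keys)
print(search_key("Oehlenschläger"))
```

The rule is deliberately crude: search_key("poet") gives "pot", so an English query for "pot" would also match "poet".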
This sort of search will yield a few hits too many, which is good: better that than missing a relevant page. I'm not advocating Soundex matching here, but Soundex could be implemented in the same way.