On Sat, 2002-03-09 at 18:50, Lars Aronsson wrote:
Brion L. VIBBER wrote:
That's only relevant for accented Latin characters, obviously. Hebrew, Arabic, Cyrillic, Greek, Chinese and Japanese characters still need to be retained and searchable.
Are we talking about Greek/Hebrew characters in the English/German Wikipedia now? I think users of the English/German Wikipedia won't have Greek/Hebrew keyboards,
Excepting Greeks and Israelis, obviously. ;)
so ASCII searching would do just fine.
But why bother creating a special separate ASCII-only search, when the non-Latin code is necessary for other languages and we're using a unified character set?
Why *shouldn't* I be able to search for the occasional Greek, Hebrew, or Japanese word in the original spelling on the English Wikipedia, if we allow people to enter them in the first place?
I have no idea how to implement search in the Greek/Hebrew Wikipedia.
As stated above: do whatever accent, case, or other equivalence conversion is necessary (exactly as you propose for Latin characters), then convert the text so that MySQL doesn't treat the bytes of non-ASCII UTF-8 characters as word separators. In an ideal world we'd just configure MySQL to understand UTF-8; otherwise, replacing the raw bytes with hex codes should work fine.
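
To make the hex-code idea concrete, here is a rough sketch in Python (not the language the wiki software is written in, and not existing code). The function names and the "u8" prefix are made up for illustration. The accent folding shown, dropping combining marks after NFD decomposition, is only one possible rule and would need per-language review, since it also strips things like Hebrew vowel points. The same conversion would be applied to both the indexed text and the search query, so the two match on plain ASCII tokens.

```python
import unicodedata


def fold(text: str) -> str:
    """Lowercase, then drop combining marks (assumed folding rule)."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def hex_escape(text: str, prefix: str = "u8") -> str:
    """Replace each non-ASCII character with prefix + hex of its UTF-8 bytes.

    The result contains only ASCII letters and digits in place of the
    original character, so an ASCII-only indexer sees one unbroken word.
    The "u8" prefix is a hypothetical marker, not an established convention.
    """
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)  # ASCII passes through unchanged
        else:
            out.append(prefix + ch.encode("utf-8").hex())
    return "".join(out)


def index_form(text: str) -> str:
    """Form to store in the search index and to apply to the user's query."""
    return hex_escape(fold(text))
```

With this, a search for "Café" and one for "cafe" reduce to the same token, while a Greek or Hebrew word becomes a longer ASCII-only token that still matches itself exactly.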
-- brion vibber (brion @ pobox.com)