Hello.
Can you explain to me in a few words how the Wiki engine performs full-text search on UTF-8 encoded articles?
This is a very important problem for me. I have a database in UTF-8. MySQL versions prior to 4.1 don't support full-text search on UTF-8 text, and only an alpha version of MySQL 4.1 is available at the moment, so I don't want to install it.
I tried to look for the answer in the Wiki sources. But I realized that this would take a rather long time. The only thing I understood is that search keys are somehow stored in the table 'searchindex'.
So can anyone tell me the basic idea of how the Wiki performs full-text search?
Thanks for your time.
Best regards, Alexander Prudnikov.
The handling depends on the language. The basic UTF-8 handling is to convert to lower case using an internal table, then to encode any non-ASCII characters as hexadecimal using bin2hex(). The Chinese and Japanese language files have special routines to insert spaces into strings, since MySQL uses a word search and those languages don't usually use spaces.
The relevant functions are doUpdate() in includes/SearchUpdate.php and stripForSearch() in languages/LanguageUtf8.php.
-- Tim Starling
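For readers who want to see the idea in code, here is a rough sketch of the approach described above, written in Python rather than PHP. It is not the actual MediaWiki code: the function names strip_for_search and segment_cjk are made up for illustration, Python's built-in lower() stands in for the internal lower-case table, bytes.hex() stands in for bin2hex(), and the CJK character ranges are a simplifying assumption. The point is that after the transformation every word is plain ASCII, so a full-text index that does not understand UTF-8 can still match it.

```python
import re

# Assumption: a rough set of Japanese kana and CJK ideograph ranges,
# chosen only for illustration.
CJK_CHAR = re.compile(r"([\u3040-\u30ff\u4e00-\u9fff])")
NON_ASCII_RUN = re.compile(r"[^\x00-\x7f]+")


def segment_cjk(text: str) -> str:
    """Put spaces around each CJK character so a word-based index
    treats every character as its own word."""
    return CJK_CHAR.sub(r" \1 ", text)


def strip_for_search(text: str) -> str:
    """Lower-case the text, then replace every run of non-ASCII
    characters with the hexadecimal form of its UTF-8 bytes."""
    lowered = text.lower()  # stand-in for the internal case table
    return NON_ASCII_RUN.sub(
        lambda m: m.group(0).encode("utf-8").hex(),  # like bin2hex()
        lowered,
    )


if __name__ == "__main__":
    print(strip_for_search("Привет мир"))
    print(strip_for_search(segment_cjk("日本")).split())
```

The same transformation would have to be applied to the user's search terms before querying, so that the query and the stored index text are encoded the same way.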
Thanks a lot, Tim. I am glad to hear the answer so soon.
Best regards, Alexander Prudnikov.
wikitech-l@lists.wikimedia.org