I've made some fixes to the MySQL search backend for Chinese and other languages using variants.
Some languages, like Chinese and Japanese, don’t put spaces between words. To let the search index know where word boundaries are, we have to internally insert spaces between some characters:
维基百科 -> 维 基 百 科
Then to add insult to injury, we need to fudge the Unicode characters to ensure things work reliably with older and newer versions of MySQL:
维 基 百 科 -> u8e7bbb4 u8e59fba u8e799be u8e7a791
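For the curious, here is a rough Python sketch of those two steps. It is not the actual MediaWiki code: the function names are made up for illustration, the CJK check only covers the basic CJK Unified Ideographs block, and the "u8 plus hex of the UTF-8 bytes" rule is inferred from the example above.

```python
def is_cjk(ch: str) -> bool:
    # Assumption: only the basic CJK Unified Ideographs block is checked.
    return "\u4e00" <= ch <= "\u9fff"


def segment(text: str) -> str:
    # Pad each CJK character with spaces, then collapse repeated whitespace.
    padded = "".join(f" {ch} " if is_cjk(ch) else ch for ch in text)
    return " ".join(padded.split())


def encode_token(token: str) -> str:
    # Leave plain ASCII tokens alone; turn anything else into
    # "u8" followed by the hex of its UTF-8 bytes (inferred rule).
    if token.isascii():
        return token
    return "u8" + token.encode("utf-8").hex()


def encode_for_index(text: str) -> str:
    # Segment first, then encode every resulting token.
    return " ".join(encode_token(t) for t in segment(text).split())


print(segment("维基百科"))
print(encode_for_index("维基百科"))
```

The result is pure ASCII, which is the point of the fudge: the index no longer depends on how a given MySQL version handles multi-byte characters.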
For a long time, this word segmentation wasn’t being handled correctly for Chinese in our default MySQL search backend, so searching for a multi-character word often gave false matches where the characters were all present, but not together.
This should now be fixed in r52338: the intermediate query representation passed to the search backend internally treats your multi-character Chinese input as a phrase, which will only match actual adjacent characters:
维基百科 -> +"u8e7bbb4 u8e59fba u8e799be u8e7a791"
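To make the difference concrete, here is a small Python sketch of the idea. The helper names are hypothetical and this is not the code from r52338; it just builds the required-phrase term from already-encoded tokens and compares "all tokens present" with "tokens adjacent and in order".

```python
def phrase_term(tokens: list[str]) -> str:
    # A single token becomes a required term; several tokens become
    # a required quoted phrase (MySQL boolean-mode syntax).
    if not tokens:
        raise ValueError("need at least one token")
    if len(tokens) == 1:
        return "+" + tokens[0]
    return '+"' + " ".join(tokens) + '"'


def contains_all(indexed: str, tokens: list[str]) -> bool:
    # Old behaviour, roughly: every token appears somewhere in the text.
    words = indexed.split()
    return all(t in words for t in tokens)


def contains_phrase(indexed: str, tokens: list[str]) -> bool:
    # New behaviour, roughly: the tokens appear adjacent and in order.
    words = indexed.split()
    n = len(tokens)
    return any(words[i:i + n] == tokens for i in range(len(words) - n + 1))


tokens = ["u8e7bbb4", "u8e59fba", "u8e799be", "u8e7a791"]
print(phrase_term(tokens))

# Same characters in a different order (shown unencoded for readability).
scrambled = "科 百 基 维"
print(contains_all(scrambled, ["维", "基", "百", "科"]))
print(contains_phrase(scrambled, ["维", "基", "百", "科"]))
```

The scrambled text passes the "all present" check but fails the phrase check, which is exactly the kind of false match the fix is meant to get rid of.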
Variants for e.g. Serbian are now also grouped with parentheses internally, so they should match more usefully.
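As a guess at what that grouping looks like, here is a tiny Python sketch. The exact form the backend generates isn't shown here, so treat the function name and the placeholder variant strings as assumptions; the only thing it relies on is that MySQL boolean mode lets you parenthesize a group of terms and require the group as a whole.

```python
def any_variant_term(variants: list[str]) -> str:
    # One variant is just a required term; several variants are wrapped
    # in a required parenthesized group, so any one of them can satisfy it.
    if not variants:
        raise ValueError("need at least one variant")
    if len(variants) == 1:
        return "+" + variants[0]
    return "+(" + " ".join(variants) + ")"


# Placeholder strings standing in for the encoded variant spellings.
print(any_variant_term(["variant_a", "variant_b"]))
```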
Note that Wikimedia’s sites such as Wikipedia run on a fancier, but more demanding, search backend with a separate Java-based engine built around Apache Lucene. Sometimes we have to remind ourselves that third-party users will mostly be using the MySQL-based default, and oh boy it still needs some lovin’! :)
-- brion
wikitech-l@lists.wikimedia.org