I've made some fixes to the MySQL search backend for Chinese and for
languages that use script variants.
Some languages, such as Chinese and Japanese, don't put spaces between
words. To let the search index know where word boundaries are, we have
to internally insert spaces between some characters:
维基百科 -> 维 基 百 科
Then, to add insult to injury, we need to fudge the Unicode characters
to ensure indexing works reliably with both older and newer versions of
MySQL:
维 基 百 科 -> u8e7bbb4 u8e59fba u8e799be u8e7a791
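The two steps above can be sketched roughly like this in Python (the
real backend is PHP inside MediaWiki; this is just an illustration, and
the character range used here is a simplification). The "u8..." tokens
are "u8" followed by the hex of each character's UTF-8 bytes:

```python
import re

# Basic CJK Unified Ideographs range -- a simplification for illustration.
CJK = re.compile(r'[\u4e00-\u9fff]')

def encode_char(ch: str) -> str:
    # Armor a character as an ASCII token: "u8" + hex of its UTF-8 bytes.
    return 'u8' + ch.encode('utf-8').hex()

def normalize(text: str) -> str:
    # Insert spaces around each CJK character (word-boundary hints for
    # MySQL's tokenizer), armoring each one as it goes.
    spaced = CJK.sub(lambda m: ' ' + encode_char(m.group()) + ' ', text)
    return re.sub(r' +', ' ', spaced).strip()

print(normalize('维基百科'))
# -> u8e7bbb4 u8e59fba u8e799be u8e7a791
```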
For a long time, this word segmentation wasn't being handled correctly
for Chinese in our default MySQL search backend, so searching for a
multi-character word often gave false matches where all the characters
were present, but not adjacent.
This should now be fixed in r52338: the intermediate query
representation passed to the search backend internally treats your
multi-character Chinese input as a phrase, which will only match actual
adjacent characters:
维基百科 -> +"u8e7bbb4 u8e59fba u8e799be u8e7a791"
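Building that query form can be sketched as follows (again hypothetical
Python rather than the actual PHP; in MySQL boolean-mode syntax, a
leading + means "required" and double quotes mean "these tokens, in this
exact order"):

```python
def phrase_query(text: str) -> str:
    # Armor each character as "u8" + hex of its UTF-8 bytes, then wrap
    # the tokens in a required ('+') quoted phrase so only adjacent
    # occurrences match.
    tokens = ['u8' + ch.encode('utf-8').hex() for ch in text]
    return '+"' + ' '.join(tokens) + '"'

print(phrase_query('维基百科'))
# -> +"u8e7bbb4 u8e59fba u8e799be u8e7a791"
```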
Variants for languages such as Serbian are also now grouped with
parentheses internally, so they should match more usefully.
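The grouping presumably looks something like MySQL boolean-mode OR
groups; here's a hypothetical sketch (the variant spellings are my own
illustrative examples, not taken from the actual variant converter):

```python
def variant_group(variants: list[str]) -> str:
    # In MySQL boolean mode, +(a b) means "at least one of a or b must
    # match", so any one variant spelling satisfies the required group.
    return '+(' + ' '.join(variants) + ')'

# E.g. a Serbian word in its Latin and Cyrillic forms (illustrative only):
print(variant_group(['crkva', 'црква']))
# -> +(crkva црква)
```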
Note that Wikimedia’s sites such as Wikipedia run on a fancier, but more
demanding, search backend with a separate Java-based engine built around
Apache Lucene. Sometimes we have to remind ourselves that third-party
users will mostly be using the MySQL-based default, and oh boy it still
needs some lovin’! :)
-- brion