I've made some fixes to the MySQL search backend for Chinese and for
languages that use script variants.
Some languages, such as Chinese and Japanese, don't put spaces between
words. To let the search index know where word boundaries are, we have
to internally insert spaces between some characters:
维基百科 -> 维 基 百 科
Then, to add insult to injury, we need to fudge the Unicode characters
to ensure indexing works reliably with both older and newer versions of
MySQL:
维 基 百 科 -> u8e7bbb4 u8e59fba u8e799be u8e7a791
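The two steps above can be sketched roughly like this in Python (the
real backend is PHP inside MediaWiki; this is just an illustration, and
the character range used here is a simplification). The "u8..." tokens
are "u8" followed by the hex of each character's UTF-8 bytes:

```python
import re

# Basic CJK Unified Ideographs range -- a simplification for illustration.
CJK = re.compile(r'[\u4e00-\u9fff]')

def encode_char(ch: str) -> str:
    # Armor a character as an ASCII token: "u8" + hex of its UTF-8 bytes.
    return 'u8' + ch.encode('utf-8').hex()

def normalize(text: str) -> str:
    # Insert spaces around each CJK character (word-boundary hints for
    # MySQL's tokenizer), armoring each one as it goes.
    spaced = CJK.sub(lambda m: ' ' + encode_char(m.group()) + ' ', text)
    return re.sub(r' +', ' ', spaced).strip()

print(normalize('维基百科'))
# -> u8e7bbb4 u8e59fba u8e799be u8e7a791
```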
For a long time, this word segmentation wasn't being handled correctly
for Chinese in our default MySQL search backend, so searching for a
multi-character word often gave false matches where all the characters
were present, but not adjacent.
This should now be fixed in r52338: the intermediate query
representation passed to the search backend internally treats your
multi-character Chinese input as a phrase, which will only match actual
adjacent characters:
维基百科 -> +"u8e7bbb4 u8e59fba u8e799be u8e7a791"
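Building that query form can be sketched as follows (again hypothetical
Python rather than the actual PHP; in MySQL boolean-mode syntax, a
leading + means "required" and double quotes mean "these tokens, in this
exact order"):

```python
def phrase_query(text: str) -> str:
    # Armor each character as "u8" + hex of its UTF-8 bytes, then wrap
    # the tokens in a required ('+') quoted phrase so only adjacent
    # occurrences match.
    tokens = ['u8' + ch.encode('utf-8').hex() for ch in text]
    return '+"' + ' '.join(tokens) + '"'

print(phrase_query('维基百科'))
# -> +"u8e7bbb4 u8e59fba u8e799be u8e7a791"
```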
Variants for languages such as Serbian are also now grouped with
parentheses internally, so they should match more usefully.
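The grouping presumably looks something like MySQL boolean-mode OR
groups; here's a hypothetical sketch (the variant spellings are my own
illustrative examples, not taken from the actual variant converter):

```python
def variant_group(variants: list[str]) -> str:
    # In MySQL boolean mode, +(a b) means "at least one of a or b must
    # match", so any one variant spelling satisfies the required group.
    return '+(' + ' '.join(variants) + ')'

# E.g. a Serbian word in its Latin and Cyrillic forms (illustrative only):
print(variant_group(['crkva', 'црква']))
# -> +(crkva црква)
```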
Note that Wikimedia’s sites such as Wikipedia run on a fancier, but more
demanding, search backend with a separate Java-based engine built around
Apache Lucene. Sometimes we have to remind ourselves that third-party
users will mostly be using the MySQL-based default, and oh boy it still
needs some lovin’! :)
-- brion