We recently got a suggestion via Phabricator[1] to automatically map between hiragana and katakana when searching on English Wikipedia and other wiki projects. As an always-on feature, this isn't difficult to implement, but major commercial search engines (Google.jp, Bing, Yahoo Japan, DuckDuckGo, Goo) don't do that. They give different results when searching for hiragana/katakana forms (for example, オオカミ/おおかみ "wolf"). They also give different *numbers* of results, seeming to indicate that it's not just re-ordering the same results (say, so that results in the same script are ranked higher).[2] I want to know what they know that I don't!

Does anyone have any thoughts on whether this would be useful (seems that it would) and whether it would cause any problems (it must, or otherwise all the other search engines would do it, right?).

Any idea why it might be different between a Japanese-language wiki and a non-Japanese-language wiki? We often are more aggressive in matching between characters that are not native to a given language--for example, accents on Latin characters are generally ignored on English-language wikis. So it might make sense to merge hiragana and katakana on English-language wikis but not Japanese-language wikis.

Thanks very much for any suggestions or information!

—Trey

[1] https://phabricator.wikimedia.org/T176197

[2] Details of my tests at https://phabricator.wikimedia.org/T173650#3580309

Trey Jones

Sr. Software Engineer, Search Platform
Wikimedia Foundation