SUMMARY: The Search Platform team (formerly part of Discovery) is planning
to fix a long-standing search bug on many wiki projects by disabling the
code in CirrusSearch that re-uses the “fallback” languages (which are
specified for user interface or system messages) for the language analysis
modules (which are used to index words in search). Deployment is planned to
start the week of October 9, 2017.
Messaging fallbacks specify what language to show a message in when there
is no message available in the language of a given wiki. A language
analysis module is language-specific software that processes text to
improve searching—so that, for example, searching for a given word will
find related forms of that word, like "hope, hopes, hoping, hoped" or
"resume, resumé, résumé" on English-language wikis.
Fallback languages for system messages make sense for historical and
cultural reasons—a reader of the Chechen Wikipedia is more likely to
understand a user interface or system message in Russian than in French,
Greek, Hindi, Italian, or Japanese—but the fallbacks don't necessarily make
any linguistic sense. Chechen and Russian, for example, are from unrelated
language families; while the languages have undoubtedly influenced one
another, their grammars are completely different.
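The difference between the two kinds of fallback can be sketched roughly as follows. This is a minimal illustration, not CirrusSearch's actual code or configuration: the tables, the analyzer names, and both lookup functions are hypothetical and only model the behavior described in this message.

```python
# Hypothetical data for illustration only; none of this is real
# CirrusSearch configuration.

# Messaging fallbacks: which language to show a UI message in when
# the wiki's own language has no translation.
MESSAGE_FALLBACKS = {"ce": ["ru"]}  # Chechen -> Russian

# Languages assumed to have a language-specific analysis module.
AVAILABLE_ANALYZERS = {"en", "ru"}

# Cross-language analysis exceptions, configured explicitly.
# Empty here: no exception is assumed for Chechen.
EXPLICIT_ANALYSIS_FALLBACKS = {}


def analyzer_reusing_message_fallbacks(lang):
    """Old behavior: borrow the analyzer of a messaging fallback language."""
    if lang in AVAILABLE_ANALYZERS:
        return lang
    for fallback in MESSAGE_FALLBACKS.get(lang, []):
        if fallback in AVAILABLE_ANALYZERS:
            return fallback
    return "default"


def analyzer_with_explicit_fallbacks(lang):
    """New behavior: only explicitly configured analysis fallbacks apply."""
    if lang in AVAILABLE_ANALYZERS:
        return lang
    fallback = EXPLICIT_ANALYSIS_FALLBACKS.get(lang)
    if fallback in AVAILABLE_ANALYZERS:
        return fallback
    return "default"


print(analyzer_reusing_message_fallbacks("ce"))  # Russian analysis applied to Chechen text
print(analyzer_with_explicit_fallbacks("ce"))    # language-neutral default analysis
```

In this sketch, the messaging table stays untouched, so interface messages on a Chechen wiki would still fall back to Russian; only the analysis lookup stops consulting it.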
We will deploy the software change that disables using messaging fallbacks
for language analysis fallbacks in about two weeks (targeting the week of
October 9, 2017), with any cross-language analysis exceptions explicitly
configured in a new manner. Changes will not immediately happen to all
affected wikis because each wiki in each language will need to be
re-indexed, which is a separate process that takes time. There may also be
other delays caused by Elasticsearch upgrades or other changes that need
You can also track progress of the tasks on Phabricator or read more,
see examples, and get the full list of languages affected on MediaWiki.
Sr. Software Engineer, Search Platform
TL;DR: Our recent Elasticsearch cluster restart did not go as planned.
Most important lesson learned: we did not understand the recovery
Yesterday, we did a cold restart of the Elasticsearch / cirrus eqiad
cluster. This restart did not go as planned. It did not generate any
user-facing impact, since we moved all the traffic to codfw before the
restart. It did impact logstash (more on that in a different report).
Operations Engineer, Discovery
UTC+2 / CEST
We recently got a suggestion via Phabricator to automatically map
between hiragana and katakana when searching on English Wikipedia and other
wiki projects. As an always-on feature, this isn't difficult to implement,
but major commercial search engines (Google.jp, Bing, Yahoo Japan,
DuckDuckGo, Goo) don't do that. They give different results when searching
for hiragana/katakana forms (for example, オオカミ/おおかみ "wolf"). They also give
different *numbers* of results, seeming to indicate that it's not just
re-ordering the same results (say, so that results in the same script are
ranked higher). I want to know what they know that I don't!
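To make the suggestion concrete, here is a minimal sketch of what an always-on hiragana/katakana mapping could look like. It relies on the fact that the Unicode hiragana block (U+3041 to U+3096) and the corresponding katakana range (U+30A1 to U+30F6) are laid out in parallel, 0x60 code points apart. The function names are made up for illustration, and this is not how CirrusSearch or Elasticsearch would necessarily implement it.

```python
# Offset between parallel hiragana and katakana code points.
KANA_OFFSET = 0x60
HIRAGANA_FIRST, HIRAGANA_LAST = 0x3041, 0x3096
KATAKANA_FIRST, KATAKANA_LAST = 0x30A1, 0x30F6


def hiragana_to_katakana(text):
    """Map each hiragana character to its katakana counterpart."""
    return "".join(
        chr(ord(ch) + KANA_OFFSET)
        if HIRAGANA_FIRST <= ord(ch) <= HIRAGANA_LAST
        else ch  # leave everything else (kanji, Latin, etc.) untouched
        for ch in text
    )


def katakana_to_hiragana(text):
    """Map each katakana character to its hiragana counterpart."""
    return "".join(
        chr(ord(ch) - KANA_OFFSET)
        if KATAKANA_FIRST <= ord(ch) <= KATAKANA_LAST
        else ch  # e.g. the long-vowel mark U+30FC is left as-is
        for ch in text
    )


# Both spellings of "wolf" reduce to the same form after mapping.
print(hiragana_to_katakana("おおかみ") == "オオカミ")
print(katakana_to_hiragana("オオカミ") == "おおかみ")
```

A few katakana characters outside this range (such as U+30F7 to U+30FA) have no single-character hiragana counterpart, so the sketch passes them through unchanged.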
Does anyone have any thoughts on whether this would be useful (seems that
it would) and whether it would cause any problems (it must, or otherwise
all the other search engines would do it, right?).
Any idea why it might be different between a Japanese-language wiki and a
non-Japanese-language wiki? We often are more aggressive in matching
between characters that are not native to a given language--for example,
accents on Latin characters are generally ignored on English-language
wikis. So it might make sense to merge hiragana and katakana on
English-language wikis but not Japanese-language wikis.
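For comparison, the accent-insensitive matching mentioned above for English-language wikis can be approximated with Unicode decomposition. This is only a sketch using Python's standard `unicodedata` module; the `fold_accents` helper is a made-up name, and the production analysis chain is not assumed to work this way.

```python
import unicodedata


def fold_accents(text):
    """Strip combining marks so accented Latin letters match their base letters."""
    # NFD splits "é" into "e" plus a combining acute accent (U+0301).
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks and keep the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


forms = ["resume", "resumé", "résumé"]
print({fold_accents(word) for word in forms})  # all three collapse to one form
```

Note that the same decomposition would also separate voicing marks from kana (for example, が becomes か), which is one reason a folding step like this might reasonably be configured per wiki language rather than applied everywhere.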
Thanks very much for any suggestions or information!
 Details of my tests at https://phabricator.wikimedia.org/T173650#3580309
Sr. Software Engineer, Search Platform
Here's a fun visual essay by Xaquín G.V. in collaboration with Google
News Lab. The premise is that many people search for information on
how to do everyday activities, from fixing things around the house, to
cooking, to getting rid of hiccups. There's some neat (albeit vague)
interactivity to see how search is different across different
How did we get anything done before search? :)