SUMMARY: The Search Platform team (formerly part of Discovery) is planning
to fix a long-standing search bug on many wiki projects by disabling the
code in CirrusSearch that re-uses the “fallback” languages (which are
specified for user interface or system messages) for the language analysis
modules (which are used to index words in search). Deployment is planned to
start the week of October 9, 2017.
Messaging fallbacks specify what language to show a message in when there
is no message available in the language of a given wiki. A language
analysis module is language-specific software that processes text to
improve searching—so that, for example, searching for a given word will
find related forms of that word, like "hope, hopes, hoping, hoped" or
"resume, resumé, résumé" on English-language wikis.
Fallback languages for system messages make sense for historical and
cultural reasons—a reader of the Chechen Wikipedia is more likely to
understand a user interface or system message in Russian than in French,
Greek, Hindi, Italian, or Japanese—but the fallbacks don't necessarily make
any linguistic sense. Chechen and Russian, for example, are from unrelated
language families; while the languages have undoubtedly influenced one
another, their grammars are completely different.
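
As a rough illustration of what a language analysis module does, and of why
one language's rules do not transfer to an unrelated language, here is a
minimal Python sketch. It is a toy, not the analysis that CirrusSearch or
Elasticsearch actually performs: the function names and the suffix list are
assumptions made up for this example, and real English stemmers are far more
careful.

```python
import unicodedata

# Toy English-only suffix list (an assumption for illustration; real stemmers differ).
ENGLISH_SUFFIXES = ("ing", "ed", "es", "s", "e")

def fold_accents(word):
    """Drop combining accent marks, so 'résumé' and 'resume' compare equal."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in ENGLISH_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def analyze(word):
    """Lowercase, fold accents, then stem: the term that would be indexed."""
    return toy_stem(fold_accents(word.lower()))

hope_terms = {analyze(w) for w in ["hope", "hopes", "hoping", "hoped"]}
resume_terms = {analyze(w) for w in ["resume", "resumé", "résumé"]}
```

Each set collapses to a single indexed term, which is why a search for one
form finds the others. The suffix list only makes sense for English, so
applying it to text in an unrelated language would merge or split words more
or less arbitrarily.
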
We will deploy the software change that stops messaging fallbacks from being
used as language analysis fallbacks in about two weeks (targeting the week of
October 9, 2017), with any cross-language analysis exceptions explicitly
configured in a new manner. Changes will not take effect immediately on all
affected wikis because each wiki in each language will need to be
re-indexed, which is a separate process that takes time. There may also be
other delays caused by Elasticsearch upgrades or other changes that need
immediate attention.
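
The difference between the old and new behavior can be sketched in a few
lines of Python. This is only a model of the description above, not the
CirrusSearch code: every name here (the dictionaries, the analyzer labels,
the "default" analyzer, and both functions) is hypothetical, and the
Chechen-to-Russian messaging fallback is taken from the example earlier in
this message.

```python
# Hypothetical data for illustration only; not CirrusSearch's real configuration.
MESSAGE_FALLBACKS = {"ce": ["ru"]}           # Chechen UI messages fall back to Russian
ANALYZERS = {"ru": "russian", "en": "english"}  # languages with their own analyzer
EXPLICIT_ANALYSIS_FALLBACKS = {}             # cross-language exceptions, listed by hand
DEFAULT_ANALYZER = "default"                 # assumed language-neutral analysis

def first_available(candidates):
    """Return the analyzer of the first candidate language that has one."""
    for lang in candidates:
        if lang in ANALYZERS:
            return ANALYZERS[lang]
    return DEFAULT_ANALYZER

def pick_analyzer_old(lang):
    """Old behavior: the messaging fallbacks are re-used for analysis."""
    return first_available([lang, *MESSAGE_FALLBACKS.get(lang, [])])

def pick_analyzer_new(lang):
    """New behavior: messaging fallbacks are ignored; only explicit exceptions apply."""
    return first_available([lang, *EXPLICIT_ANALYSIS_FALLBACKS.get(lang, [])])

old_choice = pick_analyzer_old("ce")  # Russian analysis applied to Chechen text
new_choice = pick_analyzer_new("ce")  # falls through to the assumed default
```

In this model a wiki that has its own analyzer is unaffected, and only wikis
that previously inherited an analyzer through a messaging fallback change
behavior, which matches the point that only some wikis need re-indexing.
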
You can track progress of the tasks on Phabricator[1], or read more,
see examples, and get the full list of languages affected on MediaWiki.[2]
[1] https://phabricator.wikimedia.org/T147959
[2] https://www.mediawiki.org/wiki/Wikimedia_Discovery/Disabling_Messaging_Fall…
Trey Jones
Sr. Software Engineer, Search Platform
Wikimedia Foundation
Hello!
Related to my previous incident report [1], we also had an issue with
logstash [2].
Logstash stops collecting logs while elasticsearch / cirrus is down.
This is most probably related to API Feature logs, which are sent
by logstash to the cirrus cluster. Sadly, there is no obvious fix at
this point. It might be possible to tune the elasticsearch output
plugin to fail fast, but that is not obvious from the documentation.
[1] https://wikitech.wikimedia.org/wiki/Incident_documentation/20170920-Elastic…
[2] https://wikitech.wikimedia.org/wiki/Incident_documentation/20170920-Logstash
--
Guillaume Lederrey
Operations Engineer, Discovery
Wikimedia Foundation
UTC+2 / CEST
Hello!
TL;DR: Our recent elasticsearch cluster restart did not go as planned.
Most important lesson learned: we did not understand the recovery
settings correctly.
Yesterday, we did a cold restart of the elasticsearch / cirrus eqiad
cluster. This restart did not go as planned. It did not generate any
user-facing impact, since we moved all the traffic to codfw before the
restart. It did impact logstash (more on that in a different report).
Incident documentation:
https://wikitech.wikimedia.org/wiki/Incident_documentation/20170920-Elastic…
Have fun!
Guillaume
--
Guillaume Lederrey
Operations Engineer, Discovery
Wikimedia Foundation
UTC+2 / CEST
We recently got a suggestion via Phabricator[1] to automatically map
between hiragana and katakana when searching on English Wikipedia and other
wiki projects. As an always-on feature, this isn't difficult to implement,
but major commercial search engines (Google.jp, Bing, Yahoo Japan,
DuckDuckGo, Goo) don't do that. They give different results when searching
for hiragana/katakana forms (for example, オオカミ/おおかみ "wolf"). They also give
different *numbers* of results, seeming to indicate that it's not just
re-ordering the same results (say, so that results in the same script are
ranked higher).[2] I want to know what they know that I don't!
Does anyone have any thoughts on whether this would be useful (seems that
it would) and whether it would cause any problems (it must, or otherwise
all the other search engines would do it, right?).
Any idea why it might be different between a Japanese-language wiki and a
non-Japanese-language wiki? We often are more aggressive in matching
between characters that are not native to a given language--for example,
accents on Latin characters are generally ignored on English-language
wikis. So it might make sense to merge hiragana and katakana on
English-language wikis but not Japanese-language wikis.
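
For concreteness, the always-on mapping could look something like the Python
sketch below. It is only an illustration and says nothing about how
CirrusSearch or Elasticsearch would implement it: the function names are made
up, and it relies on the fact that the main katakana block in Unicode
(U+30A1 to U+30F6) sits exactly 0x60 code points above the corresponding
hiragana.

```python
# Unicode range of katakana letters that have a hiragana counterpart.
KATAKANA_START = 0x30A1  # ァ
KATAKANA_END = 0x30F6    # ヶ
KANA_OFFSET = 0x60       # distance between the katakana and hiragana blocks

def katakana_to_hiragana(text):
    """Map each katakana letter to hiragana; leave every other character alone."""
    return "".join(
        chr(ord(ch) - KANA_OFFSET) if KATAKANA_START <= ord(ch) <= KATAKANA_END else ch
        for ch in text
    )

def same_ignoring_kana_type(a, b):
    """True if two strings differ only in hiragana versus katakana."""
    return katakana_to_hiragana(a) == katakana_to_hiragana(b)

folded = katakana_to_hiragana("オオカミ")
matches = same_ignoring_kana_type("オオカミ", "おおかみ")
```

Here the katakana form folds to おおかみ, so the two spellings of "wolf"
would match. Characters outside that range, such as the long-vowel mark ー
or any kanji, pass through unchanged, which is one place where a naive
mapping like this could behave unexpectedly.
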
Thanks very much for any suggestions or information!
—Trey
[1] https://phabricator.wikimedia.org/T176197
[2] Details of my tests at https://phabricator.wikimedia.org/T173650#3580309
Trey Jones
Sr. Software Engineer, Search Platform
Wikimedia Foundation
Hello,
Here's a fun visual essay by Xaquín G.V. in collaboration with Google
News Lab. The premise is that many people search for information on
how to do everyday activities, from fixing things around the house to
cooking to getting rid of hiccups. There's some neat (albeit vague)
interactivity to see how search is different across different
demographics.
How did we get anything done before search? :)
http://how-to-fix-a-toilet.com
Yours,
Chris Koerner
Community Liaison
Wikimedia Foundation