Discovery September 2017

discovery@lists.wikimedia.org

4 participants
9 discussions

Disabling Messaging Fallbacks for Language Analysis
by Trey Jones 25 Oct '17

SUMMARY: The Search Platform team (formerly part of Discovery) is planning to fix a long-standing search bug on many wiki projects by disabling the code in CirrusSearch that re-uses the “fallback” languages (which are specified for user interface or system messages) for the language analysis modules (which are used to index words in search). Deployment is planned to start the week of October 9, 2017.

Messaging fallbacks specify what language to show a message in when there is no message available in the language of a given wiki. A language analysis module is language-specific software that processes text to improve searching, so that, for example, searching for a given word will find related forms of that word, like "hope, hopes, hoping, hoped" or "resume, resumé, résumé" on English-language wikis.

Fallback languages for system messages make sense for historical and cultural reasons (a reader of the Chechen Wikipedia is more likely to understand a user interface or system message in Russian than in French, Greek, Hindi, Italian, or Japanese), but the fallbacks don't necessarily make any linguistic sense. Chechen and Russian, for example, are from unrelated language families; while the languages have undoubtedly influenced one another, their grammars are completely different.

We will deploy the software change that disables the use of messaging fallbacks as language analysis fallbacks in about two weeks (targeting the week of October 9, 2017), with any cross-language analysis exceptions explicitly configured in a new manner. Changes will not take effect immediately on all affected wikis because each wiki in each language will need to be re-indexed, which is a separate process that takes time. There may also be other delays caused by Elasticsearch upgrades or other changes that need immediate attention.

You can also track progress of the tasks on Phabricator[1] or read more, see examples, and get the full list of languages affected on MediaWiki.[2]

[1] https://phabricator.wikimedia.org/T147959
[2] https://www.mediawiki.org/wiki/Wikimedia_Discovery/Disabling_Messaging_Fall…

Trey Jones
Sr. Software Engineer, Search Platform
Wikimedia Foundation
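The difference between the two kinds of fallback described in this message can be sketched in a few lines of Python. This is an illustration only: the table contents, the analyzer names, the use of English as a last-resort message language, and the function names (`message_language`, `analyzer_for`) are assumptions made for the example and are not CirrusSearch's actual configuration or API. Only the Chechen-to-Russian messaging fallback is taken from the announcement.

```python
# Hypothetical data: only the Chechen ("ce") -> Russian ("ru") messaging
# fallback comes from the announcement; everything else is illustrative.
MESSAGE_FALLBACKS = {"ce": ["ru"]}
AVAILABLE_MESSAGES = {
    "ru": {"search-button": "Найти"},
    "en": {"search-button": "Search"},
}
ANALYZERS = {"ru": "russian", "en": "english"}  # language -> analysis module
DEFAULT_ANALYZER = "default"  # assumed name for language-neutral analysis


def message_language(wiki_lang, key):
    """Pick the language to show a UI message in, following messaging fallbacks."""
    # Assumption: English is tried last when nothing else has the message.
    for lang in [wiki_lang, *MESSAGE_FALLBACKS.get(wiki_lang, []), "en"]:
        if key in AVAILABLE_MESSAGES.get(lang, {}):
            return lang
    return None


def analyzer_for(wiki_lang, reuse_message_fallbacks):
    """Pick a language analysis module for indexing a wiki's text."""
    candidates = [wiki_lang]
    if reuse_message_fallbacks:
        # Behaviour being disabled: borrow the analyzer of a messaging
        # fallback language, even if it is linguistically unrelated.
        candidates += MESSAGE_FALLBACKS.get(wiki_lang, [])
    for lang in candidates:
        if lang in ANALYZERS:
            return ANALYZERS[lang]
    return DEFAULT_ANALYZER


print(message_language("ce", "search-button"))            # messaging fallback still applies
print(analyzer_for("ce", reuse_message_fallbacks=True))   # before the change
print(analyzer_for("ce", reuse_message_fallbacks=False))  # after the change
```

In this sketch the messaging lookup keeps following the fallback chain, while the analyzer lookup stops borrowing from it once the flag is off, which mirrors the separation the announcement describes.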

Discovery Weekly Update for the week starting 2017-09-18
by Chris Koerner 26 Sep '17

Hello,

It's been a busy week in Discoveryland. Here are the updates from the Discovery team for last week. As always, feedback and questions are welcome.

Reminder: There is a new way to follow these weekly updates. You can subscribe to receive on-wiki (or opt-in email) notifications of the Discovery weekly update. Subscribe to be notified!
https://www.mediawiki.org/wiki/Newsletter:Discovery_Weekly

==Highlights==
* The explore similar language links A/B test has been completed and analyzed. Unfortunately, as the report documents, we only had one clickthrough to an article written in a different language (which was displayed in the new language links). We will not be going forward with this feature. [0]
** However, if a user wants to have the language link script added to their logged-in account, please follow these instructions. [1]
** The full explore similar script (displays related articles, categories and language links) can also be enabled for logged-in users; see the instructions. [2]
* This latest A/B test (as noted directly above) effectively closes out the additional features that the Discovery Department was exploring to possibly add to the search engine results page (SERP); additional details can be read online [3]; overall A/B testing details can be found on mw.org [4], along with self-guided testing instructions. [5]

==Discussions==
=== Search ===
* After successfully testing and deploying the machine learning-to-rank model on English Wikipedia [6], this week we deployed a new test to 18 other wikis that have >1% of traffic. [7]
* For the relevance survey, Erik developed backend infrastructure to support lots of queries and lots of results per query [8], and the third running of the test was turned off this week [9]; analysis will be detailed in [10]
* The Chinese wiki was re-indexed [11], allowing multi-hyphen tokens to be enabled in production [12]
* The Hebrew language wikis were also re-indexed [13] and the HebMorph plugin was deployed [14]
* We updated Vagrant to include new language plugins (Polish, Ukrainian, Chinese and Hebrew) [15]
* After some exhaustive investigation, we've resolved the recent load spikes on the elasticsearch cluster in eqiad [16]
* Jan has nearly finished the first Selenium test re-written from Ruby to Node.js and has learned a lot in the process. This first test will help to pave the way forward for the rest of the tests that will need to be re-written [17] [18]
* We've completed testing for adding support of interleaved search results [19] and are currently wrapping up the analysis of the test [20]
* Fixed an issue with mixed versions of the ltr plugin being deployed on elastic1020 [21]
* Erik created a few bash scripts to send from terbium when reindexing the default namespaces [xx] (which were moved from general to content indices); this will go into effect when we reindex the wikis again [22]
* The second running of the explore similar A/B test for language links was completed on Thursday [23] and analysis is complete [24]; the report can be read online. [25]

=== Analysis ===
* Chelsy finalized her work of creating a (mostly) automated and parameterized report template for the Search Platform team's A/B tests [26]
* Chelsy also completed some additional API usage breakout (internal vs external) on the metrics dashboard [27] [28]
* Chelsy also finalized a new method to keep data longer (that isn't in a dashboard) by adding reports into golden (/srv/published-datasets/discovery) [29]
* Mikhail created a dashboard to track the prevalence of sister project search results on fulltext search result pages on desktop, broken up by language. For example, it turns out that nearly 80% of fulltext searches show sister projects on enwiki. [30]

=== Portal ===
* Jan has been working on updating the Wikipedia portal, to adjust the languages used for Chinese translations [31]

=== Maps ===
* Gehel cleared up some VM space on Horizon by deleting 4 unused maps-team instances [32]
* The map service has been upgraded to Node.js 6.11 [33]
* Map traffic has been enabled for active / active service (serving map tiles from both data centers) [34]

[0] https://analytics.wikimedia.org/datasets/discovery/reports/Explore_Similar_…
[1] https://www.mediawiki.org/wiki/Cross-wiki_Search_Result_Improvements/explor…
[2] https://www.mediawiki.org/wiki/Cross-wiki_Search_Result_Improvements/self-g…
[3] https://www.mediawiki.org/wiki/Cross-wiki_Search_Result_Improvements
[4] https://www.mediawiki.org/wiki/Cross-wiki_Search_Result_Improvements/Testing
[5] https://www.mediawiki.org/wiki/Cross-wiki_Search_Result_Improvements/self-g…
[6] https://phabricator.wikimedia.org/T175772
[7] https://phabricator.wikimedia.org/T175771
[8] https://phabricator.wikimedia.org/T174387
[9] https://phabricator.wikimedia.org/T175047
[10] https://phabricator.wikimedia.org/T174106
[11] https://phabricator.wikimedia.org/T173464
[12] https://phabricator.wikimedia.org/T172653
[13] https://phabricator.wikimedia.org/T167058
[14] https://phabricator.wikimedia.org/T167057
[15] https://phabricator.wikimedia.org/T164367
[16] https://phabricator.wikimedia.org/T169498
[17] https://gerrit.wikimedia.org/r/#/c/378688/
[18] https://phabricator.wikimedia.org/T174103
[19] https://phabricator.wikimedia.org/T150032
[20] https://phabricator.wikimedia.org/T171215
[21] https://phabricator.wikimedia.org/T175951
[22] https://phabricator.wikimedia.org/T176397
[23] https://phabricator.wikimedia.org/T175649
[24] https://phabricator.wikimedia.org/T175650
[25] https://analytics.wikimedia.org/datasets/discovery/reports/Explore_Similar_…
[26] https://phabricator.wikimedia.org/T131795
[27] http://discovery.wmflabs.org/metrics/#referer_breakdown
[28] https://phabricator.wikimedia.org/T172452
[29] https://phabricator.wikimedia.org/T172453
[30] https://discovery.wmflabs.org/metrics/#sister_search_prevalence
[31] https://phabricator.wikimedia.org/T171647
[32] https://phabricator.wikimedia.org/T175998
[33] https://phabricator.wikimedia.org/T171707
[34] https://phabricator.wikimedia.org/T162362

----
The archive of all past updates can be found on MediaWiki.org:
https://www.mediawiki.org/wiki/Discovery/Status_updates

Interested in getting involved? See tasks marked as "Easy" or "Volunteer needed" in Phabricator.
[1] https://phabricator.wikimedia.org/maniphest/query/qW51XhCCd8.7/#R
[2] https://phabricator.wikimedia.org/maniphest/query/5KEPuEJh9TPS/#R

Yours,
Chris Koerner
Community Liaison
Wikimedia Foundation

Logstash not collecting logs while the elasticsearch cirrus cluster is down
by Guillaume Lederrey 21 Sep '17

Hello!

Related to my previous incident report [1], we also had an issue with logstash [2]. Logstash stops collecting logs while elasticsearch / cirrus is down. This is most probably related to API Feature logging, which is sent by logstash to the cirrus cluster.

Sadly, there is no obvious fix at this point. It might be possible to tune the elasticsearch output plugin to fail fast, but how to do that is not obvious from the documentation.

[1] https://wikitech.wikimedia.org/wiki/Incident_documentation/20170920-Elastic…
[2] https://wikitech.wikimedia.org/wiki/Incident_documentation/20170920-Logstash

--
Guillaume Lederrey
Operations Engineer, Discovery
Wikimedia Foundation
UTC+2 / CEST

Failed restart of the elasticsearch eqiad cluster
by Guillaume Lederrey 21 Sep '17

Hello!

TL;DR: Our recent elasticsearch cluster restart did not go as planned. Most important lesson learned: we did not understand the recovery settings correctly.

Yesterday, we did a cold restart of the elasticsearch / cirrus eqiad cluster. This restart did not go as planned. It did not have any user-facing impact, since we moved all the traffic to codfw before the restart. It did impact logstash (more on that in a different report).

Incident documentation: https://wikitech.wikimedia.org/wiki/Incident_documentation/20170920-Elastic…

Have fun!

Guillaume

--
Guillaume Lederrey
Operations Engineer, Discovery
Wikimedia Foundation
UTC+2 / CEST

Mapping Hiragana and Katakana
by Trey Jones 20 Sep '17

We recently got a suggestion via Phabricator[1] to automatically map between hiragana and katakana when searching on English Wikipedia and other wiki projects.

As an always-on feature, this isn't difficult to implement, but major commercial search engines (Google.jp, Bing, Yahoo Japan, DuckDuckGo, Goo) don't do that. They give different results when searching for hiragana/katakana forms (for example, オオカミ/おおかみ "wolf"). They also give different *numbers* of results, seeming to indicate that it's not just re-ordering the same results (say, so that results in the same script are ranked higher).[2]

I want to know what they know that I don't!

Does anyone have any thoughts on whether this would be useful (seems that it would) and whether it would cause any problems (it must, or otherwise all the other search engines would do it, right?).

Any idea why it might be different between a Japanese-language wiki and a non-Japanese-language wiki? We often are more aggressive in matching between characters that are not native to a given language--for example, accents on Latin characters are generally ignored on English-language wikis. So it might make sense to merge hiragana and katakana on English-language wikis but not Japanese-language wikis.

Thanks very much for any suggestions or information!

Trey

[1] https://phabricator.wikimedia.org/T176197
[2] Details of my tests at https://phabricator.wikimedia.org/T173650#3580309

Trey Jones
Sr. Software Engineer, Search Platform
Wikimedia Foundation
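To make the proposal in this message concrete, here is a minimal Python sketch of the two kinds of character folding it compares: mapping hiragana to katakana, and ignoring accents on Latin characters. It is not the Elasticsearch or CirrusSearch implementation; the function names `fold_kana` and `fold_accents` are made up for this example, and the kana mapping relies only on the fact that the two syllabaries sit 0x60 code points apart in Unicode.

```python
import unicodedata

# Hiragana U+3041..U+3096 and katakana U+30A1..U+30F6 are laid out in
# parallel in Unicode, so each hiragana maps to its katakana at +0x60.
_HIRAGANA_TO_KATAKANA = {cp: cp + 0x60 for cp in range(0x3041, 0x3097)}


def fold_kana(text):
    """Map hiragana to katakana; all other characters pass through unchanged."""
    return text.translate(_HIRAGANA_TO_KATAKANA)


def fold_accents(text):
    """Drop combining marks from Latin text, e.g. 'résumé' -> 'resume'.

    Meant for Latin text only: applied to kana, it would also strip
    voicing marks, because those decompose to combining characters too.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Both spellings of "wolf" from the message index to the same string.
print(fold_kana("おおかみ") == fold_kana("オオカミ"))
print(fold_accents("résumé"))
```

Applying a fold like `fold_kana` at both index time and query time is what would make the two scripts match each other; whether that is desirable on a given wiki is exactly the open question raised above.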

Discovery Weekly Update for the week starting 2017-09-11
by Chris Koerner 19 Sep '17

Hello,

Here are the updates from the Discovery team for last week. As always, feedback and questions are welcome.

Reminder: There is a new way to follow these weekly updates. You can subscribe to receive on-wiki (or opt-in email) notifications of the Discovery weekly update. Subscribe to be notified.
https://www.mediawiki.org/wiki/Newsletter:Discovery_Weekly

==Discussions==
=== Search ===
* Gehel did quite a bit of work to implement the new version of Logstash on our servers; thanks for the help, RobH! [0]
* David, Gehel and Moritz updated the elasticsearch deployment plugins to use Debian packages instead of salt [1]
* Stas and David fixed an issue where Wikidata Elastic search drops results whose matches are on labels in a different language (e.g., you search in English but a Spanish label matches) [2]
* The explore similar A/B test for language links was turned on Sep 14 and will run for a week. [3] [4]

=== Analysis ===
* Mikhail finished up a new dashboard metric for dwell time on SERP [5] [6]
* Chelsy finished up the final analysis of the results of the swap2and3 search test and it's up on Commons. [7] [8]

[0] https://phabricator.wikimedia.org/T175045
[1] https://phabricator.wikimedia.org/T158560
[2] https://phabricator.wikimedia.org/T173231
[3] https://phabricator.wikimedia.org/T175647
[4] https://phabricator.wikimedia.org/T175648
[5] https://discovery.wmflabs.org/metrics/#spr_surv
[6] https://phabricator.wikimedia.org/T170468
[7] https://phabricator.wikimedia.org/T136017
[8] https://commons.wikimedia.org/wiki/File:Swap2and3_Search_Test_Analysis.pdf

----
The archive of all past updates can be found on MediaWiki.org:
https://www.mediawiki.org/wiki/Discovery/Status_updates

Interested in getting involved? See tasks marked as "Easy" or "Volunteer needed" in Phabricator.
[1] https://phabricator.wikimedia.org/maniphest/query/qW51XhCCd8.7/#R
[2] https://phabricator.wikimedia.org/maniphest/query/5KEPuEJh9TPS/#R

Yours,
Chris Koerner
Community Liaison
Wikimedia Foundation

Discovery Weekly Update for the week starting 2017-09-04
by Chris Koerner 12 Sep '17

Hello,

A few updates this week from across the Discovery team.

Programming note: There is a new way to follow these weekly updates. You can subscribe to receive on-wiki (or opt-in email) notifications of the Discovery weekly update. Subscribe to be notified.
https://www.mediawiki.org/wiki/Newsletter:Discovery_Weekly

== Discussions ==
* Paul conducted a cartographic review of the new map styles for the Design team on 6 Sep 2017.

=== Search ===
* Gehel updated logstash to stop deploying custom elasticsearch plugins that are no longer used and did a rolling restart of elasticsearch/logstash [0]
* David upgraded an extra plugin as part of the upgrade to elastic 5.5.x [1]
* Erik turned on a third test for the Search Relevance Survey (graded by humans); the test is expected to take a week. [2]
* Erik trained MLR models for all wikis with >= 1% of search traffic, in preparation for an A/B test next week
* Erik is investigating potential memory leaks in MLR model training, which make training for large wikis (de, en) error-prone

=== Portal ===
* Jan fixed a bug where the typeahead wasn't working for a single character [3]
* Jan also updated the Wikipedia portal statistics and translations on 6 Sep 2017 [4] [5]

=== Maps ===
* Gehel finished up reimaging the maps-test servers [6]

[0] https://phabricator.wikimedia.org/T174933
[1] https://phabricator.wikimedia.org/T174652
[2] https://phabricator.wikimedia.org/T175046
[3] https://phabricator.wikimedia.org/T173885
[4] https://phabricator.wikimedia.org/T128546
[5] https://phabricator.wikimedia.org/T142582
[6] https://phabricator.wikimedia.org/T169011

----
The archive of all past updates can be found on MediaWiki.org:
https://www.mediawiki.org/wiki/Discovery/Status_updates

Interested in getting involved? See tasks marked as "Easy" or "Volunteer needed" in Phabricator.
[1] https://phabricator.wikimedia.org/maniphest/query/qW51XhCCd8.7/#R
[2] https://phabricator.wikimedia.org/maniphest/query/5KEPuEJh9TPS/#R

Yours,
Chris Koerner
Community Liaison
Wikimedia Foundation

An essay about searching for "how-tos"
by Chris Koerner 07 Sep '17

Hello,

Here's a fun visual essay by Xaquín G.V. in collaboration with Google News Lab. The premise is that many people search for information on how to do everyday activities, from fixing things around the house, to cooking, to getting rid of hiccups. There's some neat (albeit vague) interactivity to see how search differs across demographics.

How did we get anything done before search? :)

http://how-to-fix-a-toilet.com

Yours,
Chris Koerner
Community Liaison
Wikimedia Foundation

Discovery Weekly Update for the week starting 2017-08-28
by Chris Koerner 06 Sep '17

Hi,

Here is the weekly status update from the Discovery team. Feedback and questions are welcome.

== Discussions ==
=== Search ===
* Erik turned off the interleaved search results test and the search relevance test [0] [1]
* Jan completed a minor update to the UI of the sister project snippets [2]
* Erik continued work on mitigating future occurrences of search latency degradation [3]

=== Analysis ===
* Mikhail finished up the analysis for the first two A/B tests for the search relevance testing (grading by humans) [4] [5]
* Chelsy put the final touches on the explore similar A/B test analysis [6] [7]
* Mikhail reviewed an upcoming patch to add purge info for the Kartographer schema [8]

=== Maps ===
* Gehel aligned the maps* and maps-test* configurations to be the same [9]
* After many discussions on what a reasonable per-IP ratelimit for maps is, a decision was made and promoted to production on Sep 4 [10]

== Other Noteworthy Stuff ==
* Trey wrote a post for the Wikimedia Blog: "Wikipedia, search, and the 'Цкщтп' keyboard" [11]

[0] https://phabricator.wikimedia.org/T171214
[1] https://phabricator.wikimedia.org/T171742
[2] https://phabricator.wikimedia.org/T171804
[3] https://phabricator.wikimedia.org/T169498
[4] https://wikimedia-research.github.io/Discovery-Search-Adhoc-SurveyMVP/
[5] https://phabricator.wikimedia.org/T171740
[6] https://wikimedia-research.github.io/Discovery-Search-Test-ExploreSimilar/
[7] https://phabricator.wikimedia.org/T164857
[8] https://phabricator.wikimedia.org/T171622
[9] https://phabricator.wikimedia.org/T169082
[10] https://phabricator.wikimedia.org/T169175
[11] https://blog.wikimedia.org/2017/08/28/wikipedia-search-phonetic-keyboards/

---
The archive of all past updates can be found on MediaWiki.org:
https://www.mediawiki.org/wiki/Discovery/Status_updates

Interested in getting involved? See tasks marked as "Easy" or "Volunteer needed" in Phabricator.
[1] https://phabricator.wikimedia.org/maniphest/query/qW51XhCCd8.7/#R
[2] https://phabricator.wikimedia.org/maniphest/query/5KEPuEJh9TPS/#R

Yours,
Chris Koerner
Community Liaison
Wikimedia Foundation
