Discovery September 2016

discovery@lists.wikimedia.org

11 participants
12 discussions

Re: [discovery] [Analytics] Search queries: Use of modifiers
by David Causse 30 Sep '16

Hi Jan,

[cc discovery mailing list]

I'm glad you reached out to this list, because I'm very interested in learning more about this session. The closest report we have on usage of special search syntax is an analysis done to classify zero result rate by query feature [1]. Unfortunately, this analysis is not focused on special search syntax and addresses only a few of the keywords supported by CirrusSearch.

I created a ticket to learn more about this [2]. Once it is resolved, we will just have to wait to gather some data, and then we will be able to provide this information.

PS. If you have more info about what happened during this session, it would be much appreciated.

Thanks!
David.

[1] https://upload.wikimedia.org/wikipedia/commons/2/28/From_Zero_to_Hero_-_Ant…
[2] https://phabricator.wikimedia.org/T147045

On 30/09/2016 at 09:25, Jan Dittrich wrote:
> Hello Analytics,
>
> Wikipedia’s search function exposes several modifiers
> (https://www.mediawiki.org/wiki/Help:CirrusSearch)
> On the recent German Wikicon there was a workshop on search and
> several community members seemed to be enthusiastic about these functions.
>
> I wonder if there is existing information about the current use of
> such queries. I did some research, but I could not find out much.
> Such information could help to improve the search function, since
> sometimes a few modifiers are heavily used (despite them being hard to
> access) and could e.g. be exposed via the user interface.
>
> Jan
>
> --
> Jan Dittrich
> UX Design/ User Research
>
> Wikimedia Deutschland e.V. | Tempelhofer Ufer 23-24 | 10963 Berlin
> Phone: +49 (0)30 219 158 26-0
> http://wikimedia.de
>
> Imagine a world, in which every single human being can freely share in
> the sum of all knowledge. That’s our commitment.
>
> Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
> Registered in the register of associations of the Berlin-Charlottenburg
> district court under number 23855 B. Recognized as a non-profit by the
> Finanzamt für Körperschaften I Berlin, tax number 27/029/42207.
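
As a rough illustration of the kind of measurement discussed above, the following Python sketch counts how many queries in a sample use a given search keyword. It is only a sketch under stated assumptions: the keyword list is a small subset taken from Help:CirrusSearch, the function names are made up for this example, and the sample queries are invented rather than taken from any real log.

```python
import re
from collections import Counter

# Assumed, non-exhaustive subset of the keywords listed on Help:CirrusSearch.
KEYWORDS = ("intitle", "insource", "incategory", "hastemplate",
            "linksto", "prefix", "morelike")

# A keyword counts only when it starts a token (optionally negated with "-")
# and is immediately followed by a colon.
KEYWORD_RE = re.compile(r"(?<!\w)-?(" + "|".join(KEYWORDS) + r"):",
                        re.IGNORECASE)


def keywords_used(query):
    """Return the set of known keywords that appear in one query string."""
    return {match.group(1).lower() for match in KEYWORD_RE.finditer(query)}


def keyword_usage(queries):
    """Count, per keyword, how many queries use it at least once."""
    counts = Counter()
    for query in queries:
        counts.update(keywords_used(query))
    return counts


if __name__ == "__main__":
    # Invented sample queries, for illustration only.
    sample = [
        "intitle:foo bar",
        "plain query",
        "insource:/regex/ -intitle:baz",
        "INCATEGORY:Cats",
    ]
    for keyword, count in keyword_usage(sample).most_common():
        print(keyword, count)
```

Counting each keyword once per query (rather than once per occurrence) answers the question "how many queries use this modifier", which seems closest to what was asked; a per-occurrence count would need only a small change.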

Discovery Weekly Update for the week starting 2016-09-19
by Chris Koerner 24 Sep '16

Hello,

Here is the Discovery status update for the week starting 19 September. Feedback and questions are welcome.

== Discussions ==
* The team had several conversations about Elasticsearch topics (ICU_folding), external referrers, how to get db access for investigations, and a mock design for displaying the upcoming cross-wiki search results
* Data Demolition Derby was completed for cleaning up old Qualtrics survey data

== Events and News ==

== Interactive ==
* Geoshapes service can now get both polygons (e.g. city outlines) and lines (e.g. rivers and roads) by their Wikidata ID
* "properties" in GeoJSON now apply to the ExternalData geoshapes

== Search ==
* CompletionSuggester with defaultsort demo available, feedback welcome [1]
* Completed analysis of squashing Russian stress accents and folding ё to е for Russian wikis [2]
* Updated Elasticsearch document versioning in CirrusSearch [3]
** This is contingent on a full cluster restart and a deployment of the configuration change [4]
* Updated Completion Suggester code for searching a subpage title using the search bar [5]
* Added position increment gap to fields where positions are stored [6]
* Documented our usage of the term 'PaulScore' [7]
* Upgraded Elasticsearch and plugins to 2.3.5 [8]
* Monitor the usage of in-memory data structures used by Elasticsearch [9]

== Analysis ==
* Completed analysis to determine if the new Wikipedia.org portal page display caused any pageview decreases to smaller wikis (UK) (analysis html and pdf docs) [10] [11] [12]
* Updated dashboards to allow for easier bookmarking/copying [13]

== Wikidata Query Service ==
* Fixed a bug where a search for insource:tag finds "<tag>" but not "{{#tag:tag}}" [14]

== Wikipedia.org Portal ==
* Requested additional translations from translatewiki [15]
* Optimize the image optimization [16]

== Other Noteworthy Stuff ==
* Yuri and Julien are attending the State of the Map 2016 conference in Brussels

[1] http://mw-sug-subpages-relforge.wmflabs.org/w/default_sort_demo.html
[2] https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Removing_Stress_Acce…
[3] https://phabricator.wikimedia.org/T144039
[4] https://phabricator.wikimedia.org/T146210
[5] https://phabricator.wikimedia.org/T123015
[6] https://phabricator.wikimedia.org/T145405
[7] https://phabricator.wikimedia.org/T144243
[8] https://phabricator.wikimedia.org/T145404
[9] https://phabricator.wikimedia.org/T144387
[10] https://phabricator.wikimedia.org/T143853
[11] http://wikimedia-research.github.io/Discovery-Research-Portal/ukrainian/
[12] https://commons.wikimedia.org/wiki/File:The_Wikipedia.org_Portal_and_Ukrain…
[13] https://phabricator.wikimedia.org/T145478
[14] https://phabricator.wikimedia.org/T145023
[15] https://phabricator.wikimedia.org/T143338
[16] https://phabricator.wikimedia.org/T143208

----

The full update, and archive of past updates, can be found on MediaWiki.org:
https://www.mediawiki.org/wiki/Discovery/Status_updates

Interested in getting involved? See tasks marked as Easy or Volunteer needed in Phabricator:
[1] https://phabricator.wikimedia.org/maniphest/query/qW51XhCCd8.7/#R
[2] https://phabricator.wikimedia.org/maniphest/query/5KEPuEJh9TPS/#R

--
Yours,
Chris Koerner
Community Liaison - Discovery
Wikimedia Foundation
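
The Russian folding item in the Search section above (squashing stress accents and folding ё to е) can be illustrated with a small Python sketch. This is not the analysis chain used in CirrusSearch or Elasticsearch; it is a minimal stand-in built on the standard `unicodedata` module, and the function name `fold_russian` is made up for this example.

```python
import unicodedata

# U+0301 COMBINING ACUTE ACCENT is how stress is usually marked in Russian text.
COMBINING_ACUTE = "\u0301"


def fold_russian(text):
    """Strip stress accents and fold ё to е, leaving other letters intact."""
    # Decompose so that precomposed accented letters expose the combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    without_stress = decomposed.replace(COMBINING_ACUTE, "")
    # Recompose so that letters such as й and ё become single code points again.
    recomposed = unicodedata.normalize("NFC", without_stress)
    return recomposed.replace("ё", "е").replace("Ё", "Е")


if __name__ == "__main__":
    print(fold_russian("Москва\u0301"))  # stress mark removed
    print(fold_russian("ёлка"))          # ё folded to е
    print(fold_russian("йод"))           # й is kept as is
```

Note that this sketch also strips the acute accent from non-Cyrillic letters (for example é becomes e), which may or may not be wanted depending on the wiki.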

New Glossary of Search Terms
by Trey Jones 19 Sep '16

Hi everyone, I've created a first draft of a small glossary of terms we use in search, including internal-only vocab (PaulScore, Discernatron, RelForge, etc.) and some general vocab (recall, precision, F1, DCG, etc). The glossary lives on mediawiki.org: https://www.mediawiki.org/wiki/Wikimedia_Discovery/Search/Glossary This isn't an overly formal glossary, so some of my opinions may have made it into the definitions. Feel free to edit, expand, editorialize, or even suggest new items to be defined. Thanks, —Trey Trey Jones Software Engineer, Discovery Wikimedia Foundation
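
For readers new to the general vocabulary mentioned above (precision, recall, F1, DCG), here is a short Python sketch of one common formulation of each. The definitions are the textbook ones and are not copied from the glossary itself; the function names and the sample numbers are chosen only for this example.

```python
import math


def precision_recall(true_pos, false_pos, false_neg):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    predicted = true_pos + false_pos
    relevant = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / relevant if relevant else 0.0
    return precision, recall


def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean; beta=1 gives F1, beta=0.5 favors precision."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


def dcg(relevances):
    """Discounted cumulative gain: sum of rel_i / log2(i + 1), ranks from 1."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


if __name__ == "__main__":
    p, r = precision_recall(true_pos=8, false_pos=2, false_neg=8)
    print("precision", p, "recall", r)
    print("F1", round(f_beta(p, r), 3))
    print("F0.5", round(f_beta(p, r, beta=0.5), 3))
    print("DCG", round(dcg([3, 2, 0, 1]), 3))
```

Other DCG variants exist (for example using 2^rel - 1 as the gain), so results are only comparable when the same variant is used throughout.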

Discovery Weekly Update for the week starting 2016-09-12
by Chris Koerner 17 Sep '16

Hello,

Here is the Discovery status update for the week starting 12 September. Feedback and questions are welcome.

== Discussions ==
* We had a Discernatron demo on Tuesday, Sep 13 that was a success! [1]
* Several Discovery team members attended a two-day Product and Tech on-site meeting in the SF office
* Mikhail and Chelsy attended a two-day Analyst on-site meeting in the SF office

== Interactive ==
* You can now insert a link to a popup map with overlays on all wikis - <maplink> [2]
* Map frame support is now enabled on all Wikipedia sister projects, but not Wikipedia itself, except for the HE, MK, and CA wikis - <mapframe> [3]

== Portal ==
* Updated language article stats on Wikipedia.org [4]
* Updated the language list dropdown button phrase to now display: Read Wikipedia in your language [5]
* Updated the metadata description for Wikipedia to read: "Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation." [6]

== Search ==
* Automagically fixed ÿ in Spécial:IndexPages search by implementing ascii-folding in French [7] [8]
* Tried to update the Elasticsearch plugins to 2.4.0 but found out we couldn't because our analysis plugins are not compatible [9]
* Added defaultsearchkeys to wiki search autocomplete [10]
* Fixed bug - mwgrep and "insource:" search is missing lots of pages in its index [11]
* Make Elasticsearch actually have shard allocation awareness [12]
* Fixed bug - "You may create the page" suggestion does not appear if the search contains 'AND', 'OR', or 'NOT' anywhere, even when these are not used as special syntax [13]
* Updated WikimediaMaintenance/addWiki.php to create cirrus indices on all available clusters [14]

== Analysis ==
* Tried finding ways to combine cirrus search logs with engagement data [15]
* Added event logging for the Wikipedia portal to capture the language selected in the search box [16]
* Updated the mobile dashboard for new features added in event logging [17]
* Added 'other' pageviews to the dashboard, accounting for 'keep alive' and other hits to the Wikipedia.org page [18]
* Updated the data access guidelines for search logs, from fluorine to Hive [19]

== Wikidata Query Service ==
* Materials and videos from the SPARQL Workshop on September 8th are published [20]

== Other Noteworthy Stuff ==
* We started to experiment with ORES WP10 as a new relevance factor in fulltext search queries (T145644) [21]
* There's a new glossary of search-related terms available on mediawiki.org. Feel free to request additional definitions! [22]

[1] https://phabricator.wikimedia.org/T144026
[2] https://www.mediawiki.org/wiki/Help:Extension:Kartographer#.3Cmaplink.3E
[3] https://www.mediawiki.org/wiki/Help:Extension:Kartographer#.3Cmapframe.3E_u…
[4] https://phabricator.wikimedia.org/T128546
[5] https://phabricator.wikimedia.org/T143244
[6] https://phabricator.wikimedia.org/T143239
[7] https://phabricator.wikimedia.org/T141216
[8] https://phabricator.wikimedia.org/T144429
[9] https://phabricator.wikimedia.org/T145199
[10] https://phabricator.wikimedia.org/T134978
[11] https://phabricator.wikimedia.org/T127788
[12] https://phabricator.wikimedia.org/T143571
[13] https://phabricator.wikimedia.org/T122309
[14] https://phabricator.wikimedia.org/T142181
[15] https://phabricator.wikimedia.org/T145124
[16] https://phabricator.wikimedia.org/T143149
[17] https://phabricator.wikimedia.org/T143726
[18] https://phabricator.wikimedia.org/T143605
[19] https://phabricator.wikimedia.org/T145149
[20] https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/2016_SPARQL_Wor…
[21] https://phabricator.wikimedia.org/T145644
[22] https://www.mediawiki.org/wiki/Wikimedia_Discovery/Search/Glossary

----

The full update, and archive of past updates, can be found on MediaWiki.org:
https://www.mediawiki.org/wiki/Discovery/Status_updates

Interested in getting involved? See tasks marked as Easy or Volunteer needed in Phabricator:
[1] https://phabricator.wikimedia.org/maniphest/query/qW51XhCCd8.7/#R
[2] https://phabricator.wikimedia.org/maniphest/query/5KEPuEJh9TPS/#R

--
Yours,
Chris Koerner
Community Liaison - Discovery
Wikimedia Foundation

Translations requested: "showing results from"
by Deborah Tankersley 16 Sep '16

Hello,

The Discovery team has been making good progress in enabling cross-language search results on several wikis, and now we need help translating a phrase: "showing results from".

We recently deployed [1] a language detection algorithm on the Portuguese and Japanese wikis that detects when a query is typed in a language other than the main language of the wiki. For instance, we're now able to detect the following languages in queries on these wikis:

Portuguese: PT, EN, RU, HE, AR, ZH, KO, EL
Japanese: JA, EN, RU, KO, AR, HE

But we need the system message to be translated: the message that notifies the user that the results displayed are from a different language wiki. Here are working links from PT [2] and JA [3] that show a search example with the results displayed.

Image [4] showing the sample results from an English search typed into the Russian Wikipedia search box.

It would be great if we could get these translations into translatewiki so that the Discovery team can use them via these message keys (and this message group link: https://translatewiki.net/wiki/Special:Translate?group=ext-wikimediainterwikisearchresults):

Portuguese [5]:
search-interwiki-results-enwiki
search-interwiki-results-ruwiki
search-interwiki-results-hewiki
search-interwiki-results-arwiki
search-interwiki-results-zhwiki
search-interwiki-results-kowiki
search-interwiki-results-elwiki

Japanese [6]:
search-interwiki-results-enwiki
search-interwiki-results-ruwiki
search-interwiki-results-kowiki
search-interwiki-results-arwiki
search-interwiki-results-hewiki

Cheers from the Discovery Search Team!

[1] https://phabricator.wikimedia.org/T142413
[2] https://pt.wikipedia.org/w/index.php?search=Washington+Township%2C+Licking+County%2C+Ohio&title=%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:%D0%9F%D0%BE%D0%B8%D1%81%D0%BA&go=%D0%9F%D0%B5%D1%80%D0%B5%D0%B9%D1%82%D0%B8&searchToken=34w86qi6kx0l5ax7jm0ewuuii
[3] https://ja.wikipedia.org/w/index.php?search=Washington+Township%2C+Licking+County%2C+Ohio&title=%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:%D0%9F%D0%BE%D0%B8%D1%81%D0%BA&go=%D0%9F%D0%B5%D1%80%D0%B5%D0%B9%D1%82%D0%B8&searchToken=cbgdevpo338175t32wggwbhqh
[4] https://commons.wikimedia.org/wiki/File:Showing_results_from-russian.png
[5] https://translatewiki.net/w/i.php?title=Special:Translate&group=ext-wikimediainterwikisearchresults&language=pt&filter=&action=translate
[6] https://translatewiki.net/w/i.php?title=Special:Translate&group=ext-wikimediainterwikisearchresults&language=ja&filter=&action=translate

--
Deb Tankersley
Product Manager, Discovery
IRC: debt
Wikimedia Foundation
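
The message keys listed above appear to follow a simple pattern: one key per detected language other than the wiki's own. The Python sketch below reproduces the two lists from the language codes given in the email. The pattern is inferred from this email alone, and the function name is hypothetical; it is not taken from the CirrusSearch code.

```python
def interwiki_message_keys(home_lang, detected_langs):
    """Build one message key per detected language other than the home wiki's.

    Assumes keys look like "search-interwiki-results-<lang>wiki", as in the
    lists in the email above.
    """
    home = home_lang.lower()
    return [
        "search-interwiki-results-{}wiki".format(lang.lower())
        for lang in detected_langs
        if lang.lower() != home
    ]


if __name__ == "__main__":
    portuguese = ["PT", "EN", "RU", "HE", "AR", "ZH", "KO", "EL"]
    japanese = ["JA", "EN", "RU", "KO", "AR", "HE"]
    for key in interwiki_message_keys("pt", portuguese):
        print(key)
    print()
    for key in interwiki_message_keys("ja", japanese):
        print(key)
```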

Elasticsearch sharding
by Guillaume Lederrey 16 Sep '16

Hello all!

We had an interesting discussion yesterday with David about the way we shard our indices on Elasticsearch. Here are a few notes for whoever finds the subject interesting and wants to jump into the discussion:

Context:

We recently activated row-aware shard allocation on our Elasticsearch search clusters. This means that we now have one additional constraint on shard allocation: spread copies of shards across multiple datacenter rows, so that if we lose a full row, we still have a copy of all the data. During an upgrade of Elasticsearch, another constraint comes into play: a shard can move from a node with an older version of Elasticsearch to a node with a newer version, but not the other way around. This led to Elasticsearch struggling to allocate all shards during the recent codfw upgrade to Elasticsearch 2.3.5. While it is not the end of the world (we can still serve traffic if some indices don't have all shards allocated), this is something we need to improve.

Number of shards / number of replicas:

An Elasticsearch index is split at creation into a number of shards. A number of replicas per shard is configured [1]. The total number of shards for an index is "number_of_shards * (number_of_replicas + 1)". Increasing the number of shards per index allows read operations to execute in parallel over the different shards, with the results aggregated at the end, improving response time. Increasing the number of replicas allows the read load to be distributed over more nodes (and provides some redundancy in case we lose one server). As term frequency [2] is calculated over a shard and not over the full index, there is some black magic involved in how we shard our indices, but most of it is documented [3].

The enwiki_content example:

The enwiki_content index is configured to have 6 shards and 3 replicas, for a total of 24 shards. It also has the additional constraint that there is at most 1 enwiki_content shard per node. This ensures a maximum spread of enwiki_content shards over the cluster. Since enwiki_content is one of the indices with the most traffic, this ensures that the load is well distributed over the cluster.

Now the bad news: for codfw, which is a 24-node cluster, reaching this perfect equilibrium of 1 shard per node is a serious challenge once you take into account the other constraints in place. Even when relaxing the constraint to 2 enwiki shards per node, we have seen unassigned shards during the Elasticsearch upgrade.

Potential improvements:

While ensuring that a large index has a number of shards close to the number of nodes in the cluster allows for optimally spreading load over the cluster, it degrades fast if all the stars are not perfectly aligned. There are 2 opposite solutions:

1) decrease the number of shards to leave some room to move them around
2) increase the number of shards and allow multiple shards of the same index to be allocated on the same node

1) is probably impractical on our large indices: enwiki_content shards are already ~30 GB, and making them larger would make it impractical to move them around during relocation and recovery.

2) is probably our best bet. More, smaller shards mean that a single query's load will be spread over more nodes, potentially improving response time. Increasing the number of shards for enwiki_content from 6 to 20 (total shards = 80) means we have 80 / 24 = 3.3 shards per node. Removing the 1-shard-per-node constraint and letting Elasticsearch spread the shards as best it can means that if 1 node is missing, or during an upgrade, we still have the ability to move shards around. Increasing this number even more might help keep the load evenly spread across the cluster (the difference between 8 or 9 shards per node is smaller than the difference between 3 or 4 shards per node).

David is going to do some tests to validate that those smaller shards don't impact the scoring (smaller shards mean worse frequency analysis).

I probably forgot a few points, but this email is more than long enough already... Thanks to all of you who kept reading until the end!

MrG

[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/_basic_conc…
[2] https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.…
[3] https://wikitech.wikimedia.org/wiki/Search#Estimating_the_number_of_shards_…

--
Guillaume Lederrey
Operations Engineer, Discovery
Wikimedia Foundation
UTC+2 / CEST
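
The shard arithmetic in this email can be checked with a few lines of Python. The sketch below only restates the formula quoted above and the enwiki_content numbers. The function names are made up for this example, and the 200-shard figure is a hypothetical used to reach the "8 or 9 shards per node" case.

```python
def total_shards(number_of_shards, number_of_replicas):
    """Total shard copies for one index: primaries plus all their replicas."""
    return number_of_shards * (number_of_replicas + 1)


def shards_per_node(number_of_shards, number_of_replicas, nodes):
    """Average number of shard copies each node holds if spread evenly."""
    return total_shards(number_of_shards, number_of_replicas) / nodes


def worst_case_extra_load(total, nodes):
    """Relative gap between the fullest and emptiest node in an even spread.

    With 80 shards on 24 nodes, some nodes hold 3 shards and some hold 4,
    so the fullest node carries 4/3 - 1 = 33% more than the emptiest one.
    """
    low = total // nodes
    high = -(-total // nodes)  # ceiling division on integers
    if low == 0:
        raise ValueError("fewer shards than nodes: some nodes hold nothing")
    return high / low - 1


if __name__ == "__main__":
    nodes = 24  # size of the codfw cluster, per the email
    print(total_shards(6, 3))                         # current enwiki_content
    print(total_shards(20, 3))                        # proposed
    print(round(shards_per_node(20, 3, nodes), 1))    # average per node
    print(round(worst_case_extra_load(80, nodes), 3))   # 3 vs 4 per node
    print(round(worst_case_extra_load(200, nodes), 3))  # 8 vs 9 (hypothetical)
```

The last two lines illustrate the closing point of the email: with more shards per node, one extra shard is a smaller fraction of a node's load.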

Weekly update for the week starting 2016-09-05
by Deborah Tankersley 12 Sep '16

Hello,

Here is the week's update from the Discovery department - enjoy the read and your weekend!

== Discussions ==
* Trey completed the analysis for optimizing language identification for the Dutch Wikipedia (nlwiki). The results were good (F0.5 = 82.3%) but not great. The small proportions of queries in the Romance languages and in German led to many more false positives than true positives, so those languages had to be excluded. Future work on improving confidence may help. [1]
* We could use help translating (via translatewiki) the relevant "showing results from" messages into Dutch. We'll need English, Chinese, Arabic, Korean, Greek, Hebrew, Japanese, and Russian translations. [2]
* The Analysis team had a discussion on how to use better wording for phrases like "users were 1.07 times more likely to do X" and decided on using phrases similar to "we can expect 2-9 more sessions to click on a search result when they have the new feature" [3]
* The Search team wrapped up research into the Elasticsearch instabilities on the eqiad search cluster that occurred on Aug 6, 2016; nothing conclusive was found. [4]

== Events and News ==

=== Interactive ===
* <maplink> has been enabled on all wikis (announced via email to wikitech-l) [5]
* Geoshapes data service is now integrated into all maps [6]

=== Search ===
* Turned off BM25 A/B test, awaiting analysis [7]
* Pushed into production a change that implemented ascii-folding for French [8]
* Improved balance of nodes across rows for the Elasticsearch eqiad cluster [9]

=== Portal ===
* Currently blocked on this check-in to gerrit [10]

== Other Noteworthy Stuff ==
* Our Elasticsearch clusters now have "row aware shard allocation". This means that we can theoretically lose one row of servers in our datacenter and still serve search traffic. [11]
* The Search team sent out a request for comment article that was posted to various Village Pumps asking for it to be translated. [12]
** This was in reference to the new cross-wiki search results functionality and design articles on MediaWiki. [13], [14]

== Did you know? ==
* A study came out yesterday showing that giraffes are actually four distinct species, rather than one (article and BBC report). [15], [16]
** Of course, the English and German Wikipedia pages on giraffes have already been updated! [17], [18]

[1] https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/TextCat_Optimization…
[2] https://phabricator.wikimedia.org/T143354
[3] https://phabricator.wikimedia.org/T140187
[4] https://phabricator.wikimedia.org/T142506
[5] https://lists.wikimedia.org/pipermail/wikitech-l/2016-September/086490.html
[6] https://www.mediawiki.org/wiki/Help:Extension:Kartographer#GeoShapes_extern…
[7] https://phabricator.wikimedia.org/T143588
[8] https://phabricator.wikimedia.org/T144429
[9] https://phabricator.wikimedia.org/T143685
[10] https://gerrit.wikimedia.org/r/#/c/306241/
[11] https://phabricator.wikimedia.org/T143571
[12] https://meta.wikimedia.org/wiki/User:DTankersley_(WMF)/translation_request_…
[13] https://www.mediawiki.org/wiki/Cross-wiki_Search_Result_Improvements
[14] https://www.mediawiki.org/wiki/Cross-wiki_Search_Result_Improvements/Design
[15] http://www.cell.com/current-biology/fulltext/S0960-9822(16)30787-4
[16] http://www.bbc.com/news/science-environment-37311716
[17] https://en.wikipedia.org/wiki/Giraffe
[18] https://de.wikipedia.org/wiki/Giraffe

----

The full update, and archive of past updates, can be found on MediaWiki.org:
https://www.mediawiki.org/wiki/Discovery/Status_updates

Interested in getting involved? See tasks marked as Easy or Volunteer needed in Phabricator:
[1] https://phabricator.wikimedia.org/maniphest/query/qW51XhCCd8.7/#R
[2] https://phabricator.wikimedia.org/maniphest/query/5KEPuEJh9TPS/#R

Cheers!

--
Deb Tankersley
Product Manager, Discovery
IRC: debt
Wikimedia Foundation

Discernatron lunch - 12pm (SF time) Tue 13th Sep
by Dan Garry 10 Sep '16

The Search Team in Discovery needs your help!

Discernatron [1] is a search relevance tool developed by the Discovery department. Its goal is to help improve search relevance (showing the articles that are most relevant to search queries) with human assistance. We need your help grading search results!

Join us for lunch at 12pm (SF time) in the 5th floor lounge on Tuesday 13th September!

In the Discernatron lunch, we'll give a brief overview of what the Discernatron is, then ask people to start rating queries, so bring your laptops! We're hoping a limited amount of food will be provided for the event, but you can only eat it if you agree to rate queries for us. ;-)

We'll also be set up for remote participation on Hangouts and IRC, and the session will be recorded.

Hangout: https://hangouts.google.com/hangouts/_/7pcv3gtfcbczzhaxbezyqhedfee
YouTube stream: https://www.youtube.com/watch?v=q4W9t6IcjWk

Thanks! If there are any questions, let me know!

Dan

[1]: https://www.mediawiki.org/wiki/Discernatron

--
Dan Garry
Lead Product Manager, Discovery
Wikimedia Foundation

Multimedia search on Commons
by Pine W 07 Sep '16

Hi Discovery, I'm wondering if there are any significant improvements coming in the next +/- 18 months for multimedia search on Commons. Finding images can be a very time-consuming job for Wikimedians and other users of the site. For example, it would be nice to be able to do a "join" search with categories, so that only media files that appear in 2+ selected categories are shown in search results. As an example, suppose that I want an image of a blue tile roof in China. The search "blue tile roof China" doesn't show any results that interest me on the first page. However, https://commons.wikimedia.org/wiki/File:Wuhan_University_-_roof_tiles.JPG is an image that would interest me. That file is several layers deep in subcategories under the China category. It would be nice to be able to find that image by searching a join of the China category and its subcategories, with the "blue roof" category. Thanks, Pine
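
To make the requested "join" concrete, here is a small Python sketch of intersecting two category trees. It is purely illustrative: it works on in-memory dictionaries rather than on Commons itself, the category names and the file-to-category assignments are invented for the example, and the function names are hypothetical.

```python
def category_tree(root, subcategories):
    """Return the root category plus all of its descendants."""
    seen = set()
    stack = [root]
    while stack:
        category = stack.pop()
        if category in seen:
            continue  # guards against cycles in the category graph
        seen.add(category)
        stack.extend(subcategories.get(category, ()))
    return seen


def files_in_tree(root, subcategories, files_by_category):
    """Collect the files found in a category or any of its subcategories."""
    files = set()
    for category in category_tree(root, subcategories):
        files.update(files_by_category.get(category, ()))
    return files


def join_search(root_a, root_b, subcategories, files_by_category):
    """Files that appear under both category trees."""
    return (files_in_tree(root_a, subcategories, files_by_category)
            & files_in_tree(root_b, subcategories, files_by_category))


if __name__ == "__main__":
    # Invented category structure, for illustration only.
    subcategories = {
        "China": ["Hubei"],
        "Hubei": ["Wuhan"],
        "Wuhan": ["Wuhan University"],
    }
    files_by_category = {
        "Wuhan University": ["Wuhan_University_-_roof_tiles.JPG"],
        "Blue roofs": ["Wuhan_University_-_roof_tiles.JPG",
                       "Some_other_blue_roof.jpg"],
    }
    print(join_search("China", "Blue roofs", subcategories, files_by_category))
```

On a real wiki the expensive part is the subcategory traversal, since deep trees can contain very many categories, so a production version would likely need a depth limit or a precomputed index.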

New functionality: cross-wiki search results
by Deborah Tankersley 07 Sep '16

The Discovery Search Team wants to enable search results on Wikipedia that will include articles gathered across all sister wiki projects, within the same language, but we need your feedback.

Please read the specifics of how this new functionality [1] might work and add comments, concerns, or alternative ideas for design options [2] on the talk pages. See an image [3] that shows one of many example display options that have been mocked up, after considering what other wiki communities have done.

Thank you for your time and cheers from the Search Team!

[1] https://www.mediawiki.org/wiki/Cross-wiki_Search_Result_Improvements
[2] https://www.mediawiki.org/wiki/Cross-wiki_Search_Result_Improvements/Design
[3] https://www.mediawiki.org/wiki/File:Search_results_page-enwiki_right-hand-box-general-projects.png

--
Deb Tankersley
Product Manager, Discovery
IRC: debt
Wikimedia Foundation
