Nice demo, Stas! :)
--
deb tankersley
irc: debt
Product Manager, Discovery
Wikimedia Foundation
---------- Forwarded message ----------
From: Adam Baso <abaso(a)wikimedia.org>
Date: Wed, Aug 2, 2017 at 12:10 PM
Subject: [Wikitech-l] Today's CREDIT demo - Wikidata Query Service update,
including on federation
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
Hi all - just one demo today, but as always it's a treat to see Stas's
updates on Wikidata Query Service (WDQS).
Enjoy!
https://www.youtube.com/watch?v=zfjY9JU0NR0
https://www.mediawiki.org/wiki/Wikidata_query_service/User_Manual
-Adam
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Thanks to Erica Litrenta for sharing this with me. I thought I'd share it
forward.
"It was because of the letter K that I found my younger sister, but for 14
years, it was also the letter K that kept us apart."
https://www.wired.com/story/search-algorithms-kept-me-from-my-sister-for-14-years
Yours,
Chris Koerner
Community Liaison
Wikimedia Foundation
Hello, This is the Discovery update for last week. Apologies for the
delay in getting it out.
== Discussions ==
=== Search ===
* Created a method for the Kafka consumer to take 'learning to rank'
queries from a queue and run them against Elasticsearch to generate
relevance labels (see the sketch after this list) [0]
* Added the ability to use Kafka in our LTRank feature generation
queries, pushing them into Elasticsearch for analysis [1]
* Added the ability to extract TF- and IDF-based features in the
Elasticsearch 'learning to rank' plugin [2]
* The A/B test for 'explore similar' links is still in progress, but
we're running into a few bugs that will be sorted out next week [3]
* Fixed a bug where searching with phrase queries did not highlight
page content [4]
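For anyone curious what the Kafka-to-Elasticsearch piece above looks
like in practice, here is a minimal sketch in Python. It is not the
actual code; the topic name, index name and host are made up for
illustration, and it assumes the kafka-python and elasticsearch-py
(5.x style) packages.

    import json

    from kafka import KafkaConsumer          # pip install kafka-python
    from elasticsearch import Elasticsearch  # pip install elasticsearch

    es = Elasticsearch(["http://localhost:9200"])  # hypothetical host
    consumer = KafkaConsumer(
        "ltr_queries",                        # hypothetical topic name
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    )

    # Pull query strings off the queue, run them against the search
    # index, and keep the returned page IDs so they can later be joined
    # with click data to produce relevance labels.
    for message in consumer:
        query = message.value["query"]
        result = es.search(
            index="enwiki_content",           # hypothetical index name
            body={"query": {"match": {"text": query}}, "size": 20},
        )
        print(query, [hit["_id"] for hit in result["hits"]["hits"]])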
=== Analysis ===
* Fixed a bug with the sister project snippets and eventlogging [5]
* Finished up analysis for determining what is a reasonable per-IP
ratelimit for maps [6]
* Fixed a minor dashboard bug (splines) [7]
[0] https://phabricator.wikimedia.org/T162059
[1] https://phabricator.wikimedia.org/T162072
[2] https://phabricator.wikimedia.org/T167437
[3] https://phabricator.wikimedia.org/T164856
[4] https://phabricator.wikimedia.org/T167798
[5] https://phabricator.wikimedia.org/T168916
[6] https://phabricator.wikimedia.org/T169175
[7] https://phabricator.wikimedia.org/T169125
Yours,
Chris Koerner
Community Liaison
Wikimedia Foundation
Very interesting!
10. jul. 2017 16:34 skrev "Chris Koerner" <ckoerner(a)wikimedia.org>:
Thanks to Erica Litrenta for sharing this with me. I thought I'd share it
forward.
"It was because of the letter K that I found my younger sister, but for 14
years, it was also the letter K that kept us apart."
https://www.wired.com/story/search-algorithms-kept-me-from-my-sister-for-14-years
Yours,
Chris Koerner
Community Liaison
Wikimedia Foundation
_______________________________________________
discovery mailing list
discovery(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/discovery
Hello!
We've had a significant slowdown of elasticsearch today (see Grafana
for exact timing [1]). The impact was low enough that it probably does
not require a full incident report (the number of errors did not rise
significantly [2]), but understanding what happened and sharing that
understanding is important. This is going to be a long and technical
email; if you get bored, feel free to close it and delete it right
now.
TL;DR: elastic1019 was overloaded because it hosted too many heavy
shards; banning all shards from elastic1019 so the cluster could
reshuffle them allowed it to recover.
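(For the curious, here is a minimal sketch of what "banning" a node
looks like using the stock cluster-level allocation filtering API, via
the Python client. This is the generic Elasticsearch mechanism, not
necessarily the exact tooling we used.)

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])  # hypothetical host

    # Exclude elastic1019 from shard allocation: the master moves all
    # of its shards onto the remaining nodes. Clearing the setting
    # later lets shards be allocated back to it.
    es.cluster.put_settings(body={
        "transient": {
            "cluster.routing.allocation.exclude._name": "elastic1019"
        }
    })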
In more detail:
elastic1019 was hosting shards for commonswiki, enwiki and frwiki,
which are all high-load shards. elastic1019 is one of our older
servers, which are less powerful, and might also suffer from CPU
overheating [3].
The obvious question: "why do we even allow multiple heavy shards to
be allocated on the same node?". The answer is obvious as well: "it's
complicated...".
One of the very interesting features of elasticsearch is its ability
to automatically balance shards. This allows the cluster to rebalance
automatically when nodes are lost, and to spread resource usage across
all nodes in the cluster [4]. Constraints can be added to account for
available disk space [5], rack awareness [6], or even to apply
specific filtering to specific indices [7]. It does not, however,
directly allow constraining allocation based on the load of a specific
shard.
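To make those constraints concrete, here is a rough sketch through the
Python client. The setting names are the stock Elasticsearch ones
linked above; the values, the "row" awareness attribute, and the node
and index names are made up for illustration.

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])  # hypothetical host

    # Disk-based thresholds [5] and rack/row awareness [6] are
    # cluster-wide settings.
    es.cluster.put_settings(body={
        "persistent": {
            "cluster.routing.allocation.disk.watermark.low": "85%",
            "cluster.routing.allocation.disk.watermark.high": "90%",
            "cluster.routing.allocation.awareness.attributes": "row",
        }
    })

    # Index-level allocation filtering [7]: pin one index to a subset
    # of nodes.
    es.indices.put_settings(index="some_index", body={
        "index.routing.allocation.include._name": "elastic1020,elastic1021"
    })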
We do have a few mechanisms to ensure that load is as uniform as
possible across the cluster:
An index is split into multiple shards, and each shard is replicated
multiple times to provide redundancy and to spread load. Both are
configured per index.
We know which indices are heavy (commons, enwiki, frwiki, ...), both
in terms of size and in terms of traffic. Those indices are split into
a number of shards + replicas close to the number of nodes in the
cluster, so that the shards are spread evenly across the cluster, with
only a few shards of the same index on the same node, while still
allowing us to lose a few nodes and keep all shards allocated. For
example, enwiki_content has 8 shards with 2 replicas each, so 24
shards in total, with a maximum of 2 shards on the same node. This
approach works well most of the time.
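Per index, that boils down to settings along these lines (a sketch
only; the example index name is made up and the real values are
managed by our configuration, but the setting names are the stock
Elasticsearch ones):

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])  # hypothetical host

    # 8 primaries, each with 2 replicas: 24 shard copies in total, and
    # at most 2 copies of this index allowed on any single node.
    es.indices.create(index="enwiki_content_example", body={
        "settings": {
            "index.number_of_shards": 8,
            "index.number_of_replicas": 2,
            "index.routing.allocation.total_shards_per_node": 2,
        }
    })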
The limitation is that a shard is the unit of scalability: you can't
move around anything smaller than a shard. In the case of enwiki, a
single shard is ~40GB and serves a fairly large number of requests per
second. If a node holds just one more of those shards, that's already
a significant amount of additional load.
The solution could be to split large indices into many more shards;
the scalability unit would be much smaller, and it would be much
easier to achieve uniform load. Of course, there are also limitations.
The total number of shards in the cluster has a significant cost:
increasing it adds load to cluster operations (which are already quite
expensive with the total number of shards we have at this point).
There are also functional issues: ranking (BM25) uses statistics
calculated per shard, and with smaller shards the stats might at some
point no longer be representative of the whole corpus.
There are probably a lot more details we could get into; feel free to
ask more questions and we can continue the conversation. And I'm sure
David and Erik have a lot to add!
Thanks for reading to the end!
Guillaume
[1] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?orgId=…
[2] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?orgId=…
[3] https://phabricator.wikimedia.org/T168816
[4] https://www.elastic.co/guide/en/elasticsearch/reference/current/shards-allo…
[5] https://www.elastic.co/guide/en/elasticsearch/reference/current/disk-alloca…
[6] https://www.elastic.co/guide/en/elasticsearch/reference/current/allocation-…
[7] https://www.elastic.co/guide/en/elasticsearch/reference/current/shard-alloc…
--
Guillaume Lederrey
Operations Engineer, Discovery
Wikimedia Foundation
UTC+2 / CEST