Howdy,
Happy to report that production[1] and development[2] sets of Discovery
Dashboards are up and running again, this time managed by Puppet. (There
was a bug with web proxies and DNS settings that delayed this
announcement.) Theoretically they should be snappier to use now because
there is no longer an extra virtualization (Vagrant) layer and they are
running directly on Labs instances.
R is a programming language and software environment mainly used for
statistical inference, machine learning, and data wrangling & visualization. RStudio's
Shiny[3] is a framework for developing web applications in R, and it's what
Discovery's dashboards are written in.
The Reading::Discovery::Analysis team (with guidance and help from
Guillaume Lederrey) is proud to announce a new module available in Ops'
Puppet repo: shiny_server[4], which installs & configures RStudio's Shiny
Server[5] for serving R/Shiny applications. The module also provides
resources for installing R packages from CRAN, GitHub, and other remote git
repositories like Gerrit. For a practical example, refer to Discovery
Dashboards base[6] and production[7] profiles.
Cheers,
Mikhail on behalf of Discovery Analysts
[1] https://discovery.wmflabs.org
[2] https://discovery-beta.wmflabs.org/
[3] https://shiny.rstudio.com/
[4] https://github.com/wikimedia/puppet/tree/production/modules/shiny_server
[5] https://www.rstudio.com/products/shiny/shiny-server/
[6]
https://github.com/wikimedia/puppet/blob/production/modules/profile/manifes…
[7]
https://github.com/wikimedia/puppet/blob/production/modules/profile/manifes…
Hello!
We've had a significant slowdown of elasticsearch today (see Grafana
for exact timing [1]). The impact was low enough that it probably does
not require a full incident report (the number of errors did not rise
significantly [2]), but understanding what happened and sharing that
understanding is important. This is going to be a long and technical
email; if you get bored, feel free to close it and delete it right
now.
TL;DR: elastic1019 was overloaded because it had too many heavy
shards; banning all shards from elastic1019 to force a reshuffle
allowed it to recover.
In more detail:
elastic1019 was hosting shards for commonswiki, enwiki and frwiki,
which are all high-load shards. elastic1019 is one of our older
servers, which are less powerful, and it might also suffer from CPU
overheating [3].
The obvious question: "why do we even allow multiple heavy shards to
be allocated on the same node?". The answer is obvious as well: "it's
complicated...".
One of the very interesting features of elasticsearch is its ability to
automatically balance shards. This allows the cluster to rebalance
automatically when nodes are lost, and to spread resource usage across
all nodes in the cluster [4]. Constraints can be added to account for
available disk space [5], rack awareness [6], or even specific
filtering for specific indices [7]. It does not, however, directly
allow constraining allocation based on the load of a specific shard.
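As a concrete illustration, the "ban" from the TL;DR above is just a cluster-level allocation filter. A minimal sketch follows; the endpoint and the commented-out curl call are illustrative, but the setting name is the standard Elasticsearch one:

```shell
# Transient cluster setting that disallows shard allocation on
# elastic1019; Elasticsearch reacts by moving its shards to other nodes.
ES="${ES:-http://localhost:9200}"   # point this at your own cluster
BODY='{
  "transient": {
    "cluster.routing.allocation.exclude._name": "elastic1019"
  }
}'
echo "$BODY"
# To actually apply it:
# curl -H 'Content-Type: application/json' -XPUT "$ES/_cluster/settings" -d "$BODY"
```

Resetting the exclude list to an empty value later lets shards be allocated on the node again once it has recovered.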
We do have a few mechanisms to ensure that load is as uniform as
possible across the cluster:
An index is split into multiple shards, and each shard is replicated
multiple times to provide redundancy and to spread load. Both numbers
are configured per index.
We know which indices are heavy (commons, enwiki, frwiki, ...),
both in terms of size and in terms of traffic. Those indices are split
into a number of shards+replicas close to the number of nodes in the
cluster, to ensure that the shards are spread evenly across the
cluster, with only a few shards of the same index on the same node,
while still allowing us to lose a few nodes and keep all shards
allocated. For example, enwiki_content has 8 shards, with 2 replicas
each, so a total of 24 shards, with a maximum of 2 shards on the same
node. This approach works well most of the time.
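The enwiki_content arithmetic above can be sketched as follows (numbers taken from the example; the "minimum nodes" line only shows why the shard count is chosen close to the cluster size):

```shell
# enwiki_content: 8 primary shards, each with 2 replicas.
primaries=8
replicas=2
total=$((primaries * (1 + replicas)))   # every primary plus its copies
echo "total shard copies: $total"       # 24
# With at most 2 copies of this index per node, we need at least:
min_nodes=$(( (total + 1) / 2 ))        # ceiling of total / 2
echo "minimum nodes needed: $min_nodes" # 12
```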
The limitation is that a shard is a "scalability unit": you can't move
around anything smaller than a shard. In the case of enwiki, a single
shard is ~40 GB and serves a fairly large number of requests per
second. If a node has just one more of those shards, that's already a
significant amount of additional load.
The solution could be to split large indices into many more shards;
the scalability unit would be much smaller, and it would be much
easier to achieve a uniform load. Of course, there are also
limitations. The total number of shards in the cluster has a
significant cost: increasing it will add load to cluster operations
(which are already quite expensive with the total number of shards we
have at this point). There are also functional issues: ranking (BM25)
uses statistics calculated per shard, and with smaller shards the
stats might at some point no longer be representative of the whole
corpus.
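To make that last point concrete with toy, made-up numbers: Lucene's BM25 computes idf per shard as log(1 + (N - df + 0.5) / (df + 0.5)), where N is the shard's document count and df the number of its documents containing the term. A small shard that happens to concentrate documents containing a term scores it very differently from the corpus as a whole:

```shell
# Corpus-wide: term appears in 50 of 1000 docs. One tiny shard: 5 of 10.
awk 'BEGIN {
  corpus = log((1000 - 50 + 0.5) / (50 + 0.5) + 1)  # idf over the corpus
  shard  = log((10 - 5 + 0.5) / (5 + 0.5) + 1)      # idf on the small shard
  printf "corpus idf: %.3f\n", corpus   # ~2.987
  printf "shard idf:  %.3f\n", shard    # ~0.693
}'
```

The same term looks roughly four times less "rare" on the small shard, so documents from that shard are ranked on different statistics.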
There are probably a lot more details we could get into; feel free to
ask more questions and we can continue the conversation. And I'm sure
David and Erik have a lot to add!
Thanks for reading to the end!
Guillaume
[1] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?orgId=…
[2] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?orgId=…
[3] https://phabricator.wikimedia.org/T168816
[4] https://www.elastic.co/guide/en/elasticsearch/reference/current/shards-allo…
[5] https://www.elastic.co/guide/en/elasticsearch/reference/current/disk-alloca…
[6] https://www.elastic.co/guide/en/elasticsearch/reference/current/allocation-…
[7] https://www.elastic.co/guide/en/elasticsearch/reference/current/shard-alloc…
--
Guillaume Lederrey
Operations Engineer, Discovery
Wikimedia Foundation
UTC+2 / CEST
Hi!
> But, no results for Wikidata, the site that covers more topics than all our
> other sites?
Wikidata search, I think, may not be ready for this yet. It's
way more complicated than regular wiki search because it's a)
multilingual and b) data rather than text. We're working on it though :)
--
Stas Malyshev
smalyshev(a)wikimedia.org
Cross-posting this to the Discovery mailing list with hopes that someone
from WMF Discovery can shed some light on this situation.
Pine
On Mon, May 15, 2017 at 2:08 PM, Tom <tom(a)hutch4.us> wrote:
> I actually think there is a drop in page content results too. Searching,
> for example, for pages using a tag <FooBar>text</FooBar> would report
> content found in x pages. Now a search for <FooBar> finds no content in
> pages. A search for <FooBar no > is found on 3 pages, but I expect 50.
>
> I do want to do more testing. Rebuilding the index seems to be super fast,
> unlike before, when it would take up to a few minutes to complete.
>
> Tom
>
> > On May 15, 2017, at 10:02 AM, [[kgh]] <mediawiki(a)kghoffmeyer.de> wrote:
> >
> > Heiya,
> >
> > it's me again. :) Does somebody at least see the issue? Probably a bug
> > that should be reported?
> >
> > Thanks and cheers
> >
> > Karsten
> >
> >
> >> Am 09.05.2017 um 16:32 schrieb [[kgh]]:
> >> Heiya,
> >>
> >> I have upgraded from 1.23 to 1.27 which was now possible since the
> >> latest release.
> >>
> >> After the process I observe a changed behavior regarding the rudimentary
> >> full-text search MediaWiki provides out of the box, i.e. I am not
> >> talking about the Cirrus/Elastica duo available as an extra.
> >>
> >> When adding a search term to the search field on MW 1.27, e.g.
> >> "Lorem ipsum" (note: including the "), then only the page names of the
> >> findings are shown, and not the page names plus some text extract
> >> wrapping the searched term, as MW 1.23 did. When adding just Lorem ipsum
> >> (note: excluding the ") I get the page names and some text extract
> >> wrapping the searched term, as I did with 1.23. The results for Lorem
> >> ipsum, however, are a much worse fit than for "Lorem ipsum", so that's
> >> why I am here.
> >>
> >> Perhaps I missed some setting I now have to make or perhaps there is
> >> some script I overlooked to get things running. I'd like to get the
> >> wrapping text back. Pointers highly appreciated.
> >>
> >> Thanks for your time
> >>
> >> Karsten
> >>
> >>
> >> _______________________________________________
> >> MediaWiki-l mailing list
> >> To unsubscribe, go to:
> >> https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
> >
> >
>
>
>
Hi everybody,
(With apologies for cross-posting...)
You may have seen the recent communication [1] about the product and
tech tune-up which went live the week of June 5th, 2017. In that
communication, we promised an update on the future of Discovery
projects; this email provides that update.
The Discovery team structure has now changed, but the new teams will still
work together to complete the goals as listed in the draft annual plan.[2]
A summary of their anticipated work, as we finalize these changes, is
below. We plan on doing a check-in at the end of the calendar year to see
how our goals are progressing with the new smaller and separated team
structure.
Here is a list of the various projects under the Discovery umbrella, along
with the goals that they will be working on:
Search Backend
Improve search capabilities:
- Implement ‘learning to rank’ [3] and other advanced machine learning
methodologies
- Improve support for languages using new analyzers
- Maintain and expand power user search functionality
Search Frontend
Improve user interface of the search results page with new functionality:
- Implement explore similar [4]
- Update the completion suggester box [5]
- Investigate the usage of a Wiktionary widget for English Wikipedia [6]
Wikidata Query Service
Expand and scale:
- Improve ability to support power features on-wiki for readers
- Improve full text search functionality
- Implement SPARQL federation support
Portal
Create and implement automated language statistics and translation updates
for Wikipedia.org
Analysis
Provide in-depth analytics support:
- Perform experimental design, data collection, and data analysis
- Perform ad-hoc analyses of Discovery-domain data
- Maintain and augment the Discovery Dashboards,[7] which allow the teams
to track their KPIs and other metrics
Maps
Map support:
- Implement new map style
- Increase frequency of OSM data replication
- As needed, assist with individual language Wikipedias' implementation of
mapframe [8]
Note: There is a possibility that we can do more with maps in the coming
year; we are currently evaluating strategic, partnership, and resourcing
options.
Structured Data on Commons
Extend structured data search on Commons, as part of the structured data
grant [9] via:
- Research and implement advanced search capabilities
- Implement new elements, filters, and relationships
Graphs and Tabular Data on Commons
We will be re-evaluating this functionality against other Commons
initiatives such as the structured data grant. As with maps, we will
provide updates when we know more.
We are still working out all the details of the new team structure and
there might be some turbulence; let us know if you have any concerns
and we will do our best to address them.
Best regards,
Deborah Tankersley, Product Manager, Discovery
Erika Bjune, Engineering Manager, Search Platform
Jon Katz, Reading Product Lead
Toby Negrin, Interim Vice President of Product
Victoria Coleman, Chief Technology Officer
[1] https://www.mediawiki.org/wiki/Wikimedia_Engineering/June_2017_changes
[2]
https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2017-2018/…
[3] https://en.wikipedia.org/wiki/Learning_to_rank
[4]
https://www.mediawiki.org/wiki/Cross-wiki_Search_Result_Improvements/Testin…
[5]
https://www.mediawiki.org/wiki/Extension:CirrusSearch/CompletionSuggester
[6]
https://www.mediawiki.org/wiki/Cross-wiki_Search_Result_Improvements/Testin…
[7] https://discovery.wmflabs.org/
[8] https://www.mediawiki.org/wiki/Maps/how_to:_embedded_maps
[9] https://commons.wikimedia.org/wiki/Commons:Structured_data