Discovery April 2017

discovery@lists.wikimedia.org

15 participants
9 discussions

New map style preview
by Paul Norman 25 Apr '17

As part of https://phabricator.wikimedia.org/T153282 a new style for Wikimedia maps is being developed, and I've loaded up the whole planet on one of my test servers as a test and demo.

The demo is available at http://legolas.paulnorman.ca:6789/, and through "Compare" on the right-hand side of the interface you can compare it with the current Wikimedia style, OpenStreetMap Carto, and lots of others.

Some other things to be aware of when comparing are:

- The map is displayed with Kosmtik, a design tool with minimal caching, and it might be restarted while I'm working on it
- Even though the server is faster than production, it may appear slower because it doesn't have everything cached
- The OSM data on the server is normally within a day of the latest data

Some of the more noticeable style changes are:

- Road colours are different, helping view the overall layout of the city
- There are fewer cases of subtly different shades of green
- Bridges and multi-level road constructions are now handled properly, which should make some areas easier to figure out

I am particularly interested in feedback on:

- the overall colour darkness and intensity
- which of city, region, and country labels are most important: https://phabricator.wikimedia.org/T163503

Feedback is welcome, either through email, phab tickets, or by IRC in #wikimedia-interactive on freenode.

Discovery Weekly Update for the week starting 2017-04-10
by Chris Koerner 21 Apr '17

Hello,

This is the weekly update for the week starting 2017-04-10, going through the week of 2017-04-21.

== Highlights ==

* A blog post about Discovery's Search team was released, along with a Twitter and Facebook post, detailing what we've done over the last several months and what we have coming up next. Numerous tests have been done with real users to get their feedback on the sister project snippets and the upcoming explore similar feature. [0] [1] [2]

== Discussions ==

=== Search ===

* Finished a write-up based on a discussion Trey & David had about the math of scoring functions. Use your hyperoperations, kids! [3]

=== Analysis ===

* Fixed a bug on the portal dashboard, and updated the search dashboard with an option to view how the relative share of each search engine changes over time [4] [5]
* Updated the external search dashboard to display non-bot traffic [6] [7]
* Updated the WDQS dashboard to include the SPARQL endpoints [8]

=== Portal ===

* A scap3 bug was fixed (task T161832) that had been blocking portal deployments, and thus we were able to deploy several updates for RTL display [9] [10] [11]
* Portal statistics and translations were updated [12] [13]
* Multiple other minor enhancements were made to the CSS and colors of the portal page

[0] https://blog.wikimedia.org/2017/04/10/searching-wikipedia/
[1] https://twitter.com/Wikimedia/status/851506674200346624
[2] https://www.facebook.com/wikipedia/posts/10155070205158346
[3] https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Some_Thoughts_on_the…
[4] https://phabricator.wikimedia.org/T161806
[5] https://phabricator.wikimedia.org/T161771
[6] https://discovery.wmflabs.org/external/
[7] https://phabricator.wikimedia.org/T161932
[8] https://discovery.wmflabs.org/wdqs/
[9] https://phabricator.wikimedia.org/T122053
[10] https://phabricator.wikimedia.org/T160429
[11] https://phabricator.wikimedia.org/T160002
[12] https://phabricator.wikimedia.org/T128546
[13] https://phabricator.wikimedia.org/T142582

---

The archive of all past updates can be found on MediaWiki.org: https://www.mediawiki.org/wiki/Discovery/Status_updates

Interested in getting involved? See tasks marked as "Easy" or "Volunteer needed" in Phabricator.

[1] https://phabricator.wikimedia.org/maniphest/query/qW51XhCCd8.7/#R
[2] https://phabricator.wikimedia.org/maniphest/query/5KEPuEJh9TPS/#R

Yours,
Chris Koerner
Community Liaison - Discovery
Wikimedia Foundation

Search engine rankings for pages from Wikipedia and sister sites
by Pine W 20 Apr '17

Hi Discovery,

Over the past few years, my anecdotal impression is that search results from Wikipedia have become less and less prominent when I use major web search engines. I'm aware that Discovery is working on internal search features including cross-project search, and that WMF people working on readership are trying to increase the dwell time of Wikipedia visitors and the number of pages they view.

Has anyone analyzed trends for web search engine rankings of Wikipedia articles, particularly over the last few years? Also, is anyone analyzing what would be required to increase the rankings of Wikipedia articles (and information from sister sites, such as Wikisource and Commons) when people use web search engines?

Thanks,
Pine

Fwd: [Wiki-research-l] Project exploring automated classification of article importance
by David Causse 19 Apr '17

Forwarding to the discovery mailing list, as the outcome of this research might be extremely valuable for search.

---------- Forwarded message ----------
From: Morten Wang <nettrom(a)gmail.com>
Date: Wed, Apr 19, 2017 at 1:17 AM
Subject: [Wiki-research-l] Project exploring automated classification of article importance
To: Research into Wikimedia content and communities <wiki-research-l(a)lists.wikimedia.org>

Hello everyone,

I am currently working with Aaron Halfaker and Dario Taraborelli at the Wikimedia Foundation on a project exploring automated classification of article importance. Our goal is to characterize the importance of an article within a given context and design a system to predict a relative importance rank. We have a project page on meta[1] and welcome comments or thoughts on our talk page. You can of course also respond here on wiki-research-l, or send me an email.

Before moving on to model-building I did a fairly thorough literature review, finding a myriad of papers spanning several disciplines. We have a draft literature review also up on meta[2], which should give you a reasonable introduction to the topic. Again, comments or thoughts (e.g. papers we’ve missed) on the talk page, mailing list, or through email are welcome.

Links:

1. https://meta.wikimedia.org/wiki/Research:Automated_classification_of_article_importance
2. https://meta.wikimedia.org/wiki/Research:Studies_of_Importance

Regards,
Morten
[[User:Nettrom]] aka [[User:SuggestBot]]

_______________________________________________
Wiki-research-l mailing list
Wiki-research-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

Discovery Weekly Update for the week starting 2017-04-03
by Chris Koerner 08 Apr '17

Hello,

Here is this week's update from the Discovery department. As always, feedback and questions are welcome.

== Highlights ==

* Jan presented the Wiktionary widget during the CREDIT showcase [0]
* Translations were asked for, in order to post a message on the top language Village Pumps about the upcoming production release of sister projects snippets being shown on the search results pages [1]
* Posted about the upcoming sister projects snippets in search results on various Village Pumps and to a few email lists [2] [3]
* The Interactive team has asked for feedback on a new map style [4] [5]

== Discussions ==

=== Analysis ===

* Fixed an issue with the retrieval scripts not using correct data on the portal dashboard [6]
* Removed regex in ZRR breakdown by type on the Search dashboard [7] [8]

=== Portal ===

* We got some help fixing a deployment bug on the Portals - yay! [9]

== Did you know? ==

* Amberjack served sashimi style is pretty good! [10]

[0] https://www.youtube.com/watch?v=Jn_3CT6GR9o
[1] https://www.mediawiki.org/wiki/Cross-wiki_Search_Result_Improvements/villag…
[2] https://phabricator.wikimedia.org/T162064#3161941
[3] https://phabricator.wikimedia.org/T162064#3161951
[4] https://lists.wikimedia.org/pipermail/maps-l/2017-April/001565.html
[5] https://phabricator.wikimedia.org/T153282
[6] https://phabricator.wikimedia.org/T162178
[7] https://discovery.wmflabs.org/metrics/#failure_breakdown
[8] https://phabricator.wikimedia.org/T161876
[9] https://phabricator.wikimedia.org/T161832
[10] https://commons.wikimedia.org/wiki/File:Amberjack_fish_served_sashimi_style…

---

The archive of all past updates can be found on MediaWiki.org: https://www.mediawiki.org/wiki/Discovery/Status_updates

Interested in getting involved? See tasks marked as "Easy" or "Volunteer needed" in Phabricator.

[1] https://phabricator.wikimedia.org/maniphest/query/qW51XhCCd8.7/#R
[2] https://phabricator.wikimedia.org/maniphest/query/5KEPuEJh9TPS/#R

Yours,
Chris Koerner
Community Liaison - Discovery
Wikimedia Foundation

Another round of name that thing
by Erik Bernhardson 07 Apr '17

We seem to have some consensus that for the upcoming learning to rank work we will build out a python library to handle the bulk of the backend data plumbing work. The library will primarily be code integrating with pyspark to do various pieces such as:

# Sampling from the click logs to generate the set of queries + pages that will be labeled with click models
# Distributing the work of running click models against those sampled data sets
# Pushing queries we use for feature generation into kafka, and reading back the resulting feature vectors (the other end of this will run those generated queries against either the hot-spare elasticsearch cluster or the relforge cluster to get feature scores)
# Merging feature vectors with labeled data, splitting into test/train/validate sets, and writing out files formatted for whichever training library we decide on (xgboost, lightgbm and ranklib are in the running currently)
# Whatever plumbing is necessary to run the actual model training and do hyper parameter optimization
# Converting the resulting models into a format suitable for use with the elasticsearch learn to rank plugin
# Reporting on the quality of models vs some baseline

The high level goal is that we would have relatively simple python scripts in our analytics repository that are called from oozie. Those scripts would know the appropriate locations to load/store data and pass into this library for the bulk of the processing. There will also be some script, probably within the library, that combines many of these steps for feature engineering purposes to take some set of features and run the whole thing.

So, what do we call this thing? Horrible first attempts:

* ltr-pipeline
* learn-to-rank-pipeline
* bob
* cirrussearch-ltr
* ???
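
As a rough illustration of the merge/split/write-out step listed above, here is a minimal sketch in plain Python rather than pyspark. Nothing here comes from the proof-of-concept code: the function names, the 80/10/10 ratios, the hash-based assignment, and the row layout `(query, page_id, label, features)` are all assumptions made for the example. The output line follows the LETOR-style `label qid:N index:value` text layout that RankLib reads; other training libraries may want something different.

```python
import hashlib
from collections import defaultdict


def assign_split(query, train=0.8, test=0.1):
    """Deterministically map a query string to 'train', 'test' or 'validate'.

    Hashing the query (an assumption, not a decided design) keeps every
    example for one query in the same split across repeated runs.
    """
    digest = hashlib.md5(query.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 1000 / 1000.0
    if bucket < train:
        return "train"
    if bucket < train + test:
        return "test"
    return "validate"


def split_by_query(rows):
    """Group (query, page_id, label, features) rows so no query spans splits."""
    splits = defaultdict(list)
    for row in rows:
        splits[assign_split(row[0])].append(row)
    return splits


def to_ranking_line(qid, label, features):
    """Format one labeled example as 'label qid:N 1:v1 2:v2 ...'."""
    feats = " ".join(f"{i}:{v:g}" for i, v in enumerate(features, start=1))
    return f"{label} qid:{qid} {feats}"
```

Splitting by query rather than by row matters for ranking data: if pages for the same query landed in both train and test, the evaluation would leak information about that query.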

Update on Discovery search efforts and upcoming releases
by Deborah Tankersley 06 Apr '17

tl;dr: Search continues to expand functionality by displaying more information on the search results page

Ever started searching for something on Wikipedia and wondered, *really*, is that all that there is? Does it feel like you’re somehow playing hide and seek with all the knowledge that’s out there? And... wouldn’t it be great to see articles or categories that are similar to your search query and maybe some related images or links to other languages in which to read that article? Or, maybe you just want to read and contribute to projects other than Wikipedia but need a jump start with a few short summaries from sister projects.

The Discovery Search team has been testing out some really cool new features that will enable some fun and fascinating clicking down the rabbit hole of Wikipedia.[1] But first, let’s recap what we’ve been doing recently.

We've been doing tons of work creating, updating, and finessing the search back end to enhance search queries. Many complex things have happened: adding ascii-folding and stemming, detecting when a visitor might be typing in a language that is different from that of the Wikipedia they are on, switching from tf-idf to BM25, dropping trailing question marks, and updating to Elasticsearch version 5. [2][3][4][5][6][7] Whew!

We have much more planned in the coming months: machine learning with ‘learning to rank’, investigating and deploying new language analyzers, and, after exhaustive analysis, removing quotes within queries by default.[8][9][10][11] We’ll also be working closely with the new Structured Data team in their brand new work on Commons.[12][13]

We also want to improve the part that our readers and editors interface with: the search results page! We started brainstorming during the late summer of 2016 on what we could do to make search results better, so that it is easier to find interesting, relevant content and the viewing experience is more intuitive.[14] We designed and refined numerous ideas on how to improve the search results page and received lots of good feedback from the community.[15]

Empowered by the feedback, we began testing, starting with a display of results from the Wikimedia sister projects next to the regular search results.[16] The idea for this test was to enable discovery into other projects, ones that our visitors might not have known about, by displaying interesting results in small snippets. The sidebar display of the sister projects borrows from a similar feature in use on the Italian, Catalan and French Wikipedias. We've run two A/B tests on the sister project search results with detailed analysis and, after a few final touches to the code, we will release the new functionality into production on all Wikipedias near the end of April 2017.

Our next A/B test will be to add additional information and related results for each search query. This will be in the form of an ‘explore similar’ link; when someone interacts with it, an expanded display will appear with related pages, categories and links to the article in other languages, all of which might lead to further knowledge discovery.[17] We know that not every search query will return exactly what folks were looking for, but we feel that adding links to similar, related information would be helpful and, possibly, super interesting!

We also plan on doing a few more A/B tests in the coming year:

* Test a new display that will show the pronunciation of a word with its definition and part of speech, all from existing data in Wiktionary. Initially this will be in English only.
* Test placing a small image (from the article) next to each search result that is displayed on the page.
* Test an additional feature using a new auto completion metadata display in the search box that is located on the top right of most pages in Wikipedia, similar to what happens on the Wikipedia.org portal.[18]

For the more technically minded, there is a way to test out these new features in your own browser. Displaying the sister project search results requires a bit of URL manipulation; for the explore similar and Wiktionary widget, you can modify your common.js file to test an early version of the features. Detailed information is available on MediaWiki.org.[19]

Once the testing, analysis and feedback cycle is done for each new feature, we’d like to slowly implement them into production on all Wikipedias throughout the rest of the year. We’re really hoping that these enhancements to how search works will further the usefulness of search and make our readers and editors more productive.

Cheers from the Discovery Search team!

[1] https://xkcd.com/214/
[2] https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Re-Ordering_Stemming_and_Ascii-Folding_on_English_Wikipedia
[3] https://blog.wikimedia.org/2016/07/27/wikipedia-language-search/
[4] https://en.wikipedia.org/wiki/Tf%E2%80%93idf
[5] https://en.wikipedia.org/wiki/Okapi_BM25
[6] https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Dropping_Final_Question_Marks_in_the_Top_10_Wikipedias
[7] https://phabricator.wikimedia.org/T154501
[8] https://en.wikipedia.org/wiki/Learning_to_rank
[9] https://phabricator.wikimedia.org/T154511
[10] https://commons.wikimedia.org/wiki/File:From_Zero_to_Hero_-_Anticipating_Zero_Results_From_Query_Features,_Ignoring_Content.pdf
[11] https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Quotes_and_Questions
[12] https://commons.wikimedia.org/wiki/Commons:Structured_data
[13] https://blog.wikimedia.org/2017/01/09/sloan-foundation-structured-data/
[14] https://www.mediawiki.org/wiki/Cross-wiki_Search_Result_Improvements
[15] https://www.mediawiki.org/wiki/Talk:Cross-wiki_Search_Result_Improvements
[16] https://www.mediawiki.org/wiki/Cross-wiki_Search_Result_Improvements/Testing#A.2FB_test:_Add_cross-wiki_search_results_in_a_right_hand_sidebar
[17] https://www.mediawiki.org/wiki/Cross-wiki_Search_Result_Improvements/Testing#A.2FB_test:_Add_.27explore_similar.27_pages_and_categories_for_search_results
[18] https://www.wikipedia.org/
[19] https://www.mediawiki.org/wiki/Cross-wiki_Search_Result_Improvements/self-guided_testing

--
deb tankersley
irc: debt
Product Manager, Discovery
Wikimedia Foundation
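
For readers curious about the tf-idf to BM25 switch mentioned in the message above, the following is a small sketch of the textbook Okapi BM25 per-term score next to one simple tf-idf variant. It is not the code Elasticsearch runs, and the tf-idf form chosen here is only one of several in use. The function and parameter names are made up for the example, and `k1=1.2`, `b=0.75` are commonly used defaults assumed here rather than values taken from the message.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Okapi BM25 contribution of one query term to one document's score.

    tf: occurrences of the term in the document
    doc_len / avg_doc_len: this document's length vs. the corpus average
    n_docs / doc_freq: corpus size and number of documents containing the term
    """
    idf = math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalisation: longer-than-average documents are penalised.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + norm)


def tfidf_term_score(tf, n_docs, doc_freq):
    """One simple tf-idf variant (an assumption): sqrt(tf) times a log idf."""
    return math.sqrt(tf) * (1.0 + math.log(n_docs / (doc_freq + 1.0)))
```

The practical difference the sketch is meant to show: the BM25 score saturates as `tf` grows (it can never exceed `idf * (k1 + 1)`), while this tf-idf variant keeps growing, so a page that repeats a word many times gains less under BM25.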

Re: [discovery] [AI] Another round of name that thing
by Erik Bernhardson 05 Apr '17

On Wed, Apr 5, 2017 at 12:55 PM, Aaron Halfaker <aaron.halfaker(a)gmail.com> wrote:

> Link to code?

No code yet, although there is proof-of-concept code which will inform this work at stat1002.eqiad.wmnet:/a/ebernhardson/spark_feature_log/code

> "ltr" means "left to right" to me. Maybe you could do something like
> "ltrank"

Sounds like LTR is out as the term is already used elsewhere and is more widely known. LTRank isn't a bad compromise with spelling out the whole thing.

> On Wed, Apr 5, 2017 at 2:28 PM, Erik Bernhardson <
> ebernhardson(a)wikimedia.org> wrote:
>
>> We seem to have some consensus that for the upcoming learning to rank
>> work we will build out a python library to handle the bulk of the backend
>> data plumbing work. The library will primarily be code integrating with
>> pyspark to do various pieces such as:
>>
>> # Sampling from the click logs to generate the set of queries + page's
>> that will be labeled with click models
>> # Distributing the work of running click models against those sampled
>> data sets
>> # Pushing queries we use for feature generation into kafka, and reading
>> back the resulting feature vectors (the other end of this will run those
>> generated queries against either the hot-spare elasticsearch cluster or the
>> relforge cluster to get feature scores)
>> # Merging feature vectors with labeled data, splitting into
>> test/train/validate sets, and writing out files formatted for whichever
>> training library we decide on (xgboost, lightgbm and ranklib are in the
>> running currently)
>> # Whatever plumbing is necessary to run the actual model training and do
>> hyper parameter optimization
>> # Converting the resulting models into a format suitable for use with the
>> elasticsearch learn to rank plugin
>> # Reporting on the quality of models vs some baseline
>>
>> The high level goal is that we would have relatively simple python
>> scripts in our analytics repository that are called from oozie, those
>> scripts would know the appropriate locations to load/store data and pass
>> into this library for the bulk of the processing. There will also be some
>> script, probably within the library, that combines many of these steps for
>> feature engineering purposes to take some set of features and run the whole
>> thing.
>>
>> So, what do we call this thing? Horrible first attempts:
>>
>> * ltr-pipeline
>> * learn-to-rank-pipeline
>> * bob
>> * cirrussearch-ltr
>> * ???
>>
>> _______________________________________________
>> AI mailing list
>> AI(a)lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/ai

Elasticsearch - datacenter switch
by Guillaume Lederrey 05 Apr '17

Hello teams!

After some discussion with David, we realised that the Cirrus / Elasticsearch switch is already more automated than we thought. Cirrus is configured to talk to the local Elasticsearch cluster. So if we start serving traffic for MediaWiki from codfw, those MediaWiki instances should contact the Elasticsearch codfw cluster.

We do have the ability to change that configuration and force the use of a specific cluster. That's what we did during the previous datacenter switch, and what we already do for some maintenance operations (yes, major upgrades of Elasticsearch do require downtime, so we use codfw during those upgrades).

Since we have already tested a manual DC switch quite a few times, it is time to check if this automatic switch is working as it should. The only downside is that it increases the number of moving parts during the MediaWiki switch.

On a last note, lots of thanks and praise to David and Erik, who thought ahead much more than I did and implemented those nice features!

Have fun!

Guillaume

--
Guillaume Lederrey
Operations Engineer, Discovery
Wikimedia Foundation
UTC+2 / CEST
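
The local-by-default behaviour with an optional override described above can be pictured with a tiny sketch. This is not CirrusSearch configuration or code: the function name, the `forced` parameter, and the endpoint URLs are hypothetical and only illustrate the selection logic.

```python
def pick_search_cluster(local_dc, clusters, forced=None):
    """Return the cluster endpoint to query.

    By default the cluster in the caller's own datacenter is used; passing
    `forced` (a hypothetical override) pins every caller to one cluster,
    as one might during an upgrade of the other cluster.
    """
    name = forced if forced is not None else local_dc
    if name not in clusters:
        raise KeyError(f"no search cluster configured for {name!r}")
    return clusters[name]


# Hypothetical endpoints, for illustration only.
CLUSTERS = {
    "eqiad": "https://search.eqiad.example:9243",
    "codfw": "https://search.codfw.example:9243",
}
```

With this shape, moving application traffic to codfw changes `local_dc` for those instances and the search traffic follows without a separate step, while `forced="codfw"` corresponds to the manual switch used for maintenance.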
