Hi all,
In response to [0] I am considering volunteering to develop the tabbed
search interface [2] [3]. To me it looks more logical and more familiar
to users than the other interfaces.
I'm Gryllida at Wikimedia sites. I have prior Perl and JavaScript
experience interacting with the MediaWiki API [1], but none in PHP. The
JavaScript things I wrote are rather scattered; I have only a minimal
understanding of objects and modules, as I have only written
subroutine-style scripts before. At home, I use a Debian GNU/Linux desktop.
So this week I came to IRC and asked several questions to get an idea of
what the Discovery team is doing. Thanks Deborah for sharing the current
state of things! :-) I gather that the tabbed interface is in the plans
and nobody is working on it yet, so it's a good task to take on.
We left some questions unanswered. In particular, is the Labs
instance at [4] expected to be used for all ideas at once or only for
one at a time, and is it shared between several people? Is it a good
idea for me to use a Labs instance at initial development stages or only
when the code is nearing completion? Or is it better to use a Vagrant
instance locally? Or both?
What documentation and code do you recommend I read? May I develop
it as an extension rather than a gadget wherever possible, so that people
don't have to wait for page JavaScript to finish loading before they see
the new sister wiki tabs?
May I please ask someone to volunteer mentoring me throughout the
project? (I am in the UTC+11 timezone at present; 'gry' nickname at
chat.freenode.net.)
Regards,
Svetlana.
[0]:
https://lists.wikimedia.org/pipermail/wikitech-ambassadors/2016-November/00…
[1]: http://svetlana.nfshost.com/fs/
[2]:
https://www.mediawiki.org/wiki/Cross-wiki_Search_Result_Improvements/Design…
[3]: https://wikitech.wikimedia.org/wiki/User:Gryllida/sandbox
[4]: https://phabricator.wikimedia.org/T151344
Hello!
I've been working on a presentation
<https://docs.google.com/presentation/d/1ctlqdLA__0OxDuO7mJEIDLP-xt9a7E4jv4I…>
that gives a summary of who Discovery is, what our mission is, and
what's coming up for the rest of the year. I'd like to share it all with
you!
This presentation is a living document. The content and style can and will
change over time, perhaps even drastically. This is especially true for the
roadmap slide. I made this clear in the presentation, but it's worth
pointing out again. :-)
If there are any questions, I'd be happy to answer them!
Thanks,
Dan
--
Dan Garry
Lead Product Manager, Discovery
Wikimedia Foundation
Recently I've been doing some investigation into how we can collect enough
data to plausibly train an ML model for search re-ranking. As with all ML
training, the labeled dataset to train against is an important piece. Many
approaches seem to use human labeled relevance, and we have a platform for
collecting this data which has proven to have decent predictive
capabilities for offline tests of changes to our search. But the amount of
data necessary for training ML models is simply not there.
In my research I've come across a paper, "A Dynamic Bayesian Network
Click Model for Web Search Ranking" [1], and a related implementation [2]
that seem to have some promise. Machine generation of relevance labels
seems promising because I can collect a reasonable amount of information
about clickthroughs and the search results that were provided to users.
For one week of enwiki traffic I have ~20k queries that were issued by more
than 10 identities (~distinct search session). This has around 135k
distinct (query, identity) pairs, 140k distinct (query, identity, click
page id) pairs, 414k distinct (query, result page id) pairs, and covers ~3M
results (~20 per page) that were shown to users and could be converted into
relevance judgements. I'm not sure which set to train the final model on,
though: the 414k distinct (query, result_page_id) pairs, or the 3M
impressions, which duplicate entries from the 414k wherever the same
(query, result_page_id) pair was shown multiple times.
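For context, the naive baseline that click models like the DBN improve on is a plain clickthrough rate per (query, result) pair. A minimal sketch of that aggregation, assuming a hypothetical log format of (query, session, page_id, clicked) impression tuples:

```python
from collections import defaultdict

def ctr_labels(click_log):
    """Aggregate an impression log into naive clickthrough-rate labels.

    click_log: iterable of (query, session_id, result_page_id, clicked)
    tuples, one per shown result. Returns {(query, page_id): ctr}.
    This baseline is position-biased; the DBN model in [1] corrects
    for lower-ranked results being examined less often.
    """
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for query, _session, page_id, was_clicked in click_log:
        key = (query, page_id)
        shown[key] += 1
        if was_clicked:
            clicked[key] += 1
    return {key: clicked[key] / shown[key] for key in shown}

log = [
    ("foo", "s1", 10, True),
    ("foo", "s1", 11, False),
    ("foo", "s2", 10, True),
    ("foo", "s2", 11, False),
]
labels = ctr_labels(log)
# labels[("foo", 10)] == 1.0, labels[("foo", 11)] == 0.0
```

The duplication question above corresponds to whether each impression enters the training set once, or each distinct pair enters once with a weight.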
I was also curious about a part in the appendix of the paper, labeled
Confidence. It states:
> Remember that the latent variables a_u and s_u will later be used as
> targets for learning a ranking function. It is thus important to know
> the confidence associated with these values.
Why is it important to know the confidence, and how does that play into
training a model? This is probably basic ML stuff but I'm new to all of
this.
And finally, are there better ways of generating relevance labels from
clickthrough data, ideally with open source implementations? This is just
something I happened to stumble upon in my research and certainly not the
only thing out there.
[1] http://www2009.eprints.org/1/1/p1.pdf
[2] https://github.com/varepsilon/clickmodels
(cc'ing the discovery mailing list, as that team owns both the
implementation and operation of search.)
I can partially answer this as one of the people responsible for search,
but I have to defer to others about API, bots, and such.
For reference, this would be a noticeable portion of our traffic:
action=opensearch (and generator variants): 1.5k RPS
action=query&list=search (and generator variants): 600 RPS
all api: 8k RPS (might be a bit higher, this is averaged over an hour)
opensearch is relatively cheap: the p95 latency to our search servers is
~30ms, with the p50 at 7ms. So 600 RPS of opensearch traffic wouldn't be
too hard on our search cluster. Using action=query is going to be too
heavy, as full text searches are computationally more expensive to serve.
Might I ask, which wiki(s) would you be querying against? opensearch
traffic is spread across our search cluster, but individual wikis only hit
portions of it. For example opensearch on en.wikipedia.org is served by
~40% of the cluster, but zh.wikipedia.org (chinese) is only served by ~13%.
If you are going to send heavy traffic to zh I might need to adjust those
numbers to spread the load to more servers (easy enough, just need to know).
Additionally, you mentioned descriptions and keywords. These would not be
provided directly by the opensearch API, so you might be thinking of using
its generator version (action=query&generator=prefixsearch) to get the
results augmented
(ex: /w/api.php?action=query&format=json&prop=extracts&generator=prefixsearch&exlimit=5&exintro=1&explaintext=1&gpssearch=yah&gpslimit=5).
I'm not personally sure how expensive that is, someone else would have to
chime in.
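For clarity, here is a sketch of how that augmented request can be built programmatically. The parameters mirror the example URL above; the en.wikipedia.org endpoint and the "yah" sample prefix are taken from that example:

```python
import urllib.parse

def prefixsearch_with_extracts_url(prefix, limit=5,
                                   endpoint="https://en.wikipedia.org/w/api.php"):
    """Build the augmented prefix-search request described above:
    generator=prefixsearch for the results, prop=extracts for
    plain-text intro snippets to use as descriptions."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "extracts",
        "generator": "prefixsearch",
        "exlimit": limit,
        "exintro": 1,
        "explaintext": 1,
        "gpssearch": prefix,
        "gpslimit": limit,
    }
    return endpoint + "?" + urllib.parse.urlencode(params)

url = prefixsearch_with_extracts_url("yah")
# Fetching this URL (e.g. with urllib.request.urlopen) returns JSON whose
# query.pages values each carry a page title and a plain-text "extract".
```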
So, from a computational point of view and only with respect to the search
portion of our cluster, this seems plausible as long as we coordinate so
that we know the traffic is coming. Others will have to chime in about the
wider picture.
Erik B.
On Mon, Nov 14, 2016 at 4:40 PM, Eric Kuo <erickuo(a)yahoo-inc.com> wrote:
> Hi,
>
> This is Eric from Yahoo. My team develops mobile apps for Taiwan and Hong
> Kong users. We want to provide wiki description on keywords in our
> contents, and we consider using MediaWiki API:OpenSearch and/or API:Query
> to achieve this. Our estimated RPS is 900, and we will cache the query
> result on our side. We would like to know if there is any concern with
> respect to our RPS, and if so, what is the best practice.
>
> Any comments and suggestions are welcome. Thank you for your time.
>
> Best regards,
> Eric
>
> _______________________________________________
> Mediawiki-api mailing list
> Mediawiki-api(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/mediawiki-api
>
>
Thanks, Huji. To answer your question: this will be a short series of
tests on Wikipedia that will display additional relevant search results
across wikis in the same language to a selected number of users who fall
into our bucketing schema. Not everyone who lands on a search results
page will see the new results every time.
Cheers,
Deb
--
deb tankersley
Product Manager, Discovery
irc: debt
Wikimedia Foundation
On Thu, Nov 10, 2016 at 6:11 PM, Huji Lee <huji.huji(a)gmail.com> wrote:
> I think it would be best if we test it in at least one RTL wiki. I will
> mention this in the VP of Persian Wikipedia (FA WP).
>
> If it is meant to only be shown for select users and has no impact for
> others, I am willing to volunteer myself as a tester.
>
> Huji
>
> On Thu, Nov 10, 2016 at 6:03 PM, Deborah Tankersley <
> dtankersley(a)wikimedia.org> wrote:
>
>> Hello,
>>
>> The Discovery Search team is looking for a few language specific
>> Wikipedia sites that would be interested in helping with A/B testing for
>> cross-wiki search results. These tests would evaluate whether adding
>> search results across wiki projects in the same language would be
>> useful, relevant, and of interest to users.
>>
>> We've written up the details
>> <https://www.mediawiki.org/wiki/Cross-wiki_Search_Result_Improvements> [1],
>> came up with a multitude of designs
>> <https://www.mediawiki.org/wiki/Cross-wiki_Search_Result_Improvements/Design>
>> [2], and had many conversations on both talk pages and with our own
>> internal Design team. We have also outlined the initial tests
>> <https://www.mediawiki.org/wiki/Cross-wiki_Search_Result_Improvements/Testing> [3]
>> that we'd like to run.
>>
>> These planned A/B tests would run for about a week and would only be
>> shown to a small subsection of users that visit the Wikipedia(s) that the
>> tests are running on. The analyzed results of these tests will be posted on
>> wiki so that everyone can see how they did in terms of usage and adoption
>> of the test group.
>>
>> We would like to know if there are any particular Wikipedias that would
>> want to help us test these new search results across projects in their
>> language. Interested community members might want to post something to
>> their project's Village Pump to build consensus. Wikipedias that are
>> related culturally or linguistically would also be of interest.
>>
>> Please post on our testing talk page
>> <https://www.mediawiki.org/wiki/Talk:Cross-wiki_Search_Result_Improvements/T…>
>> [4] if there are any questions, concerns, or volunteers!
>>
>> Thanks!
>>
>>
>> [1] https://www.mediawiki.org/wiki/Cross-wiki_Search_Result_Improvements
>> [2] https://www.mediawiki.org/wiki/Cross-wiki_Search_Result_Improvements/Design
>> [3] https://www.mediawiki.org/wiki/Cross-wiki_Search_Result_Improvements/Testing
>> [4] https://www.mediawiki.org/wiki/Talk:Cross-wiki_Search_Result_Improvements/Testing
>>
>> --
>> deb tankersley
>> Product Manager, Discovery
>> irc: debt
>> Wikimedia Foundation
>>
>> _______________________________________________
>> Wikitech-ambassadors mailing list
>> Wikitech-ambassadors(a)lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikitech-ambassadors
>>
>>
>