Thank you for the update, Stas. Forwarding to the Discovery mailing list.
Pine
On Feb 12, 2016 01:06, "Stas Malyshev" <smalyshev(a)wikimedia.org> wrote:
> Hi!
>
> In order to make prefix search better, and to bring all variants of
> prefix search under one roof, we did some refactoring in the search
> engine implementation, so that various prefix searches now use the same
> code path and all use the SearchEngine class.
>
> The changes are as follows:
>
> SearchEngine gets the following new API functions:
>
> * public function completionSearch( $search ) - implements prefix
> completion search, returns SearchSuggestionSet
> * public function completionSearchWithVariants( $search ) - implements
> prefix completion search including variants handling, returns
> SearchSuggestionSet.
> * public function defaultPrefixSearch( $search ) - basic prefix search
> without fuzzy matching, to be used in scenarios like special page
> search. Returns Title[].
>
> An implementation does not have to implement all three methods
> differently; they can all share the same code if needed.
>
> The default implementation still supports the PrefixSearchBackend hook
> but we plan to deprecate it, and the CirrusSearch implementation does
> not use it anymore. Instead, there is a protected function,
> completionSearchBackend( $search ), which implementations (including
> CirrusSearch) should override to provide search results.
>
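For illustration, a minimal sketch of such an override (the class name and the backend-lookup helper are hypothetical; real implementations such as CirrusSearch are more involved):

```php
<?php
// Hypothetical sketch of a SearchEngine subclass supplying its own
// completion backend. Assumes MediaWiki's SearchEngine and
// SearchSuggestionSet classes are loaded.
class MyPrefixSearchEngine extends SearchEngine {
	/**
	 * Return raw completion results for the given prefix.
	 * @param string $search
	 * @return SearchSuggestionSet
	 */
	protected function completionSearchBackend( $search ) {
		// Query whatever backend you have; this stub just wraps plain
		// title strings. fetchTitleStrings() is a made-up helper.
		$titleStrings = $this->fetchTitleStrings( $search );
		return SearchSuggestionSet::fromStrings( $titleStrings );
	}
}
```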
> SearchEngine implementations can make use of services provided by the
> base SearchEngine including:
>
> - namespace resolution and normalization. The
> PrefixSearchExtractNamespace hook is still supported for engines wishing
> to implement namespace lookup not featured in the standard implementation.
> - fetching titles for result sets (the implementing engine does not have
> to fetch titles from the DB for suggestions)
> - result reordering to ensure exact matches are on top
> - basic prefix search implementation using the database
> - Special: namespace search implementation
>
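From the caller's side, using the new APIs might look roughly like this (a sketch only; the factory call is an assumption, and the suggestion accessors follow the SearchSuggestionSet API):

```php
<?php
// Hypothetical caller-side sketch: run a completion search (with
// language-variant handling) and print the suggested titles. The base
// SearchEngine takes care of putting exact matches on top.
$engine = SearchEngine::create(); // assumed: obtain the configured engine
$engine->setLimitOffset( 10 );
$suggestions = $engine->completionSearchWithVariants( 'Alber' );
foreach ( $suggestions->getSuggestions() as $suggestion ) {
	echo $suggestion->getText() . "\n";
}
```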
> == Deprecations ==
> We plan to deprecate the PrefixSearchBackend hook and classes
> TitlePrefixSearch and StringPrefixSearch. We will keep those classes
> around as a basic search fallback and for old extensions, but no new
> code should use them; instead, it should use the SearchEngine APIs
> described above. MediaWiki core has already been updated accordingly.
> Extensions implementing search engines should extend SearchEngine and
> override the APIs above; CirrusSearch is an example of how to do this.
>
> == Show me the code ==
> The patches implementing the refactoring are linked from:
> https://phabricator.wikimedia.org/T121430
>
> Pretty version of the same:
> https://www.mediawiki.org/wiki/User:Smalyshev_(WMF)/Suggester
>
> If you have questions on this, please contact the Discovery team:
> https://www.mediawiki.org/wiki/Wikimedia_Discovery#Communications
> --
> Stas Malyshev
> smalyshev(a)wikimedia.org
>
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Hi!
I plan to send a note to wikitech about the SearchEngine prefix completion
refactoring we did recently for the suggester. Here's what it would look
like: https://www.mediawiki.org/wiki/User:Smalyshev_(WMF)/Suggester
Please review and suggest fixes/additions.
--
Stas Malyshev
smalyshev(a)wikimedia.org
In the unMeeting Dan did an impressive impromptu bit of double-speed
speaking (not DoubleSpeak, though—that would be double plus ungood).
Here's the FedEx commercial I mentioned right at the end. This guy had his
15 minutes of fame doing these commercials:
https://www.youtube.com/watch?v=NeK5ZjtpO-M
—Trey
After some feedback from the blog folks, I have an updated draft to share.
Please take a look and provide any corrections. I think it tells the bigger
story of changes coming to production, and explains our research quite well
(not the other way around!).
I simplified the language, made it about 100 words shorter, put the big
news at the top, and removed passive voice and adverbs.
https://meta.wikimedia.org/wiki/User:CKoerner_(WMF)/Work/A-B_Test_Results_1
Here's a diff
<https://meta.wikimedia.org/w/index.php?title=User%3ACKoerner_%28WMF%29%2FWo…>
of my slew of changes.
--
Yours,
Chris Koerner
Community Liaison - Discovery
Wikimedia Foundation
Hi, does anyone happen to know the rough TTL for morelike results?
For edge-cache performance of Related Articles, we're thinking of setting
smaxage / RESTBase to cache responses for 24 hours as a sort of happy
medium.
-Adam
Hello all,
I recently had a lovely chat with Ed Saperia, a community member
working on projects related to discovering news in Wikipedia, to let him
know what the Discovery Team is about and what we are doing with the Wikipedia
Portal page <https://www.wikipedia.org>.
Ed is working on a recommender algorithm that will provide a sortable
listing of news, so that its users can help make Wikipedia a source of
news for readers. It's meant to be open and collaborative, ideally with
the codebase living on-wiki, like Lua modules. This algorithm, in theory,
would be able to reference all the metadata (article views, edits,
timestamps, etc.) and semantic data (categories, Wikidata properties)
related to each edit.
We chatted about how to make his project more informative by using Wikidata
and that it'd be a good idea to have sections (or filters) for sports,
deaths, celebrities, politics, etc. He'd also like to have info on why the
recommended article is there, something like: "This [person/topic] is
trending because X number of edits were made in the last 24 hours" or "This
[person] is trending because X's [date of death] was added."
I showed him a few trending sites that some of our community folks are
working on that are somewhat similar: http://top.hatnote.com/ and
http://www.trending.eu/en/1/. Those sites don't necessarily show as much
rich metadata as Ed's project hopes to have, but they're still pretty
neat to see as trending-article sites.
Ed and his team of developers will be meeting in a few weeks to work on
their project and might offer us a chance to chat with them about this
project. I let him know that our team is hoping to launch a Portal Labs
project for the community to view at any time and provide feedback on
proposed Portal re-designs and enhancements. I think Ed's recommender
algorithm project for trending articles would be fun to add in as a sample
alternative page!
Overall, he's got some very good ideas and I'm excited to see where his
project ends up!
Cheers,
Deb
--
Deb Tankersley
Product Manager, Discovery
Wikimedia Foundation
Greetings,
Moving this discussion from IRC to email for added transparency and visibility...
Previously on IRC:
*tfinc *
13:45 Deskana: so much really interesting talk about search on
https://meta.wikimedia.org/wiki/Talk:2016_Strategy/Reach#NaBUru38
13:46 https://meta.wikimedia.org/wiki/Talk:2016_Strategy/Reach
13:46 less about that specific post and more about the conversations in
general
13:46 i see lot of people who could help us test and move with next steps
*JustinO*
13:49 that talk is actually what reminded me to check in with you folks and
see if you wanted assistance in the relevance area
*tfinc*
13:51 JustinO: greetings. we can always use wise guidance and help to make
our users and donors proud. what do you have in mind ?
*JustinO*
13:52 last year I was talking with a couple of folks after elasticon
13:53 and we were going thru the first steps like which metrics are useful
to track
*jgirault*
13:54 debt: OuKB: jan_drewniak: besides a varnish issue with images, the
page with separate JS file is on beta http://www.wikipedia.beta.wmflabs.org/
*tfinc*
13:55 JustinO: ebernhardson and i will be at this years elasticon
13:56 JustinO: we've been looking at a number of interesting metrics to
validate user satisfaction for our search relevance. bearloga can tell you
plenty about it
*JustinO*
13:57 awesome. i looked thru some of your docs. tracking dwell time is
great as it opens up a whole host of useful metrics
*ebernhardson*
13:57 JustinO: we almost certainly need help in relevance :) we are
currently hitting some very high-level things, but we need to do a lot more
in terms of collecting and measuring relevance (both from users, and in
back testing for new features) to do well moving forward
*bearloga*
13:58 JustinO: we're tracking dwell time and clickthrough rate. we hope to
get some qualitative user feedback to correlate that with the quantitative
data we're tracking
*JustinO*
13:58 with that you can infer good clicks vs. bad clicks. which leads to a
session success rate, time to success, etc. and in the long run gives you a
training set to do offline evaluations and in the long term, machine
learned rankers
*jgirault*
13:59 the deploy-to-prod patch would be:
https://gerrit.wikimedia.org/r/268804
*tfinc*
13:59 JustinO: Trey314159 has worked a bit on creating a baseline
relevance lab to do offline evaluations between different
ranking/sorting/etc algorithms
*JustinO*
14:00 @*bearloga*: one simple way of getting qualitative feedback is the
simple "how was your search today?" message
*jan_drewniak*
14:01 jgirault: like someone once said, the hardest things in programming
are cache invalidation and naming things
*JustinO*
14:01 @tfinc: offline evals are very useful. creating a hand-generated
judgment set with clean labels takes time but pays off
*ebernhardson*
14:01 we also do track which position the user clicked, in addition to
dwell time. But i don't think we are doing anything with that information
yet
*bearloga*
14:02 JustinO: the question we're going to ask is basically that but we're
working on rolling out that feedback system
*jgirault*
14:02 jan_drewniak: and choosing between spaces and tabs
*JustinO*
14:04 *ebernhardson*: i think i was suggesting tracking {query, all
results, position clicked, dwell time on the clicked page, userid, time
from pageload to click}
*jgirault*
14:04 alright, so I'm gonna head to the office now. Once I get there, I'll
try to find someone to push that to prod. Meanwhile, if you have time
jan_drewniak you can sanity check the latest master
*Trey314159*
14:04 JustinO: Hey! Sorry Dan (Deskana) and I haven't gotten back to your
email yet. It's been a busy week, and there's a lot of stuff but not a lot
of context to that email thread.
*JustinO*
14:04 @*Trey314159*: no worries
*Trey314159*
14:04 Fortunately, James outlined your conversation:
https://meta.wikimedia.org/wiki/Schema_talk:Search#Useful_metrics_to_track
14:05 (For anyone else who wants to take a look)
*ebernhardson*
14:05 JustinO: interesting, i think we are collecting most of those, but
not the full results list or the user id. We do collect a token that is a
short-term proxy for the user id though
*JustinO*
14:05 *an anonymous token for the id is great*
*ebernhardson*
14:05 JustinO: i'm curious, by all results you mean (in our case) a list of
page titles or id's?
*Ironholds*
14:05 JustinO, can I ask you move this to the mailing list or email myself
or bearloga? We can explain what we're already tracking, what we're
planning on tracking, and you can chip in feedback
*ebernhardson*
14:05 i hadn't thought of that, but it makes sense
*JustinO*
14:05 @*ebernhardson* : pageids i suppose, i'm not sure what's best for
wikimedia
*Ironholds*
14:06 at the moment this is kind of duplicative because you don't know what
we're tracking in advance of suggesting we track it ;p
*ebernhardson*
14:06 the current schema is here:
https://meta.wikimedia.org/wiki/Schema:TestSearchSatisfaction2
14:06 the descriptions could be better, but give a general idea
*JustinO*
14:07 *ebernhardson: session id is prob fine for a userid unless you want
to get towards personalization in the long run. eg: give coders more pages
related to tech*
*Trey314159*
14:07 Ironholds: to be fair, JustinO suggested we track it long before we
actually did (early last year).. but I agree this might be a better
conversation on the mailing list, definitely including Ironholds and
bearloga, and not late on a Friday afternoon (local time for me, at least)
*Ironholds*
14:07 JustinO, yep, we've tested session IDs. We know these things ;p
*Ironholds*
14:08 let's chat on the mailing lists where conversations can be seen by
other users/helpers for transparency purposes, and we can be async to avoid
time drains
*JustinO*
14:08 yeah, i'm assuming you've put lots of thought into the topics
*Ironholds*
14:09 https://lists.wikimedia.org/mailman/listinfo/discovery for reference
*JustinO*
14:09 yep
*Ironholds*
14:10 (our mailing list infrastructure makes it a nightmare to find
anything. I just use google ;p)
*JustinO*
14:10 i may be on there
*Ironholds*
14:10 (...appropriate for the discovery team I guess)
*bearloga*
14:10 chuckles
*ebernhardson*
14:10 Ironholds: while i don't expect it will make it into prod (change is
hard) there is a test instance of Discourse that could plausibly replace
mailing lists and be more discoverable
14:11 https://discourse.wmflabs.org/
*Ironholds*
14:11 cool!
--justin
Just passing this along in case anyone is interested. The abstract mentions
the use of Wikipedia: "We infer regular users' interests from their
self-reported biographies that are publicly available and use Wikipedia
pages to ground these interests as fine-grained, disambiguated concepts."
https://research.facebook.com/publications/discovery-of-topical-authorities…
Pine
Hi Discoverers,
If it's possible, I'd love to hear more about some of the Q3 Discovery
projects during the February lightning talks. (
https://www.mediawiki.org/wiki/Lightning_Talks#February_2016)
Any of these topics would be interesting:
* Relevance of intra-wiki search
* Models for user satisfaction with search
* Wikipedia.org portal improvements
* OSM tiling
* WDQS
Thanks!
Pine