Discovery January 2016

discovery@lists.wikimedia.org

21 participants
25 discussions

QuickSurveys for Discovery in Q3
by Dan Garry 25 Jan '16

As many of you are aware, Discovery wants to run a QuickSurvey <https://www.mediawiki.org/wiki/Extension:QuickSurveys> in Q3 to ask users if they're satisfied with search results. A requirement of this is that we can tie the survey responses to our search schema and satisfaction metric, so that we can correlate responses with the data to figure out how effective our metric actually is at measuring search satisfaction.

Adam, Julien, and I had a brief chat today. We agreed that our goal is to be able to tie the data together by whatever means necessary, i.e. not necessarily by changing QuickSurveys if a different way is easier. Adam mentioned that QuickSurveys records mw.user.sessionId, which may be suitable and persistent enough that we could tie our data together if we added that to our search logging.

Obviously, there are other stakeholders to talk to (Erik, Oliver) and questions to resolve; Julien wants to have a meeting with Erik and Oliver next week.

Overall, this is good news. :-)

Thanks!
Dan

--
Dan Garry
Lead Product Manager, Discovery
Wikimedia Foundation
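To illustrate the idea of tying the two data sets together on a shared session identifier, here is a minimal sketch. It assumes both the survey responses and the search events have already been exported as records carrying the same session ID; the field names (`session_id`, `satisfied`, `dwell_seconds`) are hypothetical and are not taken from QuickSurveys or from the search schema.

```python
from collections import defaultdict


def join_on_session(survey_responses, search_events):
    """Attach each survey response to the search events from the same session.

    Both arguments are lists of dicts with a hypothetical "session_id" key.
    """
    # Group search events by session so each lookup is a dictionary access.
    events_by_session = defaultdict(list)
    for event in search_events:
        events_by_session[event["session_id"]].append(event)

    joined = []
    for response in survey_responses:
        session_id = response["session_id"]
        joined.append({
            "session_id": session_id,
            "satisfied": response["satisfied"],
            # Empty list when no search events were logged for the session.
            "search_events": events_by_session.get(session_id, []),
        })
    return joined


# Made-up records, for illustration only.
surveys = [
    {"session_id": "abc", "satisfied": True},
    {"session_id": "xyz", "satisfied": False},
]
searches = [
    {"session_id": "abc", "dwell_seconds": 42},
    {"session_id": "abc", "dwell_seconds": 3},
]
joined = join_on_session(surveys, searches)
```

Whether the session ID is persistent enough for this kind of join is exactly the open question raised in the message; the sketch only shows the shape of the correlation step.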

Dashboards backlogged
by Oliver Keyes 25 Jan '16

Hey all, Just writing to let people know that we expect the dashboards to be backlogged for a few days. This is due to ongoing EventLogging maintenance at the Analytics Engineering end of things, which should be neatly tidied up by Tuesday-ish. I'll let y'all know if I get more information or a more refined idea of when it'll be done :) -- Oliver Keyes Count Logula Wikimedia Foundation

Probabilistic User Satisfaction?
by Trey Jones 23 Jan '16

Yesterday in the quarterly review Dan mentioned that our current user satisfaction metric uses the somewhat arbitrary 10s dwell time cutoff for a successful search, and that we want to use a survey to correlate qualitative and quantitative values to pin down a better cutoff for our users. I don't remember whether Dan mentioned it, or I was just rehashing the notion on my own, but it may be difficult to pin down a specific cutoff.

A wild thought appears! Why do we have to pin down a specific cutoff? Why can't we have a probabilistic user satisfaction metric? (Other than complexity and computational speed, which may be relevant.)

We have the ability to gather so much data that we could easily compute something like this: 20% of users are satisfied when dwell time is <5s, 35% for 5-10s, 75% for 10-60s, 98% for 1m-5m, 85% for 5m-20m, and 80% for >20m.

Determining the cutoffs might be tricky, and computation is more complex than counting, but not ridiculously complicated, and potentially much more accurate for large samples. Presenting the results is still easy: "54.7% of our users are happy with their search results based on our dwell-time model".

I tried to do a quick search for papers on this topic, but I didn't find anything. I'm not familiar with the literature, so that may not mean much.

Okay, back to the TextCat mines....

-- Trey

Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
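A minimal sketch of how such a probabilistic metric could be computed, using the illustrative percentages from the message above. The bucket probabilities are the example figures, not measured values, and the handling of exact boundaries (for example, a dwell time of exactly 5s falling into the 5-10s bucket) is an assumption.

```python
from bisect import bisect_right

# Upper bounds of the dwell-time buckets, in seconds: 5s, 10s, 60s, 5m, 20m.
BUCKET_UPPER_BOUNDS = [5, 10, 60, 300, 1200]

# Illustrative probability that a user is satisfied, one value per bucket:
# <5s, 5-10s, 10-60s, 1m-5m, 5m-20m, >20m.
SATISFACTION_PROBABILITY = [0.20, 0.35, 0.75, 0.98, 0.85, 0.80]


def satisfaction_probability(dwell_seconds):
    """Look up the satisfaction probability for a single dwell time."""
    bucket = bisect_right(BUCKET_UPPER_BOUNDS, dwell_seconds)
    return SATISFACTION_PROBABILITY[bucket]


def expected_satisfaction(dwell_times):
    """Average the per-search probabilities over a sample of dwell times."""
    if not dwell_times:
        raise ValueError("need at least one dwell time")
    total = sum(satisfaction_probability(t) for t in dwell_times)
    return total / len(dwell_times)


# Made-up sample: four searches with dwell times of 3s, 7s, 30s and 2m.
sample = [3, 7, 30, 120]
score = expected_satisfaction(sample)
print(f"{score:.1%} of users are estimated to be satisfied")
```

Compared with a single 10s cutoff, which would count only two of these four searches as successes, the probabilistic version gives every search partial credit according to its bucket.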

How Wrong Would Using Out of Date Page View Data Be?
by Trey Jones 21 Jan '16

What a long, strange trip it's been. Full write up here: https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/How_Wrong_Would_Usin…

Summary:
- We can't reliably catch day-by-day outliers by using the page view information that comes along with edits because not enough edits happen.
- Weekly averages (rather than day-by-day counts) don't usually move that much (i.e., by more than a factor of 2). If we can capture daily or weekly page view stats, that should keep us reasonably up-to-date overall, esp. if these moderate swings don't affect scoring much.
- We could gather daily statistics from the page view API and store the high mark over the last 3-7 days for the top 1K to 50K most-viewed articles. The ranking algorithm could use either the rolling daily average or the high mark (whichever is higher).
- For "Trending" topics, looking at the top 1K page views every hour (unfortunately not currently available through the PageviewAPI) would be the best way to catch suddenly trending topics if we want to be more responsive, but it isn't clear that it's worth it.

-- Trey

Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
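The "rolling average or high mark, whichever is higher" suggestion can be sketched as follows. This assumes the rolling average is taken over a longer history than the high-mark window (otherwise the high mark would always win); the function name, the 7-day default, and the sample counts are all hypothetical.

```python
from statistics import mean


def popularity_signal(daily_views, high_mark_days=7):
    """Return the larger of the rolling daily average and the recent high mark.

    daily_views is a list of daily page view counts, oldest first. The average
    covers the whole list; the high mark covers only the last high_mark_days.
    """
    if not daily_views:
        return 0.0
    rolling_average = mean(daily_views)
    high_mark = max(daily_views[-high_mark_days:])
    return max(rolling_average, high_mark)


# Made-up 30-day histories for two articles.
recent_spike = [100] * 23 + [100, 120, 5000, 900, 300, 150, 110]
old_spike = [100] * 19 + [5000] + [100] * 10

recent_signal = popularity_signal(recent_spike)  # spike is inside the window
old_signal = popularity_signal(old_spike)        # spike has aged out of it
```

With these inputs the first article is scored by its recent high mark, while the second falls back to its rolling average once the spike is older than the high-mark window.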

Tracking remaining technical work on the completion suggester for fuller rollout
by Dan Garry 21 Jan '16

Hello! In the standup, I mentioned that we should collect the remaining blockers with the completion suggester together. Well, it turns out we already have a task <https://phabricator.wikimedia.org/T121616> for it! I slightly repurposed it and tweaked it, and now it's good to go. Please add your technical blockers to that task, so that we have a unified place to track the remaining issues that need to be resolved for a more full rollout. Thanks! Dan -- Dan Garry Lead Product Manager, Discovery Wikimedia Foundation

Weird dashboard values
by Oliver Keyes 19 Jan '16

Hey all, You might notice some weird spikes in the dashboard data. This is caused by some duplicated days' worth of data due to the stop-and-start of services we're dependent on. We expect to have it resolved by the end of the day. Thanks! -- Oliver Keyes Count Logula Wikimedia Foundation

Phabricator project renamed
by Oliver Keyes 15 Jan '16

Just a heads-up that we've renamed the UX sprint board in Phabricator to the Portal sprint board, since, well, that's what we're using it for ;p. It can now be found at https://phabricator.wikimedia.org/tag/discovery-portal-sprint/ Thanks! -- Oliver Keyes Count Logula Wikimedia Foundation

Redandancy heads-up
by Oliver Keyes 15 Jan '16

Heyo, I had a 1:1 with Dan today and asked what I could do to take some work off his plate/push us forward. The response was that I should take the ideas I had in the portal standup and work them into Phabricator tickets. This does not equate to 'we will work on these things' or anything else other than 'this is an idea that has been logged' - deciding we should work on a thing is for Deborah :). So if you see cards flying past, comment away to refine the ideas but don't worry that there's a process fail going on. Thanks, -- Oliver Keyes Count Logula Wikimedia Foundation

Phonetic indexing
by Trey Jones 15 Jan '16

For some reason today I wanted to look up Mikhail Baryshnikov. It's been a while so I forgot how to spell his last name. I didn't try very hard, and I got no enwiki result. Google, of course, found the correct spelling, which I then used on enwiki.

Since I used to do name searching and matching, this gave me an idea, which generalizes beyond just names. For every article title (and maybe each redirect; we could look into that) we could generate a phonetic index[1] and store those in a special Elasticsearch index. (We could look at storing multiple phonetic indexes for better recall, possibly generated by multiple algorithms; some, like Double Metaphone, generate multiple indexes by themselves.)

Then, under certain circumstances (say, zero results and no suggestion from any other source, or no result with a score above a certain cutoff, or too few results, etc.), we could make a suggestion and/or show results based on matching phonetic index plus some score (say, a mix of page views and page rank, or whatever scoring we've got going on).

So, when some doofus (hey, that's me!) comes along and searches for "borishnakoff" (worse than what I actually searched for), we could correct to *baryshnikov* (there's a page with that title) or give *Mikhail Baryshnikov* as a result (likely the top scoring item with the same phonetic index in the title), or something similar.

Other algorithms exist (and can be devised) for languages other than English, so the maximally fleshed out version of this would offer a choice of phonetic indexing algorithms, but I get ahead of myself.

*Has anyone looked into this kind of phonetic indexing for enwiki, Wikipedia in general, or other Wikimedia projects before?*

I have some additional thoughts on how to test the effectiveness of phonetic indexing on zero results for enwiki without having to fully implement everything if the index sounds like something we could afford to build.

Thoughts?

-- Trey

[1] https://en.wikipedia.org/wiki/Phonetic_algorithm
Briefly, as an example, you drop non-initial vowels and duplicate letters, and collapse letters that tend to sound alike, while taking into account orthographic conventions like sh, ch, th, initial kn- or pt-, etc. So both *baryshnikov* and *borishnakoff* are likely to come out something like BRXNGV.

Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
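As a rough illustration of the footnote, here is a toy phonetic key in Python. It is not Soundex, Metaphone, or any Elasticsearch analyzer; the digraph and sound-alike tables are deliberately tiny assumptions chosen so that the two spellings from the message collide, and the exact key it produces differs slightly from the "something like" key quoted above.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately small rule tables for English-like spellings.
DIGRAPHS = (("sh", "x"), ("ch", "x"), ("ph", "f"), ("ck", "k"))
SOUND_ALIKE = str.maketrans({"f": "v", "c": "k", "q": "k", "z": "s"})
VOWELS = set("aeiouy")


def phonetic_key(word):
    """Build a toy phonetic key for a single word."""
    s = re.sub(r"[^a-z]", "", word.lower())
    for digraph, replacement in DIGRAPHS:
        s = s.replace(digraph, replacement)   # orthographic conventions
    s = s.translate(SOUND_ALIKE)              # collapse sound-alike letters
    s = re.sub(r"(.)\1+", r"\1", s)           # drop duplicate letters
    if not s:
        return ""
    # Keep the first letter, drop every later vowel.
    return (s[0] + "".join(ch for ch in s[1:] if ch not in VOWELS)).upper()


def build_index(titles):
    """Map the phonetic key of each word in each title to matching titles."""
    index = defaultdict(set)
    for title in titles:
        for word in title.split():
            index[phonetic_key(word)].add(title)
    return index


def suggest(query, index):
    """Return titles sharing a phonetic key with a single-word query."""
    return sorted(index.get(phonetic_key(query), set()))


index = build_index(["Mikhail Baryshnikov", "Baryshnikov", "Boris Johnson"])
suggestions = suggest("borishnakoff", index)
```

In this sketch both "baryshnikov" and "borishnakoff" reduce to the same key, so the misspelled query finds both titles. Ranking the candidates by page views or another score, as the message proposes, is left out.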

Exporting SPARQL query logs?
by Stas Malyshev 14 Jan '16

Hi! I was asked about getting access to query logs for Wikidata Query Service, for research purposes. So I'd like to start the discussion on it, specifically:

1. Can we do it at all - technically, legally, privacy-wise? (note we're talking about SPARQL query text only, no other information to be provided)
2. Are there any considerations why we may want *not* to do it even if we could?
3. How hard would it be to make such export and do we have any existing infrastructure that should be used for this?

All ideas/comments about providing (or not providing :) access to this data are welcome.

--
Stas Malyshev
smalyshev(a)wikimedia.org
