Thanks Erik for summarizing the discussion so far.
The very last sentence got cut off:
But yes it's a huge engineering task with a lot of challenges :/ It's also
I think I know what was next:
... a fun engineering task with many new things to learn! :)
Even if that wasn't the next bit, it's still true.
On Fri, Mar 4, 2016 at 8:24 PM, Erik Bernhardson <
ebernhardson(a)wikimedia.org> wrote:
This thread started off-list, but I'm hoping all of you watching along
can help us brainstorm and improve search satisfaction. Note that
these aren't all my thoughts; they are a conglomeration of thoughts (many
copy/pasted from off-list emails) from Trey, David, Mikhail, and me. That's
also why this might not all read like one person wrote it.
A few weeks ago I attended ElasticON, and there was a good presentation
about search satisfaction by Paul Nelson. One of the things he thought was
incredibly important, which we had already been thinking about but hadn't
moved forward enough on, was generating an Engine Score. This week Paul
held an online webinar, which Trey attended, where he gave the same
presentation without such strict time constraints. You can find my summary
of this presentation in last week's email to this list, 'ElasticON notes'.
Some things of note:
- He doesn't like the idea of golden corpora, but his idea is
different from Trey's. He imagines a hand-selected set of "important"
queries that find "important" documents. I don't like that either (at
least not by itself). I always imagine a random selection of queries for a
golden corpus.
- He lumps BM25 in with TF/IDF and calls them ancient and unmotivated
and from the 80s and 90s. David's convinced us that BM25 is a good thing to
pursue. Of course, part of Search Technologies' purpose is to drum up
business, so they can't say, "hey, just use this in Elasticsearch," or
they'd be out of business.
- He explains the mysterious K factor that got all this started in
the first place. It controls how much weight changes far down the results
list carry. It sounds like he might tune K based on the number of
results for every query, but my question about that wasn't answered. In the
demo, he's only pulling 25 results, which Erik's click-through data shows
is probably enough.
- He mentions that 25,000 "clicks" is a good enough sized set for
measuring a score (and having random noise come out in the wash). Not clear
if he meant 25K clicks, or 25K user sessions, since it was in the Q&A.
David and Trey talked about this some, and Trey thinks the idea of
Paul's metric (Σ power(FACTOR, position) * isRelevant[user,
searchResult[Q,position].DocID]) has a lot of appeal. It's based on
clicks and user sessions, so we'd have to be able to capture all the
relevant information and make it available somewhere to replay in Relevance
Forge for assessment. We currently have a reasonable amount of clickthrough
data collected from 0.5% of desktop search sessions that we can use for
this task. There are some complications though because this is PII data and
so has to be treated carefully.
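For concreteness, here's a rough sketch of how an engine score in the
spirit of Paul's formula could be computed from replayed click data. All
names and the FACTOR value are illustrative assumptions, and clicks stand
in for the isRelevant judgment:

```python
# Sketch of an engine score in the spirit of Paul's metric:
# sum of FACTOR^position over results the user judged relevant
# (here approximated by clicks). Names and values are illustrative.

FACTOR = 0.9  # discounts relevant results found far down the list


def engine_score(sessions):
    """sessions: one list per query, where entry i is True if the
    result at position i was clicked (position 0 = top result)."""
    total = 0.0
    for clicks in sessions:
        for position, is_relevant in enumerate(clicks):
            if is_relevant:
                total += FACTOR ** position
    return total / len(sessions) if sessions else 0.0
```

Swapping the boolean click signal for a graded satisfaction score would
turn this into a weighted version along the lines of the gradable
relevance idea.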
Mikhail's goal for our user satisfaction metric is to have a function
that maps features including dwell time to user satisfaction ratio. (e.g.,
10s = 20% likely to be satisfied, 10m = 94% likely to be satisfied, etc.). The
predictive model is going to include a variety of features of varying
predictive power, such as dwell time, clickthrough rate, engagement
(scrolling), etc. One problem with the user satisfaction metric is that
it isn't replayable. We can't re-run the queries in vitro and get data on
what users think of the new results. However it does play into Nelson's
idea, discussed in the paper and maybe in the video, of gradable relevance.
Assigning a user satisfaction score to a given result would allow us to
weight various clicks in his metric rather than treating them all as equal
(though that works, too, if it's all you have).
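As a purely illustrative sketch of such a mapping (the coefficients here
are invented, not fitted; the real model would be trained on labeled
sessions with many more features than dwell time):

```python
import math


def p_satisfied(dwell_seconds, k=0.01, midpoint=120.0):
    """Toy logistic mapping from dwell time to satisfaction probability.
    k and midpoint are invented for illustration, not fitted values."""
    return 1.0 / (1.0 + math.exp(-k * (dwell_seconds - midpoint)))


# Roughly reproduces the shape above: short dwells -> low probability,
# long dwells -> high probability.
print(round(p_satisfied(10), 2))   # ~0.25
print(round(p_satisfied(600), 2))  # ~0.99
```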
We need to build a system that we are able to tune in an effective way.
As Trey pointed out, cirrus does not allow us to tune the core similarity
function params. David tends to think that we need to replace our core
similarity function with a new one that is suited to optimization. BM25
allows it; there are certainly others, and we could build our own. But
the problem will be:
How to tune these parameters in an effective way? With BM25 we will have
7 fields with 2 analyzers: 14 internal Lucene fields. BM25 allows tuning
3 params for each field: weight, k1, and b.
- weight is likely to range between 0 and 1, with maybe 2-digit precision
steps
- k1 from 1 to 2
- b from 0 to 1
And I'm not talking about the query independent factors like popularity,
pagerank & co that we may want to add. It's clear that we will have to
tackle hard search performance problems...
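A quick back-of-envelope count shows why exhaustively grid-searching that
parameter space is hopeless (the step sizes below are assumptions):

```python
# Naive grid over the BM25 tuning space sketched above:
# 14 internal Lucene fields, 3 params each. Step counts are assumptions.
n_fields = 14
weight_steps = 101  # 0.00 .. 1.00 in steps of 0.01
k1_steps = 11       # 1.0 .. 2.0 in steps of 0.1
b_steps = 11        # 0.0 .. 1.0 in steps of 0.1

per_field = weight_steps * k1_steps * b_steps
print(per_field)                        # 12221 combinations per field
print(len(str(per_field ** n_fields)))  # the full grid: ~58 digits
```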
David tends to think that we need to apply an optimization algorithm
that will search for the optimal combination according to an objective.
David doesn't think we can run such an optimization plan with A/B testing,
which is why we need a way to replay a set of queries and compute various
search engine scores.
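Such an optimization could start as simply as random search over offline
replays. Everything below is a sketch: `replay_and_score` is a
hypothetical stand-in for whatever Relevance Forge ends up exposing, and
the parameter ranges come from the list above:

```python
import random


def random_params(fields):
    """Draw one candidate setting per field (ranges from the list above)."""
    return {f: {"weight": round(random.uniform(0.0, 1.0), 2),
                "k1": random.uniform(1.0, 2.0),
                "b": random.uniform(0.0, 1.0)}
            for f in fields}


def optimize(replay_and_score, fields, iterations=100):
    """Random search: replay the fixed query set offline for each
    candidate and keep the best-scoring parameter combination."""
    best_params, best_score = None, float("-inf")
    for _ in range(iterations):
        params = random_params(fields)
        score = replay_and_score(params)  # offline replay, no A/B test
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

A real version would use something smarter than random search (coordinate
ascent, Bayesian optimization, etc.), but the replay-then-score loop is
the same shape.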
We don't know what the best approach is here:
- extract the metrics from the search satisfaction schema that do not
require user intervention (clicks and result positions).
- build our own set of queries with the tool Erik is building (temporary
location:
http://portal.wmflabs.org/search/index.php)
-- Erik thinks we should do both, as they will give us completely
different sets of information. The metrics about what our users are doing
are a great source of information and provide a good signal. The tool Erik
is building comes at the problem from a different direction, sourcing
search results from wiki/google/bing/ddg and getting humans to rate which
results are relevant/not relevant on a scale of 1 to 4. This can be used
with other algorithms to generate an independent score. Essentially, I
think Relevance Forge should output a multi-dimensional engine score and
not just a single number.
-- We should set up records of how this engine score changes over days,
months, and longer, so we can see a rate of improvement (or lack thereof,
but hopefully improvement :)
And in the end, will this (BM25 and/or searching with weights per field)
work?
- not sure; maybe the text features we have today are not relevant and we
need to spend more time on extracting relevant text features from the
mediawiki content model (
https://phabricator.wikimedia.org/T128076),
but we should be able to say: this field has no impact, or only a bad
one.
The big picture would be:
- Refactor cirrus in a way that everything is suited for optimization
- search engine score: the objective (Erik added it as goal)
- Optimization algorithm to search/tune the system params. Trey has prior
experience working within optimization frameworks. Mikhail also has
relevant machine learning experience.
- A/B testing with advanced metrics to confirm that the optimization
found a good combination
With a framework like that we could spend more time on big impact text
features (wikitext, synonyms, spelling correction ...).
But yes it's a huge engineering task with a lot of challenges :/ It's
also
_______________________________________________
discovery mailing list
discovery(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/discovery