I've put together some initial patches to our Relevance Forge over the weekend to move this forward. Relevance Forge can now source queries and click-through data from the satisfaction schema to replay against an Elasticsearch cluster. It calculates Paul's engine score (I need a better name than "engine score" for that...) with a single CirrusSearch config, or it can apply a grid search over a multi-dimensional set of CirrusSearch config values (numbers only for now) to try to find optimal values for some parameters.
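
To make the shape of that concrete, the grid search is roughly a loop like the following (a simplified sketch, not the actual Relevance Forge code; the parameter names and the run_queries / engine_score helpers are stand-ins for the real internals):

import itertools

# Hypothetical parameter grid over numeric CirrusSearch config values.
PARAM_GRID = {
    'phrase_boost': [0.5, 1.0, 2.0],
    'title_weight': [1.0, 2.0, 3.0],
}

def grid_search(queries, clicks, run_queries, engine_score):
    """Replay the same queries under every config combination and keep the best.

    run_queries(queries, config) -> ranked results and engine_score(results, clicks)
    -> float are stand-ins for the real Relevance Forge internals."""
    best_score, best_config = float('-inf'), None
    keys = sorted(PARAM_GRID)
    for values in itertools.product(*(PARAM_GRID[k] for k in keys)):
        config = dict(zip(keys, values))
        results = run_queries(queries, config)   # replay against the test cluster
        score = engine_score(results, clicks)    # e.g. Paul's engine score
        if score > best_score:
            best_score, best_config = score, config
    return best_config, best_score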

In the first couple of test runs it looks to (unsurprisingly) heavily prefer our current results. I say unsurprisingly because the satisfaction schema shows ~87% of clicks landing within the top 3 results. Additionally, Paul's engine score is meant to work best when grouping by users over some length of time, whereas we only have ~10 minute sessions containing an average of 2 searches each. Because of this I have also implemented nDCG calculation within Relevance Forge (which was relatively easy), but unfortunately we do not yet have the data to plug into this equation to generate a score. Queries being PII makes this a bit difficult, but I'm sure we can find a reasonable way. I demoed a proof of concept at last week's CREDIT showcase that makes collecting these scores less of a chore, but it needs to be expanded.

There are at least two main things that have to happen before we can start calculating nDCG. I really like Justin's idea of pulling in results from other search engines to be scored; this would allow us to improve recall and see a positive improvement in the nDCG score. We also have to either get a set of queries whitelisted through legal, so we can run this in labs and have our own scoring plus assistance from the community, or build up a database of queries and distribute it to team members so they can rate result relevance for those queries on their local machines, and then aggregate those databases back together (a pain... there must be a better way).
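
For reference, the nDCG calculation boils down to something like this (a sketch rather than the exact Relevance Forge implementation; it assumes graded relevance labels such as the 1-to-4 ratings mentioned further down the thread):

from math import log2

def dcg(relevances):
    """Discounted cumulative gain for a ranked list of graded relevance labels."""
    return sum((2 ** rel - 1) / log2(pos + 2)
               for pos, rel in enumerate(relevances))

def ndcg(result_relevances, k=None):
    """DCG of the returned ranking, normalized by the DCG of the ideal ordering."""
    rels = result_relevances[:k] if k else result_relevances
    ideal_dcg = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal_dcg if ideal_dcg > 0 else 0.0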

One issue I see is that neither of these metrics, Paul's engine score or nDCG, directly measures the impact of fixing our recall problems, although perhaps I've misread how the idealized DCG is supposed to work. Pulling in results from other search engines will help, so at least improved recall won't result in a reduced score, but I was thinking maybe we could calculate an alternate idealized DCG which uses the highest-rated documents we know of for a query, rather than just re-sorting the returned result set by relevance score?
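
Concretely, the difference would look something like this (reusing the dcg helper from the sketch above; `labels` is a hypothetical mapping from doc id to human rating for a single query):

def idcg_from_results(result_relevances, k):
    """Standard IDCG: re-sort only the documents the engine actually returned."""
    return dcg(sorted(result_relevances, reverse=True)[:k])

def idcg_from_pool(labels, k):
    """Alternate IDCG: take the k highest-rated documents we know about for the
    query, whether or not the engine returned them."""
    return dcg(sorted(labels.values(), reverse=True)[:k])

With the pooled version, a highly rated document we fail to return drags the score down, so recall fixes would actually show up in the metric.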

On Mon, Mar 7, 2016 at 6:05 AM, Trey Jones <tjones@wikimedia.org> wrote:
Thanks, Erik, for summarizing the discussion so far.

The very last sentence got cut off:

But yes it's a huge engineering task with a lot of challenges :/ It's also 

I think I know what was next:

... a fun engineering task with many new things to learn! :)

Even if that wasn't the next bit, it's still true. 


On Fri, Mar 4, 2016 at 8:24 PM, Erik Bernhardson <ebernhardson@wikimedia.org> wrote:
This thread started off-list, but I'm hoping all of you watching along can help us brainstorm ways to improve search satisfaction. Note that these aren't all my thoughts; they are a conglomeration of thoughts (many copy/pasted from off-list emails) from Trey, David, Mikhail, and me. That's also why this might not all read like one person wrote it.

A few weeks ago I attended ElasticON, where there was a good presentation about search satisfaction by Paul Nelson. One of the things he thought was incredibly important, and that we had already been thinking about but hadn't moved forward enough on, was generating an Engine Score. This week Paul held an online webinar, which Trey attended, where he gave the same presentation without such strict time constraints. You can find my summary of this presentation in last week's email to this list, 'ElasticON notes'.

Some things of note:
  • He doesn't like the idea of golden corpora—but his idea is different from Trey's. He imagines a hand-selected set of "important" queries that find "important" documents. I don't like that either (at least not by itself). I always imagine a random selection of queries for a golden corpus.
  • He lumps BM25 in with TF/IDF and calls them ancient, unmotivated, and from the 80s and 90s. David's convinced us that BM25 is a good thing to pursue. Of course, part of Search Technologies' purpose is to drum up business, so they can't say, "hey, just use this in Elasticsearch" or they'd be out of business.
  • He explains the mysterious K factor that got all this started in the first place. It controls how much weight changes far down the results list carry. It sounds like he might tune K based on the number of results for every query, but my question about that wasn't answered. In the demo, he's only pulling 25 results, which Erik's click-through data shows is probably enough.
  • He mentions that 25,000 "clicks" is a big enough set for measuring a score (and letting random noise come out in the wash). It's not clear whether he meant 25K clicks or 25K user sessions, since it came up in the Q&A.

David and Trey talked about this some, and Trey thinks the idea of Paul's metric (Σ over positions of FACTOR^position × isRelevant[user, searchResult[Q, position].DocID]) has a lot of appeal. It's based on clicks and user sessions, so we'd have to be able to capture all the relevant information and make it available somewhere to replay in Relevance Forge for assessment. We currently have a reasonable amount of click-through data collected from 0.5% of desktop search sessions that we can use for this task. There are some complications, though, because this is PII data and so has to be treated carefully.
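
One plausible reading of that metric as code, for illustration only (the session and click field names are made up, and FACTOR is the K-style position decay from the webinar):

FACTOR = 0.9  # decay per result position; the "K factor" controls how fast this falls off

def engine_score(sessions):
    """Average per-session score: clicks high in the ranking count more than
    clicks far down the list, and unclicked results count nothing."""
    total = 0.0
    for session in sessions:
        session_score = 0.0
        for query in session['queries']:
            for position, doc_id in enumerate(query['result_doc_ids']):
                if doc_id in session['clicked_doc_ids']:   # stand-in for isRelevant
                    session_score += FACTOR ** position
        total += session_score
    return total / len(sessions)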

Mikhail's goal for our user satisfaction metric is to have a function that maps features, including dwell time, to a probability of user satisfaction (e.g., 10s dwell time = 20% likely to be satisfied, 10m = 94% likely to be satisfied, etc.). The predictive model is going to include a variety of features of varying predictive power, such as dwell time, clickthrough rate, engagement (scrolling), etc. One problem with the user satisfaction metric is that it isn't replayable: we can't re-run the queries in vitro and get data on what users think of the new results. However, it does play into Nelson's idea, discussed in the paper and maybe in the video, of graded relevance. Assigning a user satisfaction score to a given result would allow us to weight the various clicks in his metric rather than treating them all as equal (though that works, too, if it's all you have).
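
A minimal sketch of the kind of model Mikhail is describing, assuming we can get labeled (or proxy-labeled) visits; the feature set, labels, and library choice here are purely illustrative:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-visit features: dwell time (s), clickthrough (0/1), scroll events.
X = np.array([[8, 1, 0], [45, 1, 3], [600, 1, 12], [3, 0, 0]])
y = np.array([0, 0, 1, 0])  # proxy label: did the visit look "satisfied"?

model = LogisticRegression().fit(X, y)

# Map a new visit's features to a satisfaction probability, e.g. 10s vs 10m dwell.
print(model.predict_proba([[10, 1, 1], [600, 1, 1]])[:, 1])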

We need to build a system that we are able to tune in an effective way.
As Trey pointed out, Cirrus does not allow us to tune the core similarity function's parameters. David tends to think that we need to replace our core similarity function with a new one that is suited to optimization; BM25 allows this, there are certainly others, and we could build our own. But the problem will be:

How do we tune these parameters in an effective way? With BM25 we will have 7 fields with 2 analyzers each: 14 internal Lucene fields. BM25 allows us to tune 3 params per field: weight, k1, and b.
- weight is likely to range between 0 and 1, with maybe 2-digit precision steps
- k1 from 1 to 2
- b from 0 to 1
And I'm not even talking about the query-independent factors like popularity, PageRank & co that we may want to add. It's clear that we will have to tackle hard search performance problems... (a quick sizing of the full grid is below).
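
A back-of-the-envelope count of what an exhaustive grid over those ranges would look like (the step sizes are guesses, just to show the order of magnitude):

# Assumed step sizes per parameter, per field.
weight_steps = 101   # 0.00 .. 1.00 in 0.01 steps
k1_steps = 11        # 1.0 .. 2.0 in 0.1 steps
b_steps = 11         # 0.0 .. 1.0 in 0.1 steps

fields = 14          # 7 fields * 2 analyzers
per_field = weight_steps * k1_steps * b_steps   # 12,221 combinations per field
full_grid = per_field ** fields                 # ~1.7e57 combinations in total
print(per_field, full_grid)

So exhaustively grid-searching the full space is clearly out; we can only grid-search a few parameters at a time or use something smarter.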

David tends to think that we need to apply an optimization algorithm that will search for the optimal combination according to an objective. David doesn't think we can run such an optimization plan with A/B testing, which is why we need a way to replay a set of queries and compute various search engine scores; one possible shape of that loop is sketched below.
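
A sketch only: random search stands in here for whatever algorithm we actually choose, and replay_and_score is a hypothetical hook into the replay-and-scoring machinery.

import random

# Per-field BM25 parameter bounds (weight, k1, b), using the ranges guessed above.
BOUNDS = {'weight': (0.0, 1.0), 'k1': (1.0, 2.0), 'b': (0.0, 1.0)}

def random_config(fields):
    """Draw one full config: a (weight, k1, b) triple for every Lucene field."""
    return {f: {p: random.uniform(lo, hi) for p, (lo, hi) in BOUNDS.items()}
            for f in fields}

def optimize(fields, replay_and_score, iterations=200):
    """Offline loop: sample a config, replay the query set, keep the best score.

    replay_and_score(config) -> float is a stand-in for running the query set
    through Relevance Forge and computing the chosen engine score."""
    best_score, best_config = float('-inf'), None
    for _ in range(iterations):
        config = random_config(fields)
        score = replay_and_score(config)
        if score > best_score:
            best_score, best_config = score, config
    return best_config, best_score
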
We don't know what the best approach is here:
- extract the metrics from the search satisfaction schema that do not require user intervention (clicks and result positions).
- build our own set of queries with the tool Erik is building (temporary location: http://portal.wmflabs.org/search/index.php)
-- Erik thinks we should do both, as they will give us completely different sets of information. The metrics about what our users are doing are a great source of information and provide a good signal. The tool Erik is building comes at the problem from a different direction, sourcing search results from wiki/google/bing/ddg and getting humans to rate which results are relevant/not relevant on a scale of 1 to 4. This can be used with other algorithms to generate an independent score. Essentially, I think the best Relevance Forge output will be a multi-dimensional engine score and not just a single number.
-- We should keep records of how this engine score changes over days, months, and longer, so we can see a rate of improvement (or lack thereof, but hopefully improvement :)

And in the end, will this (BM25 and/or searching with per-field weights) work?
- not sure; maybe the text features we have today are not relevant and we need to spend more time on extracting relevant text features from the MediaWiki content model (https://phabricator.wikimedia.org/T128076),
  but we should at least be able to say: this field has no impact, or only a bad one.

The big picture would be:
- Refactor Cirrus so that everything is suited for optimization
- Search engine score: the objective (Erik added it as a goal)
- Optimization algorithm to search/tune the system params. Trey has prior experience working within optimization frameworks. Mikhail also has relevant machine learning experience.
- A/B testing with advanced metrics to confirm that the optimization found a good combination

With a framework like that we could spend more time on big-impact text features (wikitext, synonyms, spelling correction ...).
But yes it's a huge engineering task with a lot of challenges :/ It's also 

_______________________________________________
discovery mailing list
discovery@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/discovery


