On 07/03/2016 at 19:01, Erik Bernhardson wrote:
I've put together some initial patches to our
Relevance Forge over the
weekend to move this forward. Relevance Forge can now source queries
and click-through data from the satisfaction schema to replay against
an elasticsearch cluster. It calculates Paul's engine score (I need a
better name than engine score for that...) with a single CirrusSearch
config, or it can apply a grid search over a multi-dimensional set of
CirrusSearch config values (numbers only for now) to try to find the
optimal values for some parameters.
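The grid search loop itself is simple enough to sketch; here `toy_score` stands in for the real engine score (which would replay logged queries against the cluster), and the parameter names are illustrative, not the actual CirrusSearch config keys:

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Exhaustively evaluate every combination of config values with
    score_fn and return the best-scoring configuration."""
    best_score, best_config = float("-inf"), None
    names = list(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        config = dict(zip(names, values))
        score = score_fn(config)
        if score > best_score:
            best_score, best_config = score, config
    return best_config, best_score

# Toy stand-in for the real engine score, which would replay logged
# queries against an elasticsearch cluster using the given config.
def toy_score(config):
    return (-(config["phrase_boost"] - 0.3) ** 2
            - (config["incoming_links"] - 1.0) ** 2)

grid = {
    "phrase_boost": [0.1, 0.3, 1.0, 10.0],  # hypothetical values
    "incoming_links": [0.5, 1.0, 2.0],
}
best, score = grid_search(grid, toy_score)
```

Narrowing the grid around the best point and re-running gives the iterative refinement used in the experiments below.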
In the first couple of test runs it (unsurprisingly) appears to
heavily prefer our current results. I say unsurprisingly because the
satisfaction schema shows ~87% of clicks being within the top 3
results. Additionally, Paul's engine score is meant to work best when
grouping by users over some length of time, whereas we only have ~10
minute sessions containing an average of 2 searches per session.
Because of this I have also implemented nDCG calculation within
Relevance Forge (it was relatively easy), but unfortunately we do not
have the data to plug into this equation to generate a score. Queries
being PII can make this a bit difficult, but I'm sure we can find a
reasonable way. I demoed a POC at last week's CREDIT showcase that
makes collecting these scores less of a chore, but it needs to be
expanded. There are at least two main things that have to happen
before we can start calculating nDCG. I really like Justin's idea to
pull in results from other search engines to be scored; this would
allow us to improve recall and see a positive improvement in the nDCG
score. We also have to either get a set of queries whitelisted through
legal so we can run this in labs and have our own scoring plus assistance
from the community, or build up a database of queries, distribute it
to team members so they can rate the relevance of results on their
local machines, and then aggregate those databases back together (a
pain... there must be a better way).
One issue I see is that neither of these metrics, Paul's engine score
or nDCG, directly deals with measuring the impact of fixing our recall
problems, although perhaps I misread how the idealized DCG is supposed
to work. Pulling in results from other search engines will help, so at
least improved recall won't result in a reduced score, but I was
thinking maybe we could calculate an alternate idealized DCG which
uses the highest-rated documents known for a query, rather than just
re-sorting the returned result set by the relevance rating?
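To make the distinction concrete, here is a minimal sketch of both normalizations, using the common 2^rel - 1 gain formulation (an assumption; the actual Relevance Forge code may differ). The standard ideal re-sorts the returned ratings, while the alternate ideal uses the best-rated documents known for the query, whether or not they were retrieved:

```python
from math import log2

def dcg(ratings):
    """Discounted cumulative gain for a ranked list of relevance ratings."""
    return sum((2 ** r - 1) / log2(i + 2) for i, r in enumerate(ratings))

def ndcg(result_ratings, ideal_ratings):
    """Normalize the DCG of the returned ranking by the DCG of an ideal one."""
    ideal = dcg(sorted(ideal_ratings, reverse=True))
    return dcg(result_ratings) / ideal if ideal > 0 else 0.0

# Hypothetical ratings (0-3) of what the engine returned, in rank order:
returned = [1, 3, 0, 2]
# Standard nDCG: the ideal is the same result set, re-sorted by rating.
standard = ndcg(returned, returned)
# Alternate: the ideal is the set of best-rated documents known for this
# query, including a rating-3 doc the engine failed to retrieve.
alternate = ndcg(returned, [3, 3, 2, 1])
```

Under the alternate normalization a missed relevant document lowers the score, so improved recall can actually move the number, which is the point of the suggestion.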
Thanks Erik, your new patch was a revelation! :)
Yes, I totally agree: none of these metrics will help us find the
features needed to improve recall. If a result is not in the top 20 it
can't be evaluated by the score.
Paul's score and query clicks grabbed from search logs will help us to
tune the existing features but won't help us to discover the missing ones.
Both nDCG and Paul's score measure the ability of the system to
rank the top 20 results properly, but unfortunately we know nothing about
the results that are outside the top 20: i.e. are the relevant docs found
or are they missed?
Is it actually a ranking-precision problem that shows up as bad recall
in the top 20, or a pure recall problem because the result is not even in
the whole result set?
Example:
Some users pointed out the inability of the system to properly weight
words in the title.
On enwiki, here are some queries where relevant results are not in the top
20 but are in the whole result set (page 2 or beyond):
- history france => History of France
- france middle ages => France in the Middle Ages
- syrian civil war casualties => Casualties of the Syrian Civil War
- cities in the san francisco bay area => List of cities and towns in
the San Francisco Bay Area
- new york university alumni => List of New York University alumni
And this one, which is imho very representative of the problem:
- legend film 2015 => Legend (2015 film)
In these cases our "recall problem" is in reality a ranking-precision
problem, where "recall features" (synonyms, spell corrections,
phonetic matching...) would not really help (they could even make it worse).
All these queries were reported on phabricator or on this list. I'm not
sure we could really find a score that helps us discover such
problems. These scores are only useful for tuning the top-N ranking.
We should maybe start building a small set of queries that are easily
mapped to a system feature, more like a simple regression test. This is
maybe what Paul suggests when he says that the set of queries you chose
is not necessarily representative of actual usage (search logs and
clicks) but is relevant to the system?
When you say "I was thinking maybe we could calculate an alternate
idealized DCG which uses the highest rated documents for a query, rather
than just resorting the result set by the relevance score?"
Do you mean: you want a system that is able to say, if result Y is not
returned in the top 10 for query X then it's a problem?
A score that acts as a set of constraints, a kind of "must-have" set of
results for a specific query?
If so, I agree with you, and it is what I did naively when playing with
your patch (see below).
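As a sketch, the "must-have" idea reduces to a trivial check that could run alongside the score; the function and data layout here are hypothetical, not part of the patch:

```python
def check_constraints(results_by_query, constraints):
    """Return the constraints that are violated. Each constraint says a
    given page must appear in the top-N results for a given query.

    constraints: list of (query, must_have_title, top_n) tuples.
    results_by_query: dict mapping query -> ranked list of result titles.
    """
    violations = []
    for query, title, top_n in constraints:
        if title not in results_by_query.get(query, [])[:top_n]:
            violations.append((query, title, top_n))
    return violations

# The constraints I used in practice, expressed in this form:
constraints = [
    ("kennedy", "JFK", 10),
    ("sf novelist", "List of science fiction authors", 10),
]
# Hypothetical replayed results for each query:
results = {
    "kennedy": ["John F. Kennedy", "JFK"] + ["..."] * 8,
    "sf novelist": ["List of science fiction authors"],
}
violations = check_constraints(results, constraints)
```

An optimizer run that produces a non-empty violation list can then be rejected regardless of its score, which is the "semi-automatic" optimization discussed at the end of this mail.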
Experiments with your patch:
I've set up BM25 and a custom query builder to specifically address the
poor precision on words in the title (BM25 was needed because the
default lucene TF/IDF is very bad and does not allow fine-tuning: the
score range is too big and noisy).
I tried to manually constrain the system with the following queries on
an enwiki index (
http://en-suggesty.wmflabs.org/wiki/Special:Search):
- kennedy => JFK in the top 10 (feature: boost on incoming links/page views)
- sf novelist (and sf novelist*s*) => List of science fiction authors
(a constraint not to overboost phrase or title matches)
- all the queries listed above concerning poor title matching
Manual tuning was extremely hard and misleading: the phrase rescore boost
was initially 10 (the current default) and I had to decrease it to 0.2
(playing with the various weight settings took a lot of my time).
Surprisingly, automatic tuning with Erik's patch reached nearly the same
conclusion with a totally different query set; the preferred phrase boost
was 0.3 (it took a few hours to compute).
Paul's score started from 0.50 and maxed out at 0.58, below is the
result of 4 iterations where I gradually narrowed the search space:
-
https://drive.google.com/drive/folders/0Bzo2vOqfrXhJMzFmcWthZ0Ytc2c
- axis: x is the weight of the phrase rescore => best 0.3
- axis: y is the weight of incoming_links => best 1.0
Note that my initial constraints were still respected (and even slightly
better in some cases).
The incoming links weight was way higher than I expected, but as Erik
said, that's certainly because Paul's score prefers the current results
and the formula currently used on enwiki tends to overboost incoming
links.
Then, because it was fun to play with, I tried to replace the number of
incoming links with pageviews data as a query-independent factor (in a
previous test the system seemed to prefer incoming links).
I played with the weight (x axis) and k (a factor that re-arranges the
pageviews distribution):
I was unable to isolate an optimal point, but the score maxed out at 0.61
(vs 0.58 with incoming links) with a very high weight (maybe more than 2):
-
https://drive.google.com/file/d/0Bzo2vOqfrXhJdTFNUEZEanhzQVU/view?usp=shari…
(sorry, it's hard to read because k is very low; a log-scale graph would
be more useful here).
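For context, a common shape for folding a raw count like pageviews into a score is the saturation x/(x+k), where k is the half-saturation point; I'm assuming the k above plays a similar role (an assumption, not a reading of the patch). With a very small k almost every page saturates, which would explain the hard-to-read graph:

```python
def saturation(x, k):
    """Squash a raw count x into [0, 1); k is the half-saturation
    point: saturation(k, k) == 0.5."""
    return x / (x + k)

def qif_boost(pageviews, weight, k):
    """Query-independent factor built from the saturated pageview
    signal (an assumed form, not necessarily what the patch does)."""
    return 1.0 + weight * saturation(pageviews, k)

# With k tiny relative to typical pageview counts, two pages with very
# different traffic get nearly identical boosts -- the signal flattens.
flat_gap = abs(saturation(1000, 10) - saturation(100000, 10))
```

On a log scale in k the curve families separate, which is why a log-scale plot would make the graph readable.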
Here I think we start to "over-optimize" the params against Paul's
score query set, but again, surprisingly, my initial constraints were
nearly respected (kennedy => JFK was out of the top 10 but still on the
first page at #13)...
I ran the same optimization again with incoming links to double-check.
As before, I played with the weight (x axis) and k; the score maxed out
at 0.58:
-
https://drive.google.com/file/d/0Bzo2vOqfrXhJUmMxV2V0OThZUE0/view?usp=shari…
But with the optimal settings k=10 and w=1.2 my constraints were broken:
kennedy => JFK was no longer on the first page, so I stopped here.
By combining a set of sensible constraints, I tend to think that
"semi-automatic" optimization is possible.
Note that the number of queries used in this experiment is way too low
to draw any conclusions, but I think it proves that the *method* could
work. With a larger query set and more constraints we could maybe
conclude that pageviews are a better signal than incoming links?