On 07/03/2016 19:01, Erik Bernhardson wrote:
I've put together some initial patches to our Relevance Forge over the weekend to move this forward. Relevance Forge can now source queries and click-through data from the satisfaction schema and replay them against an elasticsearch cluster. It calculates Paul's engine score (I need a better name than "engine score" for that...) for a single CirrusSearch config, or it can run a grid search over a multi-dimensional set of CirrusSearch config values (numbers only for now) to try to find the optimal values for some parameters.
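To make the grid search part concrete, the loop is essentially the following (a rough sketch only; the parameter names and the engine_score helper are stand-ins, not the actual Relevance Forge code):

    import itertools

    # Hypothetical search space over two numeric weights; the real config
    # keys Relevance Forge manipulates may differ.
    search_space = {
        'phrase_rescore_boost': [0.1, 0.3, 1.0, 3.0, 10.0],
        'incoming_links_weight': [0.5, 1.0, 2.0],
    }

    def engine_score(config):
        """Placeholder: replay the logged queries against a cluster running
        with `config` and score the results against the recorded clicks."""
        raise NotImplementedError

    best_score, best_config = None, None
    for values in itertools.product(*search_space.values()):
        config = dict(zip(search_space.keys(), values))
        score = engine_score(config)
        if best_score is None or score > best_score:
            best_score, best_config = score, config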

In the first couple of test runs it looks to (unsurprisingly) heavily prefer our current results. I say unsurprisingly because the satisfaction schema shows ~87% of clicks being within the top 3 results. Additionally, Paul's engine score is meant to work best when grouping by users over some length of time, whereas we only have ~10 minute sessions containing an average of 2 searches per session. Because of this I have also implemented nDCG calculation within Relevance Forge (it was relatively easy), but unfortunately we do not have the data to plug into this equation to generate a score. Queries being PII can make this a bit difficult, but I'm sure we can find a reasonable way. I demoed a POC at last week's CREDIT showcase that makes collecting these relevance ratings less of a chore, but it needs to be expanded.

There are at least two main things that have to happen before we can start calculating nDCG. I really like Justin's idea to pull in results from other search engines to be scored; this would let us improve recall and actually see a positive improvement in the nDCG score. We also have to either get a set of queries whitelisted through legal so we can run this in labs and have our own scoring plus assistance from the community, or build up a database of queries, distribute it to team members so they can rate result relevance on their local machines, and then aggregate those databases back together (a pain... there must be a better way).
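As for the nDCG implementation itself, it boils down to something like this (just a sketch, not the exact Relevance Forge code, and assuming graded relevance judgements on whatever scale the raters end up using):

    import math

    def dcg(grades):
        # Discounted cumulative gain of a ranked list of relevance grades.
        return sum((2 ** g - 1) / math.log2(pos + 2)
                   for pos, g in enumerate(grades))

    def ndcg(ranked_grades, ideal_grades, k=20):
        # DCG of the returned ranking divided by the DCG of the best possible
        # ranking built from `ideal_grades`.
        ideal = dcg(sorted(ideal_grades, reverse=True)[:k])
        return dcg(ranked_grades[:k]) / ideal if ideal > 0 else 0.0

The open question is what to use as the ideal ranking, which is where the recall concern below comes in.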

One issue I see is that neither of these metrics, Paul's engine score or nDCG, directly measures the impact of fixing our recall problems, although perhaps I misread how the idealized DCG is supposed to work. Pulling in results from other search engines will help, so at least improved recall won't result in a reduced score, but I was thinking maybe we could calculate an alternate idealized DCG that uses the highest-rated documents for a query, rather than just re-sorting the result set by the relevance score?
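In terms of the sketch above, the difference would only be in what gets passed as the ideal grades: the usual version re-sorts the grades of the documents the engine actually returned, while the alternate version would rank the best known documents for the query whether we retrieved them or not. Roughly (with `judgements` a hypothetical map from page id to grade for one query):

    def conventional_ideal(retrieved_ids, judgements):
        # Ideal grades restricted to what was actually retrieved: a recall
        # miss never lowers the denominator.
        return [judgements.get(page_id, 0) for page_id in retrieved_ids]

    def recall_aware_ideal(judgements):
        # Ideal grades over every judged document for the query: failing to
        # retrieve a highly rated page now costs nDCG.
        return list(judgements.values())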

Thanks Erik, your new patch was a revelation! :)

Yes, I totally agree: none of these metrics will help us find the features needed to improve recall. If a result is not in the top 20 it can't be evaluated by the score.
Paul's score and query clicks grabbed from search logs will help us tune the existing features but won't help us discover the missing ones.
Both nDCG and Paul's score measure the system's ability to rank the top 20 results properly, but unfortunately we know nothing about the results outside the top 20: i.e. are the relevant docs found at all, or are they missed?
Is it a ranking precision problem that makes recall in the top 20 look bad, or a pure recall problem because the relevant result is not in the result set at all?

Example:
Some users have pointed out the system's inability to properly weight words in the title.
On enwiki, here are some queries where the relevant result is not in the top 20 but is in the overall result set (page 2 or later):
- history france: => History of France
- france middle ages => France in the Middle Ages
- syrian civil war casualties => Casualties of the Syrian Civil War
- cities in the san francisco bay area => List of cities and towns in the San Francisco Bay Area
- new york university alumni => List of New York University alumni
And this one, which is IMHO very representative of the problem:
- legend film 2015 => Legend (2015 film)

In these cases our "recall problem" is in reality a ranking precision problem where "recall features" (synonyms, spell corrections, phonetic...) would not really help (they could even make things worse).
All of these queries were reported on Phabricator or on this list. I'm not sure we can really find a score that helps us discover such problems; these scores are only useful for tuning the top-N ranking.
Maybe we should start to build a small set of queries, each easily mapped to a system feature, more like a simple regression test. This may be what Paul suggests when he says that the set of queries you choose is not necessarily representative of actual usage (search logs and clicks) but is relevant to the system?

When you say "I was thinking maybe we could calculate an alternate idealized DCG that uses the highest-rated documents for a query, rather than just re-sorting the result set by the relevance score?"
Do you mean: I want a system that is able to say "if result Y is not returned in the top 10 for query X then it's a problem"?
A score that acts as a set of constraints, a kind of "must-have" results list for a specific query?
If so, I agree with you, and that is what I did naively when playing with your patch (see below).
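(For illustration, the naive check amounts to something like the snippet below, with a couple of the constraints I describe next; the `top_titles` helper is a placeholder for "run the query against the candidate configuration and return the ranked page titles".)

    # "Must-have" constraints: for each query, a page that has to appear in
    # the top N or the candidate configuration is rejected.
    CONSTRAINTS = [
        ("kennedy", "JFK", 10),
        ("legend film 2015", "Legend (2015 film)", 20),
        ("history france", "History of France", 20),
    ]

    def satisfies_constraints(top_titles, constraints=CONSTRAINTS):
        # top_titles(query, n) is a placeholder returning the first n ranked
        # page titles for `query` under the configuration being evaluated.
        return all(expected in top_titles(query, n)
                   for query, expected, n in constraints)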

Experiments with your patch:
I've set up BM25 and a custom query builder to specifically address the poor precision on words in the title (BM25 was needed because the default Lucene TF/IDF is very bad and does not allow fine-tuning: the score range is too big and noisy).
I tried to manually constrain the system with the following queries on an enwiki index (http://en-suggesty.wmflabs.org/wiki/Special:Search):
- kennedy => JFK in the top 10 (feature: boost on incoming links/page views)
- sf novelist (and sf novelist*s*) => List of science fiction authors (constraint not to overboost phrase or title)
- all the queries listed above concerning poor title match

Manual tuning was extremely hard and misleading: the phrase rescore boost was initially 10 (the current default) and I had to decrease it to 0.2 (it took a lot of my time playing with various weight settings).
Surprisingly, automatic tuning with Erik's patch reached nearly the same conclusion with a totally different query set: the preferred phrase boost was 0.3 (it took a few hours to compute).
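(For context, the weight in question ends up as the rescore query weight of the second-pass phrase query in the elasticsearch request, roughly like the simplified sketch below; the actual query CirrusSearch builds is considerably more involved and the field name here is made up.)

    # Simplified shape of a query with a phrase rescore pass, just to show
    # where the tuned weight lives; not the query CirrusSearch really builds.
    def phrase_rescore_body(user_query, phrase_weight=0.3, window=512):
        return {
            "query": {"match": {"text": user_query}},
            "rescore": {
                "window_size": window,
                "query": {
                    "rescore_query": {"match_phrase": {"text": user_query}},
                    "query_weight": 1.0,
                    # the knob the tuning converged on: ~0.3 instead of the
                    # old default of 10
                    "rescore_query_weight": phrase_weight,
                },
            },
        }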
Paul's score started at 0.50 and maxed out at 0.58; below are the results of 4 iterations where I gradually narrowed the search space:
- https://drive.google.com/drive/folders/0Bzo2vOqfrXhJMzFmcWthZ0Ytc2c
- x axis: weight of the phrase rescore => best 0.3
- y axis: weight of incoming_links => best 1.0
Note that my initial constraints were still respected (and in some cases even slightly better satisfied).
The incoming links weight was way higher than I expected, but as Erik said this is probably because Paul's score prefers the current results and the formula currently used on enwiki tends to overboost incoming links.

Then, because it was fun to play with, I tried to replace the number of incoming links with pageview data as a query-independent factor (in a previous test the system seemed to prefer incoming links).
I played with the weight (x axis) and k (a factor that reshapes the pageview distribution):
I was unable to isolate an optimal point, but the score maxed out at 0.61 (vs 0.58 with incoming links) with a very high weight (maybe more than 2):
- https://drive.google.com/file/d/0Bzo2vOqfrXhJdTFNUEZEanhzQVU/view?usp=sharing (sorry, it's hard to read because k is very low; a log-scale graph would be more useful here).
Here I think we start to "over-optimize" the params against Paul's score query set, but again, surprisingly, my initial constraints were nearly respected (kennedy => JFK fell out of the top 10 but was still on the first page at #13)...
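(To be explicit about what w and k mean here: the boost I was tuning is a query-independent factor along these lines; the exact transform is an assumption for illustration, the idea being that k squashes the very long-tailed pageview distribution before the weight w is applied.)

    def query_independent_boost(pageviews, w=1.0, k=50.0):
        # One plausible shape (an assumption, not the exact formula from the
        # experiment): a saturating transform pv / (pv + k) so a handful of
        # extremely popular pages does not dominate, scaled by the weight w
        # explored on the x axis of the graphs above.
        return 1.0 + w * (pageviews / (pageviews + k))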

I ran the same optimization again with incoming links to double-check:
Here again I played with the weight (x axis) and k; the score maxed out at 0.58:
- https://drive.google.com/file/d/0Bzo2vOqfrXhJUmMxV2V0OThZUE0/view?usp=sharing
But with the optimal settings k=10 and w=1.2 my constraints were broken: kennedy => JFK was no longer on the first page, so I stopped there.

By combining the automatic search with a set of sensible constraints, I tend to think that "semi-automatic" optimization is possible.
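Concretely, "semi-automatic" could just mean letting the grid search maximise Paul's score (or nDCG) while rejecting any point in the search space that breaks the must-have constraints, along the lines of:

    def constrained_best(configs, engine_score, meets_constraints):
        # Both helpers are the placeholders sketched earlier in this thread:
        # engine_score(config) replays logged queries and scores the results,
        # meets_constraints(config) checks the "must-have" results still hold.
        candidates = [(engine_score(c), c) for c in configs
                      if meets_constraints(c)]
        return max(candidates, key=lambda pair: pair[0]) if candidates else None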

Note that the number of queries used in this experiment is far too low to draw any conclusions, but I think it shows that the *method* could work. With a larger query set and more constraints we could maybe conclude that pageviews are a better signal than incoming links?