On 07/03/2016 at 19:01, Erik Bernhardson wrote:
I've put together some initial patches to our
Relevance Forge over the
weekend to move this forward. Relevance Forge can now source queries
and click-through data from the satisfaction schema to replay against
an elasticsearch cluster. It calculates Paul's engine score (I need a
better name than engine score for that...) with a single CirrusSearch
config, or it can apply a grid search over a multi-dimensional set of
CirrusSearch config values (numbers only for now) to try to find the
optimal values for some parameters.
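The grid search loop itself is simple enough to sketch; here `toy_score` stands in for the real engine score (which would replay logged queries against the cluster), and the parameter names are illustrative, not the actual CirrusSearch config keys:

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Exhaustively evaluate every combination of config values with
    score_fn and return the best-scoring configuration."""
    best_score, best_config = float("-inf"), None
    names = list(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        config = dict(zip(names, values))
        score = score_fn(config)
        if score > best_score:
            best_score, best_config = score, config
    return best_config, best_score

# Toy stand-in for the real engine score, which would replay logged
# queries against an elasticsearch cluster using the given config.
def toy_score(config):
    return (-(config["phrase_boost"] - 0.3) ** 2
            - (config["incoming_links"] - 1.0) ** 2)

grid = {
    "phrase_boost": [0.1, 0.3, 1.0, 10.0],  # hypothetical values
    "incoming_links": [0.5, 1.0, 2.0],
}
best, score = grid_search(grid, toy_score)
```

Narrowing the grid around the best point and re-running gives the iterative refinement used in the experiments below.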
In the first couple of test runs it (unsurprisingly) appears to
heavily prefer our current results. I say unsurprisingly because the
satisfaction schema shows ~87% of clicks being within the top 3
results. Additionally, Paul's engine score is meant to work best when
grouping by users over some length of time, whereas we only have ~10
minute sessions containing an average of 2 searches per session.
Because of this I have also implemented nDCG calculation within
Relevance Forge (it was relatively easy), but unfortunately we do not
have the data to plug into this equation to generate a score. Queries
being PII can make this a bit difficult, but I'm sure we can find a
reasonable way. I demoed a POC at last week's CREDIT showcase that
makes collecting these scores less of a chore, but it needs to be
expanded. There are at least two main things that have to happen
before we can start calculating nDCG. I really like Justin's idea to
pull in results from other search engines to be scored; this would
allow us to improve recall and see a positive improvement in the nDCG
score. We also have to either get a set of queries whitelisted through
legal so we can run this in labs and have our own scoring plus assistance
from the community, or build up a database of queries, distribute it
to team members so they can rate the relevance of results on their
local machines, and then aggregate those databases back together (a
pain... there must be a better way).
One issue I see is that neither of these metrics, Paul's engine score
or nDCG, directly deals with measuring the impact of fixing our recall
problems, although perhaps I misread how the idealized DCG is supposed
to work. Pulling in results from other search engines will help, so at
least improved recall won't result in a reduced score, but I was
thinking maybe we could calculate an alternate idealized DCG which
uses the highest-rated documents known for a query, rather than just
re-sorting the returned result set by the relevance rating?
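To make the distinction concrete, here is a minimal sketch of both normalizations, using the common 2^rel - 1 gain formulation (an assumption; the actual Relevance Forge code may differ). The standard ideal re-sorts the returned ratings, while the alternate ideal uses the best-rated documents known for the query, whether or not they were retrieved:

```python
from math import log2

def dcg(ratings):
    """Discounted cumulative gain for a ranked list of relevance ratings."""
    return sum((2 ** r - 1) / log2(i + 2) for i, r in enumerate(ratings))

def ndcg(result_ratings, ideal_ratings):
    """Normalize the DCG of the returned ranking by the DCG of an ideal one."""
    ideal = dcg(sorted(ideal_ratings, reverse=True))
    return dcg(result_ratings) / ideal if ideal > 0 else 0.0

# Hypothetical ratings (0-3) of what the engine returned, in rank order:
returned = [1, 3, 0, 2]
# Standard nDCG: the ideal is the same result set, re-sorted by rating.
standard = ndcg(returned, returned)
# Alternate: the ideal is the set of best-rated documents known for this
# query, including a rating-3 doc the engine failed to retrieve.
alternate = ndcg(returned, [3, 3, 2, 1])
```

Under the alternate normalization a missed relevant document lowers the score, so improved recall can actually move the number, which is the point of the suggestion.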
Thanks Erik, your new patch was a revelation! :)
Yes, I totally agree: none of these metrics will help us find the
features needed to improve recall. If a result is not in the top 20 it
can't be evaluated by the score.
Paul's score and query clicks grabbed from search logs will help us to
tune the existing features but won't help us to discover the missing ones.
Both nDCG and Paul's score measure the ability of the system to
rank the top 20 results properly, but unfortunately we know nothing about
the results that are outside the top 20: i.e. are the relevant docs found
or are they missed?
Is it actually a ranking-precision problem that shows up as bad recall
in the top 20, or a pure recall problem because the result is not even in
the whole result set?
Example:
Some users pointed out the inability of the system to properly weight
words in the title.
On enwiki, here are some queries where relevant results are not in the top
20 but are in the whole result set (page 2 or beyond):
- history france => History of France
- france middle ages => France in the Middle Ages
- syrian civil war casualties => Casualties of the Syrian Civil War
- cities in the san francisco bay area => List of cities and towns in
the San Francisco Bay Area
- new york university alumni => List of New York University alumni
And this one, which is imho very representative of the problem:
- legend film 2015 => Legend (2015 film)
In these cases our "recall problem" is in reality a ranking-precision
problem, where "recall features" (synonyms, spell corrections,
phonetic matching...) would not really help (they could even make it worse).
All these queries were reported on phabricator or on this list. I'm not
sure we could really find a score that helps us discover such
problems. These scores are only useful for tuning the top-N ranking.
We should maybe start building a small set of queries that are easily
mapped to a system feature, more like a simple regression test. This is
maybe what Paul suggests when he says that the set of queries you chose
is not necessarily representative of actual usage (search logs and
clicks) but is relevant to the system?
When you say "I was thinking maybe we could calculate an alternate
idealized DCG which uses the highest rated documents for a query, rather
than just resorting the result set by the relevance score?"
Do you mean: you want a system that is able to say, if result Y is not
returned in the top 10 for query X then it's a problem?
A score that acts as a set of constraints, a kind of "must-have" set of
results for a specific query?
If so, I agree with you, and it is what I did naively when playing with
your patch (see below).
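As a sketch, the "must-have" idea reduces to a trivial check that could run alongside the score; the function and data layout here are hypothetical, not part of the patch:

```python
def check_constraints(results_by_query, constraints):
    """Return the constraints that are violated. Each constraint says a
    given page must appear in the top-N results for a given query.

    constraints: list of (query, must_have_title, top_n) tuples.
    results_by_query: dict mapping query -> ranked list of result titles.
    """
    violations = []
    for query, title, top_n in constraints:
        if title not in results_by_query.get(query, [])[:top_n]:
            violations.append((query, title, top_n))
    return violations

# The constraints I used in practice, expressed in this form:
constraints = [
    ("kennedy", "JFK", 10),
    ("sf novelist", "List of science fiction authors", 10),
]
# Hypothetical replayed results for each query:
results = {
    "kennedy": ["John F. Kennedy", "JFK"] + ["..."] * 8,
    "sf novelist": ["List of science fiction authors"],
}
violations = check_constraints(results, constraints)
```

An optimizer run that produces a non-empty violation list can then be rejected regardless of its score, which is the "semi-automatic" optimization discussed at the end of this mail.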
Experiments with your patch:
I've set up BM25 and a custom query builder to specifically address the
poor precision on words in the title (BM25 was needed because the
default lucene TF/IDF is very bad and does not allow fine-tuning: the
score range is too big and noisy).
I tried to manually constrain the system with the following queries on
an enwiki index (
http://en-suggesty.wmflabs.org/wiki/Special:Search):
- kennedy => JFK in the top 10 (feature: boost on incoming links/page views)
- sf novelist (and sf novelist*s*) => List of science fiction authors
(a constraint not to overboost phrase or title matches)
- all the queries listed above concerning poor title matching
Manual tuning was extremely hard and misleading: the phrase rescore boost
was initially 10 (the current default) and I had to decrease it to 0.2
(playing with the various weight settings took a lot of my time).
Surprisingly, automatic tuning with Erik's patch reached nearly the same
conclusion with a totally different query set; the preferred phrase boost
was 0.3 (it took a few hours to compute).
Paul's score started from 0.50 and maxed out at 0.58, below is the
result of 4 iterations where I gradually narrowed the search space:
-
https://drive.google.com/drive/folders/0Bzo2vOqfrXhJMzFmcWthZ0Ytc2c
- axis: x is the weight of the phrase rescore => best 0.3
- axis: y is the weight of incoming_links => best 1.0
Note that my initial constraints were still respected (and even slightly
better in some cases).
The incoming links weight was way higher than I expected, but as Erik
said, that's certainly because Paul's score prefers the current results
and the formula currently used on enwiki tends to overboost incoming
links.
Then, because it was fun to play with, I tried to replace the number of
incoming links with pageviews data as a query-independent factor (in a
previous test the system seemed to prefer incoming links).
I played with the weight (x axis) and k (a factor that re-arranges the
pageviews distribution):
I was unable to isolate an optimal point, but the score maxed out at 0.61
(vs 0.58 with incoming links) with a very high weight (maybe more than 2):
-
https://drive.google.com/file/d/0Bzo2vOqfrXhJdTFNUEZEanhzQVU/view?usp=shari…
(sorry, it's hard to read because k is very low; a log-scale graph would
be more useful here).
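For context, a common shape for folding a raw count like pageviews into a score is the saturation x/(x+k), where k is the half-saturation point; I'm assuming the k above plays a similar role (an assumption, not a reading of the patch). With a very small k almost every page saturates, which would explain the hard-to-read graph:

```python
def saturation(x, k):
    """Squash a raw count x into [0, 1); k is the half-saturation
    point: saturation(k, k) == 0.5."""
    return x / (x + k)

def qif_boost(pageviews, weight, k):
    """Query-independent factor built from the saturated pageview
    signal (an assumed form, not necessarily what the patch does)."""
    return 1.0 + weight * saturation(pageviews, k)

# With k tiny relative to typical pageview counts, two pages with very
# different traffic get nearly identical boosts -- the signal flattens.
flat_gap = abs(saturation(1000, 10) - saturation(100000, 10))
```

On a log scale in k the curve families separate, which is why a log-scale plot would make the graph readable.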
Here I think we start to "over-optimize" the params against Paul's
score query set, but again, surprisingly, my initial constraints were
nearly respected (kennedy => JFK was out of the top 10 but still on the
first page at #13)...
I ran the same optimization again with incoming links to double-check.
As before, I played with the weight (x axis) and k; the score maxed out
at 0.58:
-
https://drive.google.com/file/d/0Bzo2vOqfrXhJUmMxV2V0OThZUE0/view?usp=shari…
But with the optimal settings k=10 and w=1.2 my constraints were broken:
kennedy => JFK was no longer on the first page, so I stopped here.
By combining a set of sensible constraints, I tend to think that
"semi-automatic" optimization is possible.
Note that the number of queries used in this experiment is way too low
to draw any conclusions, but I think it proves that the *method* could
work. With a larger query set and more constraints we could maybe
conclude that pageviews are a better signal than incoming links?