On Mar 22, 2016 2:26 AM, "Justin Ormont" <justin.ormont(a)gmail.com> wrote:
Dwell time:
Would it be hard to graph dwell time vs. next-search similarity? This may give you an idea of the correct dwell time thresholds for your satisfaction metric. For query similarity, the Jaccard index is trivial (intersection count / union count), and TF-IDF weighted cosine similarity shouldn't be hard.
Shouldn't be too hard; it looks like something that could be done fairly easily by pulling all the data into Python to calculate and graph. Our sessions are currently only 10 minutes and average just under 2 searches per session, so I'm not sure how much data will remain after filtering down to sessions with a search after a click, but it isn't too hard to find out.
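For what it's worth, a rough sketch of that Python approach, assuming the satisfaction schema can be exported as rows of session_id, event_time, query, and dwell_seconds (all hypothetical column names):

    import matplotlib.pyplot as plt
    import pandas as pd

    def jaccard(q1, q2):
        # Jaccard index over query terms: |intersection| / |union|.
        a, b = set(q1.lower().split()), set(q2.lower().split())
        return len(a & b) / len(a | b) if (a | b) else 0.0

    logs = pd.read_csv("session_logs.csv")  # hypothetical export
    logs = logs.sort_values(["session_id", "event_time"])

    points = []
    for _, session in logs.groupby("session_id"):
        rows = session.to_dict("records")
        for prev, nxt in zip(rows, rows[1:]):
            # Keep only searches that follow a search with a clicked result.
            if prev["dwell_seconds"] > 0:
                points.append((prev["dwell_seconds"],
                               jaccard(prev["query"], nxt["query"])))

    dwell, sim = zip(*points)
    plt.scatter(dwell, sim, alpha=0.2)
    plt.xlabel("dwell time (s)")
    plt.ylabel("Jaccard similarity to next query")
    plt.show()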
If you have enough data, the right dwell time will likely change based on various page factors {isDisambiguation, page length, page categories, etc.} and you could make an independent graph for each. On a disambiguation page you may want to merge in the dwell time of a click made from that page. That's a tradeoff which implies that disambiguation pages are ~0 cost.
I'm thinking it's unlikely we currently have enough data for this, partially because we were only collecting data for desktop full-text search. We expanded this a week ago to start collecting autocomplete data as well, and we have a goal for the next couple of months to expand to mobile web and the apps, so hopefully we will start to have more data in the future, but it's hard to say yet how much.
> User satisfaction metric: (I think this is your search engine score?)
> You may want to include time-to-success and time-to-failure as additional factors, # of clicks in session, etc. Strongest is likely a session success rate (did they eventually find what they wanted?). This style of combined metric tends to evolve over time as you gain greater insight into what determines whether a user is satisfied.
The engine score right now is just Paul's metric; for satisfaction we are using sessions that click through and have a dwell time of more than 10 seconds. We haven't really been comfortable calling this satisfaction, though, since we can't prove any of those users are actually satisfied, so it's called augmented clickthrough in our dashboards. We do need to come up with something we can properly label a satisfaction metric, and Mikhail is doing some work in that direction.
Mikhail will also be doing the first analysis of an A/B test with some of these more advanced metrics soon; the test (on the phrase boost weight) was turned off today after running for 8 days. I'm not finding the right ticket from my phone, but I think he will be looking at time to first click, searches per session, clicks per session, and a few others. It will certainly be interesting to figure out how to merge all this into a more comprehensive SAT metric than the simplistic strict dwell time used today.
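Those per-session numbers should be cheap to compute once the events are in one place; a minimal sketch, again with hypothetical columns (session_id, event_time in seconds, event_type of "search" or "click"):

    import pandas as pd

    events = pd.read_csv("events.csv").sort_values(["session_id", "event_time"])

    def session_metrics(session):
        searches = session[session.event_type == "search"]
        clicks = session[session.event_type == "click"]
        return pd.Series({
            "searches": len(searches),
            "clicks": len(clicks),
            # Seconds from the first search to the first click, if any.
            "time_to_first_click": (clicks.event_time.min() -
                                    searches.event_time.min())
                                   if not clicks.empty else float("nan"),
        })

    per_session = events.groupby("session_id").apply(session_metrics)
    print(per_session.mean())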
Paul's metric:
Sounds a lot like DCG (part of nDCG); it seems to average over distinct users, which otherwise could cause skew in small search engines (heavy users would skew metrics). [note, I haven't seen his talk]
It is a form of DCG as I understand it as well. The use of per-user data is twofold: one part is keeping heavy users from over-influencing the result; the other is using a binary relevance signal which treats a click-through by a user as relevant to all searches performed by that user. This should mean that if the user does two searches and only clicks a result in the second, the score will improve by moving that result up in the first result set. The default in RelForge for this is currently to use only clicks with >= 10s dwell time, but as mentioned above this is a very rough SAT metric.
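To make sure I understand it the same way, here's roughly how I picture the calculation; this is only a sketch from the description above, and the details of Paul's actual formula may differ:

    import math
    from collections import defaultdict

    def engine_score(searches):
        # searches: list of dicts with hypothetical keys 'user',
        # 'results' (ordered page ids), 'clicked' (set of page ids).
        # A click by a user marks that page relevant for *all* of that
        # user's searches, not just the one it was clicked from.
        relevant_by_user = defaultdict(set)
        for s in searches:
            relevant_by_user[s["user"]] |= s["clicked"]

        per_user = defaultdict(list)
        for s in searches:
            relevant = relevant_by_user[s["user"]]
            dcg = sum(1.0 / math.log2(pos + 1)  # binary gain, log discount
                      for pos, page in enumerate(s["results"], start=1)
                      if page in relevant)
            per_user[s["user"]].append(dcg)

        # Average per user first, so heavy users don't dominate the score.
        user_means = [sum(v) / len(v) for v in per_user.values()]
        return sum(user_means) / len(user_means)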
Optimization:
There are two main training set styles to use: (1) hand-judged results, or (2) click-based logs. There are many tradeoffs. Hand-judged training sets are needed for sites with low traffic, and they nicely need less prep work. Click-based training sets require prep work like a good metric to optimize towards (eg: SAT, which is non-obvious to discern). Hand-judged sets often lack context (eg: location of user, freshness of documents, intent of user, other results displayed, etc.) and instead try to give a global label to a query-result pair.
Right now we are basically pursuing both types independently to see what works. My current idea is to optimize against one score, and then verify by looking at other scores with the chosen configuration vs. prod settings. Those problems with hand-judged content are becoming more obvious to me as I try out the judgement platform while getting it ready to use. There are queries where I don't know what should be relevant when looking at results aggregated from a few search engines. I'm hoping that we can get multiple reviewers for each query and the aggregate will be useful.
nDCG & recall:
The normalization in nDCG is created by finding the optimal ordering of results which are expected to be in your index. For example, your judgement set from enwiki for q=[legend film 2015] may not include the correct result of Legend (2015 film) (@ pos #25). To counter, include results from Google/Bing/DDG/etc where q=[site:wikipedia.org legend film 2015] (the expected result is @ pos #1 in each); or judge the top 50 results from each set. Or, explicitly ask a judge "what are the best pages for this query" and include these candidate pages for the next judge.
Not a bad idea to ask the grader for other options, but I'm hoping it won't be necessary in the first iteration. I've integrated the idea of pulling results from multiple engines for use. I'm rather hopeful that on its own this will be plenty initially, because the other engines often have much better results than our own.
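For reference, the nDCG computation itself is small; a sketch under the usual definition, with judgments as a page-to-grade dict (names here are illustrative, not RelForge's actual API):

    import math

    def dcg(gains):
        return sum(g / math.log2(pos + 1)
                   for pos, g in enumerate(gains, start=1))

    def ndcg(ranked_pages, judgments, k=20):
        gains = [judgments.get(page, 0) for page in ranked_pages[:k]]
        # Conventional ideal: the same retrieved gains, re-sorted best-first.
        idcg = dcg(sorted(gains, reverse=True))
        return dcg(gains) / idcg if idcg > 0 else 0.0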
Judgement platform:
I prefer to display all of the results at once instead of one-by-one. This increases the ratings per second at the trade-off of independence.
Once you have enough judgements in, you may want to look at inter-judge agreement; do all judges rate the query-result pair a 4? do some rate it a 1? Disagreement tends to imply poor instructions, though true disagreements happen too.
I hadn't considered what effect showing one at a time would have on ratings per second, but it makes plenty of sense. At this point we are probably better off optimizing for speed and making the independence trade-off, because we don't have a large number of graders to lean on. I'll have to switch up the way that part of the platform works sometime this week.
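Once we have enough overlapping judgements, flagging disagreement should be cheap; a quick sketch, assuming ratings come in as (query, page, judge, grade) tuples (all hypothetical):

    from collections import defaultdict
    from statistics import pstdev

    def disagreements(ratings, min_judges=2, spread=1.0):
        grades = defaultdict(list)
        for query, page, judge, grade in ratings:
            grades[(query, page)].append(grade)
        # A high grade spread on the same pair often points to unclear
        # instructions, though genuine disagreements happen too.
        flagged = [(pair, gs) for pair, gs in grades.items()
                   if len(gs) >= min_judges and pstdev(gs) >= spread]
        return sorted(flagged, key=lambda item: -pstdev(item[1]))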
> --justin
> On Tue, Mar 8, 2016 at 5:51 AM, David Causse <dcausse(a)wikimedia.org> wrote:
>
>> On 07/03/2016 19:01, Erik Bernhardson wrote:
>>
>>> I've put together some initial patches to our Relevance Forge over the weekend to move this forward. Relevance Forge can now source queries and click-through data from the satisfaction schema to replay against an elasticsearch cluster. It calculates Paul's engine score (I need a better name than engine score for that...) with a single CirrusSearch config, or it can apply a grid search over a multi-dimensional set of CirrusSearch config values (numbers only for now) to try to find the optimal values for some parameters.
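(To illustrate the grid-search part: a toy sketch of the idea, not RelForge's actual interface; score the cross product of candidate values and keep the best. run_engine_score is a hypothetical scoring callback.)

    from itertools import product

    def grid_search(param_grid, score_fn):
        # param_grid: {param_name: [values to try]};
        # score_fn(config) -> float, e.g. replay logged queries against
        # elasticsearch with that config and compute the engine score.
        names = list(param_grid)
        best_config, best_score = None, float("-inf")
        for values in product(*(param_grid[name] for name in names)):
            config = dict(zip(names, values))
            score = score_fn(config)
            if score > best_score:
                best_config, best_score = config, score
        return best_config, best_score

    # e.g. grid_search({"phrase_boost": [0.1, 0.3, 1, 3, 10],
    #                   "incoming_links": [0.5, 1.0, 1.5]}, run_engine_score)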
>>
>>> In the first couple of times testing this, it looks to (unsurprisingly) heavily prefer our current results. I say unsurprisingly because the satisfaction schema shows ~87% of clicks being within the top 3 results. Additionally, Paul's engine score is meant to work best grouping by users over some length of time, whereas we only have ~10 minute sessions which contain an average of 2 searches per session. Because of this I have also implemented nDCG calculation within Relevance Forge (it was relatively easy), but unfortunately we do not have the data to plug into this equation to generate a score. Queries being PII can make this a bit difficult, but I'm sure we can find a reasonable way. I demoed a POC at last week's CREDIT showcase that makes collecting these scores less of a chore, but it needs to be expanded. There are, at least, two main things that have to happen before we can start calculating nDCG. I really like Justin's idea of pulling in results from other search engines to be scored; this would allow us to improve recall and see a positive improvement in the nDCG score. We also have to either get a set of queries whitelisted through legal so we can run this in labs and have our own scoring + assistance from the community, or we have to build up a database of queries and distribute it to team members so they can rate the relevance of queries on their local machine, and then aggregate those databases back together (a pain... there must be a better way).
>>
>>> One issue I see is that neither of these metrics, Paul's engine score or nDCG, directly deals with measuring the impact of fixing our recall problems, although perhaps I misread how the idealized DCG is supposed to work. Pulling in results from other search engines will help, so at least improved recall won't result in a reduced score, but I was thinking maybe we could calculate an alternate idealized DCG which uses the highest-rated documents for a query, rather than just re-sorting the result set by the relevance score?
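(That alternate ideal would be a one-line change to a standard nDCG implementation: build the ideal ordering from the whole judged pool rather than re-sorting what was retrieved, so a recall miss actually lowers the score. A sketch, with judgments as a page-to-grade dict:)

    import math

    def dcg(gains):
        return sum(g / math.log2(pos + 1)
                   for pos, g in enumerate(gains, start=1))

    def ndcg_recall_aware(ranked_pages, judgments, k=20):
        gains = [judgments.get(page, 0) for page in ranked_pages[:k]]
        # Ideal DCG from the best-graded documents in the whole judged
        # pool, not just a re-sort of the retrieved set: documents we
        # failed to retrieve now cost score.
        idcg = dcg(sorted(judgments.values(), reverse=True)[:k])
        return dcg(gains) / idcg if idcg > 0 else 0.0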
>
>
>> Thanks Erik, your new patch was a revelation! :)
>
>> Yes I totally agree, none of these metrics will help us to find the features needed to improve recall. If the result is not in the top 20 it can't be evaluated by the score.
> Paul's score and query clicks grabbed from search logs will help us to tune the existing features but won't help us to discover the missing ones.
> Both nDCG and Paul's score help to score the ability of the system to rank the top 20 results properly, but unfortunately we know nothing about the results that are out of the top 20: i.e. are the relevant docs found or are they missed?
> Is it actually a ranking precision problem that results in bad recall in the top 20, or a pure recall problem because the result is not even in the whole result set?
>
>> Example:
>> Some users pointed out the inability of the system to properly weight words in the title:
> On enwiki here are some queries where relevant results are not in the top 20 but are in the whole result set (page 2 or more):
> - history france => History of France
> - france middle ages => France in the Middle Ages
> - syrian civil war casualties => Casualties of the Syrian Civil War
> - cities in the san francisco bay area => List of cities and towns in the San Francisco Bay Area
>> - new york university alumni => List of New York University alumni
>> And this one, which is imho very representative of the problem:
>> - legend film 2015 => Legend (2015 film)
>
>> In these cases our "recall
problem" is in reality a ranking precision
problem where "recall features" (synonyms, spell corrections, phonetic...)
would not really help (it could be even worse).
> All these queries were reported on Phabricator or on this list. I'm not sure that we could really find a score that helps us to discover such problems; these scores are only useful to tune the top-N ranking.
> We should maybe start to build a small set of queries that are easily mapped to a system feature, more like a simple regression test. This is maybe what Paul suggests when he says that the set of queries you choose is not necessarily representative of the usage (search logs and clicks) but relevant to the system?
>
>> When you say "I was thinking maybe
we could calculate an alternate
idealized DCG which uses the highest rated documents for a query, rather
than just resorting the result set by the relevance score? "
> Do you mean : I want a system that is able to say:
if result Y is not
returned in the top 10 for query X then it's a problem?
> A score that acts as a set of constraints, kind of
"must-have" results
for a specific query?
> If yes, I agree with you, and it is what I did naively when playing with your patch (see below).
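(Such a constraint set is easy to mechanize; a sketch, with expected pages taken from the examples in this thread and search_fn as a hypothetical query runner returning ordered page titles:)

    CONSTRAINTS = [
        ("kennedy", "JFK", 10),
        ("legend film 2015", "Legend (2015 film)", 20),
        ("history france", "History of France", 20),
    ]

    def check_constraints(search_fn, constraints=CONSTRAINTS):
        failures = []
        for query, expected, top_n in constraints:
            # The expected page must appear in the top N for its query.
            if expected not in search_fn(query)[:top_n]:
                failures.append((query, expected, top_n))
        return failures  # an empty list means every constraint holds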
>
>> Experiments with your patch:
>> I've set up BM25 and a custom query builder to specifically address the bad precision regarding words in the title (BM25 was needed because the default Lucene TF/IDF is very bad and does not allow fine-tuning: the score range is too big and noisy).
> I tried to manually constrain the system with the following queries on an enwiki index (http://en-suggesty.wmflabs.org/wiki/Special:Search):
> - kennedy => JFK in the top 10 (feature: boost on incoming links/page views)
> - sf novelist (and sf novelist*s*) => List of science fiction authors (constraint not to overboost phrase or title)
>> - all the queries listed above concerning poor title match
>
>> Manual tuning was extremely hard and misleading; the phrase rescore boost was initially 10 (the current default) and I had to decrease it to 0.2 (it took a lot of my time playing with various weight settings).
> Surprisingly, automatic tuning with Erik's patch reached nearly the same conclusion with a totally different query set; the preferred phrase boost was 0.3 (it took a few hours to compute).
> Paul's score started from 0.50 and maxed out at 0.58; below is the result of 4 iterations where I gradually narrowed the search space:
> - https://drive.google.com/drive/folders/0Bzo2vOqfrXhJMzFmcWthZ0Ytc2c
> - axis: x is the weight of the phrase rescore => best 0.3
> - axis: y is the weight of incoming_links => best 1.0
> Note that my initial constraints were still respected (and even slightly better in some cases).
> The incoming links weight was way higher than I expected, but as Erik said, it's certainly due to the fact that Paul's score prefers the current results and the formula currently used on enwiki tends to overboost incoming links.
>
>> Then, because it was fun to play with, I tried to replace the number of incoming links with pageviews data as a query-independent factor (in a previous test the system seemed to prefer incoming links).
> I played with the weight (x axis) and k (a factor to re-shape the pageviews distribution):
> I was unable to isolate an optimal point, but the score maxed out at 0.61 (0.58 with incLinks) with a very high weight (maybe more than 2):
> - https://drive.google.com/file/d/0Bzo2vOqfrXhJdTFNUEZEanhzQVU/view?usp=shari… (sorry, it's hard to read because k is very low; a log-scale graph would be more useful here).
> Here I think we start to "over-optimize"
the params against the Paul's
score query set, but again, surprisingly, my
initial constraints are nearly
respected (kennedy => JFK was out of the top 10 but still on the first page
at #13)...
>
>> I ran the same optimization again with incoming links to double check:
>> Same here, I played with weight (x axis) and k; the score maxed out at 0.58.
> - https://drive.google.com/file/d/0Bzo2vOqfrXhJUmMxV2V0OThZUE0/view?usp=shari…
> But with the optimal settings k=10 and w=1.2 my constraints are broken: kennedy => JFK is no longer on the first page, so I stopped here.
>
>> By combining a set of sensible constraints, I tend to think that "semi-automatic" optimization is possible.
>
>> Note that the number of queries used in this experiment is way too low to draw any conclusions, but I think it proves that the *method* could work. With larger query sets and more constraints we could maybe conclude that pageviews is a better signal than incoming links?