I've put together some initial patches to our Relevance
Forge over the weekend to move this forward. Relevance
Forge can now source queries and click-through data from
the satisfaction schema and replay them against an
Elasticsearch cluster. It calculates Paul's engine score
(I need a better name than "engine score" for that...)
for a single CirrusSearch config, or it can run a grid
search over a multi-dimensional set of CirrusSearch config
values (numbers only for now) to try to find optimal
values for some parameters.
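
To make the grid search concrete, here's a minimal sketch
of the idea. This is not Relevance Forge's actual API; the
parameter names and the score_config() callback below are
hypothetical stand-ins for replaying the sourced queries
and computing the engine score.

    import itertools

    # Hypothetical numeric CirrusSearch parameters to sweep.
    param_grid = {
        'phrase_boost': [0.5, 1.0, 2.0],
        'title_weight': [1.0, 2.0, 3.0],
    }

    def grid_search(score_config):
        # Try every combination of values and keep the config
        # with the best score. score_config(config) stands in
        # for replaying the queries and scoring the results.
        best_score, best_config = float('-inf'), None
        names = sorted(param_grid)
        for values in itertools.product(
                *(param_grid[n] for n in names)):
            config = dict(zip(names, values))
            score = score_config(config)
            if score > best_score:
                best_score, best_config = score, config
        return best_config, best_score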
In the first couple of test runs this looks to
(unsurprisingly) heavily prefer our current results. I say
unsurprisingly because the satisfaction schema shows ~87%
of clicks landing within the top 3 results. Additionally,
Paul's engine score is meant to work best when grouping by
users over some length of time, whereas we only have ~10
minute sessions containing an average of 2 searches each.
Because of this I have also implemented nDCG calculation
within Relevance Forge (it was relatively easy), but
unfortunately we do not yet have the relevance judgments
to plug into that equation to generate a score.
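
For reference, the calculation is roughly the following (a
sketch, not the exact Relevance Forge code; it assumes we
have a graded relevance label for each returned result):

    import math

    def dcg(relevances):
        # Graded relevance, discounted by log2 of the
        # (1-based) rank of each result.
        return sum((2 ** rel - 1) / math.log2(rank + 2)
                   for rank, rel in enumerate(relevances))

    def ndcg(relevances):
        # Normalize by the "ideal" DCG: the same labels
        # re-sorted best-first. Note this only re-sorts the
        # labels of the results we actually returned, which
        # matters for the recall discussion below.
        ideal = dcg(sorted(relevances, reverse=True))
        return dcg(relevances) / ideal if ideal else 0.0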
Queries being PII makes collecting those judgments a bit
difficult, but I'm sure we can find a reasonable way. I
demoed a POC at last week's CREDIT showcase that makes
collecting these scores less of a chore, but it needs to
be expanded. There are at least two main things that have
to happen before we can start calculating nDCG. First, I
really like Justin's idea of pulling in results from other
search engines to be scored; that way documents we don't
currently return still get judgments, so improved recall
can show up as a positive improvement in the nDCG score
(see the pooling sketch after this paragraph). Second, we
either have to get a set of queries whitelisted through
legal so we can run this in labs and have our own scoring
plus assistance from the community, or we have to build up
a database of queries, distribute it to team members so
they can rate result relevance on their local machines,
and then aggregate those databases back together (a
pain... there must be a better way).
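
Here's roughly what pooling results from multiple engines
could look like. The engine.search() adapter interface is
hypothetical; the point is just that every pooled document
gets rated, not only the ones we currently return.

    def build_judgment_pool(query, engines, k=20):
        # Union of each engine's top-k results for the query.
        # Rating the whole pool means judgments exist even
        # for documents our engine misses today.
        pool = set()
        for engine in engines:
            pool.update(engine.search(query, k))
        return pool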
One issue I see is that neither of these metrics, Paul's
engine score or nDCG, directly measures the impact of
fixing our recall problems, although perhaps I've misread
how the idealized DCG is supposed to work. Pulling in
results from other search engines will help, so at least
improved recall won't result in a reduced score, but I was
thinking maybe we could calculate an alternate idealized
DCG which uses the highest-rated documents for a query,
rather than just re-sorting the returned result set by
relevance score?
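
Concretely, something like this, building on the dcg()
sketch above (the pooled relevance labels would come from
the cross-engine ratings):

    def ndcg_against_pool(result_relevances,
                          pool_relevances, k=20):
        # The ideal ranking is drawn from ALL judged
        # documents for the query, not just the ones this
        # engine returned, so missing a highly rated
        # document (a recall failure) now lowers the score
        # where the standard resort-only IDCG would not.
        ideal = dcg(sorted(pool_relevances, reverse=True)[:k])
        actual = dcg(result_relevances[:k])
        return actual / ideal if ideal else 0.0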