I've put together some initial patches to our Relevance Forge over the weekend to move this forward. Relevance Forge can now source queries and click-through data from the satisfaction schema to replay against an Elasticsearch cluster. It calculates Paul's engine score (I need a better name than engine score for that...) with a single CirrusSearch config, or it can apply a grid search over a multi-dimensional set of CirrusSearch config values (numbers only for now) to try to find the optimal values for some parameters.
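To make the grid-search part concrete, here's a minimal sketch of the idea: enumerate every combination of candidate config values and keep whichever combination scores best when replayed. The parameter names and the scoring callback are hypothetical placeholders, not the actual CirrusSearch config keys.

```python
import itertools

def grid_search(score_fn, param_grid):
    """Evaluate every combination of parameter values; return the best one.

    score_fn: callable taking a dict of config values and returning a score
              (e.g. the engine score from replaying queries against a cluster).
    param_grid: dict mapping parameter name -> list of candidate values.
    """
    best_score, best_params = float('-inf'), None
    keys = sorted(param_grid)
    # itertools.product walks the full multi-dimensional grid.
    for values in itertools.product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = score_fn(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

# Hypothetical grid; real runs would use CirrusSearch config values.
param_grid = {
    'phrase_boost': [1.0, 2.0, 4.0],
    'title_weight': [0.5, 1.0, 2.0],
}
```

The cost is exponential in the number of parameters (here 3 × 3 = 9 replays), which is why keeping the grid small per run matters.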
In my first couple of test runs it (unsurprisingly) heavily prefers our current results. I say unsurprisingly because the satisfaction schema shows ~87% of clicks landing within the top 3 results. Additionally, Paul's engine score is meant to work best when grouping by users over some length of time, whereas we only have ~10 minute sessions containing an average of 2 searches each. Because of this I have also implemented nDCG calculation within Relevance Forge (that was relatively easy), but unfortunately we do not yet have the data to plug into this equation to generate a score. Queries being PII makes this a bit difficult, but I'm sure we can find a reasonable way. I demoed a POC at last week's CREDIT showcase that makes collecting these scores less of a chore, but it needs to be expanded.

There are at least two main things that have to happen before we can start calculating nDCG. I really like Justin's idea to pull in results from other search engines to be scored; this would let us improve recall and still see a positive change in the nDCG score. We also have to either get a set of queries whitelisted through legal so we can run this in labs with our own scoring plus assistance from the community, or build up a database of queries, distribute it to team members so they can rate result relevance on their local machines, and then aggregate those databases back together (a pain... there must be a better way).
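For reference, the nDCG calculation itself is small; the hard part is getting the relevance labels. A minimal sketch of the standard formulation (graded relevance discounted by log2 of rank, normalized by the ideal ordering of the same labels):

```python
import math

def dcg(relevances):
    # Discounted cumulative gain: each graded relevance label is divided by
    # log2(rank + 2), so rank 0 gets log2(2) = 1, rank 1 gets log2(3), etc.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(result_relevances, k=None):
    """nDCG: DCG of the results as returned, divided by the DCG of the
    same relevance labels sorted into the ideal (descending) order."""
    if k is not None:
        result_relevances = result_relevances[:k]
    idcg = dcg(sorted(result_relevances, reverse=True))
    return dcg(result_relevances) / idcg if idcg > 0 else 0.0
```

So `ndcg([3, 2, 1])` is 1.0 (already ideal), while putting the best document last drags the score down. Note the normalization only re-sorts what we actually retrieved, which is exactly the recall blind spot mentioned below.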
One issue I see is that neither of these metrics, Paul's engine score or nDCG, directly measures the impact of fixing our recall problems, although perhaps I misread how the idealized DCG is supposed to work. Pulling in results from other search engines will help, so at least improved recall won't result in a reduced score, but I was thinking maybe we could calculate an alternate idealized DCG that uses the highest-rated documents for a query, rather than just re-sorting the result set by relevance score?
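If I'm reading the idea right, the change is only in the normalizer: instead of re-sorting our own result set, build the ideal ranking from all judged documents for the query (e.g. pooled from several engines). A hedged sketch, reusing the `dcg` helper from above; the pooling source is an assumption:

```python
import math

def dcg(relevances):
    # Same discounting as standard DCG: rel / log2(rank + 2).
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg_pooled(result_relevances, judged_relevances, k=10):
    """nDCG normalized against the best-judged documents for the query,
    not just a re-sort of what we returned. If a highly relevant document
    exists in the judged pool but we never retrieved it, the ideal DCG
    stays high while ours doesn't, so poor recall now lowers the score."""
    idcg = dcg(sorted(judged_relevances, reverse=True)[:k])
    return dcg(result_relevances[:k]) / idcg if idcg > 0 else 0.0
```

With standard nDCG, returning `[3, 2]` in that order scores 1.0 regardless of what we missed; against a judged pool of `[3, 3, 2]` the pooled variant scores below 1.0, which is the behavior we'd want for measuring recall fixes.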