This thread started off-list, but I'm hoping all of you watching along can help us brainstorm ways to improve search satisfaction. Note that these aren't all my thoughts; they are a conglomeration of thoughts (many copy/pasted from off-list emails) from Trey, David, Mikhail, and me. That's also why this might not all read like one person wrote it.
A few weeks ago I attended ElasticON, where there was a good presentation about search satisfaction by Paul Nelson. One thing he considered incredibly important, which we had already been thinking about but hadn't moved forward enough on, was generating an Engine Score. This week Paul held an online webinar, which Trey attended, where he gave the same presentation without such strict time constraints. You can find my summary of this presentation in last week's email to this list, 'ElasticON notes'.
Some things of note:
- He doesn't like the idea of golden corpora, but his idea of one is different from Trey's. He imagines a hand-selected set of "important" queries that find "important" documents. I don't like that either (at least not by itself); I always imagine a random selection of queries for a golden corpus.
- He lumps BM25 in with TF/IDF and calls them ancient, unmotivated, and from the 80s and 90s. David has convinced us that BM25 is a good thing to pursue. Of course, part of Search Technologies' purpose is to drum up business, so they can't say, "hey, just use this in Elasticsearch" or they'd be out of business.
- He explains the mysterious K factor that got all this started in the first place. It controls how much weight changes far down the results list carry. It sounds like he might tune K based on the number of results for every query, but my question about that wasn't answered. In the demo he's only pulling 25 results, which Erik's click-through data shows is probably enough.
- He mentions that 25,000 "clicks" is a large enough set for measuring a score (and having random noise come out in the wash). It's not clear whether he meant 25K clicks or 25K user sessions, since it came up in the Q&A.
David and Trey talked about this some, and Trey thinks the idea of Paul's metric (Σ FACTOR^position * isRelevant[user, searchResult[Q, position].DocID]) has a lot of appeal. It's based on clicks and user sessions, so we'd have to be able to capture all the relevant information and make it available somewhere to replay in Relevance Forge for assessment. We currently have a reasonable amount of click-through data, collected from 0.5% of desktop search sessions, that we can use for this task. There are some complications, though, because this is PII and so has to be treated carefully.
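To make the shape of that metric concrete, something like the following (a rough sketch of my reading of it, not Paul's actual code; the FACTOR value, the session structure, and the averaging over (user, query) pairs are all assumptions):

    # Illustrative sketch of the metric above: for each (user, query) pair, sum
    # FACTOR^position over positions whose document the user's click data marks
    # as relevant, then average over pairs.

    FACTOR = 0.9  # discount base; smaller values make the score more top-heavy


    def engine_score(relevant_by_user, replayed_results, factor=FACTOR):
        """relevant_by_user: {user: set of doc_ids treated as relevant for that user}
        replayed_results: {(user, query): [doc_id, ...]} from replaying logged queries."""
        total = 0.0
        for (user, _query), doc_ids in replayed_results.items():
            relevant = relevant_by_user.get(user, set())
            total += sum(factor ** pos
                         for pos, doc_id in enumerate(doc_ids)
                         if doc_id in relevant)
        return total / len(replayed_results) if replayed_results else 0.0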
Mikhail's goal for our user satisfaction metric is a function that maps features, including dwell time, to a probability of satisfaction (e.g., 10s = 20% likely to be satisfied, 10m = 94% likely to be satisfied, etc.). The predictive model will include a variety of features of varying predictive power, such as dwell time, clickthrough rate, engagement (scrolling), etc. One problem with the user satisfaction metric is that it isn't replayable: we can't re-run the queries in vitro and get data on what users think of the new results. However, it does play into Nelson's idea, discussed in the paper and maybe in the video, of gradable relevance. Assigning a user satisfaction score to a given result would allow us to weight the various clicks in his metric rather than treating them all as equal (though that works, too, if it's all you have).
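As a toy illustration of the kind of mapping Mikhail is after (the logistic form and every coefficient below are invented for illustration; the real model would be fit from data):

    import math


    def satisfaction_probability(dwell_seconds, scrolled, clicks):
        """Toy logistic model mapping session features to P(satisfied).
        The weights are made up; a real model would be fit from labelled data."""
        z = -2.0 + 0.006 * dwell_seconds + 0.8 * scrolled + 0.3 * clicks
        return 1.0 / (1.0 + math.exp(-z))


    # Roughly in line with the examples above:
    # satisfaction_probability(10, 0, 1)  -> ~0.16 (short dwell, probably not satisfied)
    # satisfaction_probability(600, 1, 1) -> ~0.94 (ten-minute dwell, probably satisfied)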
We need to build a system that we are able to tune in an effective way. As Trey pointed out, Cirrus does not currently allow us to tune the core similarity function's parameters. David tends to think we need to replace our core similarity function with one that is suited to optimization; BM25 allows it, there are certainly others, and we could build our own. But the problem will be:
How do we tune these parameters in an effective way? With BM25 we will have 7 fields with 2 analyzers each: 14 internal Lucene fields. BM25 allows us to tune 3 params per field: weight, k1, and b.
- weight is likely to range between 0 and 1, in steps of maybe 2 digits of precision
- k1 from 1 to 2
- b from 0 to 1
And that's not counting the query-independent factors like popularity, PageRank & co. that we may want to add. It's clear that we will have to tackle hard search performance problems...
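To give an idea of the size of that search space, a quick back-of-the-envelope count (the step sizes for k1 and b are assumptions):

    # Naive grid over the per-field BM25 params described above.
    fields = 14            # 7 fields x 2 analyzers
    weight_steps = 101     # 0.00 .. 1.00 in steps of 0.01
    k1_steps = 11          # 1.0 .. 2.0 in steps of 0.1 (assumed granularity)
    b_steps = 11           # 0.0 .. 1.0 in steps of 0.1 (assumed granularity)

    per_field = weight_steps * k1_steps * b_steps   # 12,221 combinations per field
    joint = per_field ** fields                     # ~10^57 combinations for all 14 fields jointly
    print(per_field, joint)

An exhaustive sweep is clearly hopeless, hence the need for a proper optimization algorithm below.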
David tends to think we need to apply an optimization algorithm that will search for the optimal combination according to an objective. David doesn't think we can run such an optimization plan with A/B testing, which is why we need a way to replay a set of queries and compute various search engine scores. We don't know what the best approach is here:
- Extract the metrics from the search satisfaction schema that do not require user intervention (click and result position).
- Build our own set of queries with the tool Erik is building (temporary location: http://portal.wmflabs.org/search/index.php)
-- Erik thinks we should do both, as they will give us completely different sets of information. The metrics about what our users are doing are a great source of information and provide a good signal. The tool Erik is building comes at the problem from a different direction, sourcing search results from wiki/google/bing/ddg and getting humans to rate results as relevant/not relevant on a scale of 1 to 4. This can be used with other algorithms to generate an independent score. Essentially I think the best Relevance Forge output will be a multi-dimensional engine score and not just a single number.
-- We should keep records of how this engine score changes over days, months, and longer, so we can see a rate of improvement (or lack thereof, but hopefully improvement :)
And in the end, will this (BM25 and/or per-field weights) work?
- Not sure; maybe the text features we have today are not relevant and we need to spend more time extracting relevant text features from the MediaWiki content model (https://phabricator.wikimedia.org/T128076), but we should at least be able to say: this field has no impact, or only a bad one.
The big picture would be:
- Refactor Cirrus so that everything is suited to optimization
- A search engine score: the objective (Erik added it as a goal)
- An optimization algorithm to search/tune the system params. Trey has prior experience working within optimization frameworks, and Mikhail also has relevant machine learning experience.
- A/B testing with advanced metrics to confirm that the optimization found a good combination
With a framework like that we could spend more time on big-impact text features (wikitext, synonyms, spelling correction, ...). But yes it's a huge engineering task with a lot of challenges :/ It's also
Thanks Erik for summarizing the discussion so far.
The very last sentence got cut off:
But yes it's a huge engineering task with a lot of challenges :/ It's also
I think I know what was next:
... a fun engineering task with many new things to learn! :)
Even if that wasn't the next bit, it's still true.
I've put together some initial patches to our Relevance Forge over the weekend to move this forward. Relevance Forge can now source queries and click-through data from the satisfaction schema to replay against an Elasticsearch cluster. It calculates Paul's engine score (I need a better name than "engine score" for that...) with a single CirrusSearch config, or it can apply a grid search over a multi-dimensional set of CirrusSearch config values (numbers only for now) to try to find the optimal values for some parameters.
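Schematically, the grid-search mode does something like this (a sketch, not the actual Relevance Forge code; replay_queries and score_results stand in for the replay and engine-score steps):

    import itertools


    def grid_search(base_config, param_grid, replay_queries, score_results):
        """param_grid, e.g. {'phrase_rescore_boost': [0.1, 0.3, 1.0, 10.0],
                             'incoming_links_weight': [0.5, 1.0, 1.5]}.
        replay_queries(config) runs the logged queries against the cluster with
        that CirrusSearch config; score_results(results) computes the engine
        score. Both are assumed interfaces for this sketch."""
        best_config, best_score = None, float("-inf")
        names = sorted(param_grid)
        for values in itertools.product(*(param_grid[name] for name in names)):
            config = dict(base_config, **dict(zip(names, values)))
            score = score_results(replay_queries(config))
            if score > best_score:
                best_config, best_score = config, score
        return best_config, best_score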
The first couple of test runs look to (unsurprisingly) heavily prefer our current results. I say unsurprisingly because the satisfaction schema shows ~87% of clicks falling within the top 3 results. Additionally, Paul's engine score is meant to work best when grouping by users over some length of time, where we only have ~10 minute sessions containing an average of 2 searches per session. Because of this I have also implemented nDCG calculation within Relevance Forge (it was relatively easy), but unfortunately we do not yet have the data to plug into that equation to generate a score. Queries being PII can make this a bit difficult, but I'm sure we can find a reasonable way. I demoed a POC at last week's CREDIT showcase that makes collecting these scores less of a chore, but it needs to be expanded. There are at least two main things that have to happen before we can start calculating nDCG. I really like Justin's idea to pull in results from other search engines to be scored; this would allow us to improve recall and see a positive improvement in the nDCG score. We also have to either get a set of queries whitelisted through Legal so we can run this in labs and have our own scoring plus assistance from the community, or build up a database of queries and distribute it to team members so they can rate the relevance of queries on their local machine, and then aggregate those databases back together (a pain... there must be a better way).
One issue I see is that neither of these metrics, Paul's engine score or nDCG, directly measures the impact of fixing our recall problems, although perhaps I misread how the idealized DCG is supposed to work. Pulling in results from other search engines will help, so at least improved recall won't result in a reduced score, but I was thinking maybe we could calculate an alternate idealized DCG which uses the highest-rated documents known for a query, rather than just re-sorting the returned result set by relevance score?
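For reference, the distinction I have in mind in code form (a sketch; the log2 discount is the textbook DCG formulation, and ndcg_alternate is my proposed variant where the ideal is built from the best-rated documents known for the query, from any source, rather than from the returned set alone):

    import math


    def dcg(gains):
        """gains: graded relevance scores of the results, in ranked order."""
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


    def ndcg_standard(result_gains):
        """Ideal DCG = the returned result set re-sorted by its own grades,
        so a document the engine never returned cannot hurt the score."""
        ideal = dcg(sorted(result_gains, reverse=True))
        return dcg(result_gains) / ideal if ideal else 0.0


    def ndcg_alternate(result_gains, best_known_gains):
        """Ideal DCG = the best-rated documents known for the query from any
        source (other engines, human suggestions), so recall failures do
        lower the score."""
        k = len(result_gains)
        ideal = dcg(sorted(best_known_gains, reverse=True)[:k])
        return dcg(result_gains) / ideal if ideal else 0.0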
Thanks Erik, your new patch was a revelation! :)
Yes, I totally agree: none of these metrics will help us find the features needed to improve recall. If a result is not in the top 20 it can't be evaluated by the score. Paul's score and query clicks grabbed from search logs will help us tune the existing features but won't help us discover the missing ones. Both nDCG and Paul's score measure the ability of the system to rank the top 20 results properly, but unfortunately we know nothing about the results that are outside the top 20: i.e., are the relevant docs found or are they missed? Is it actually a ranking precision problem that results in bad recall in the top 20, or a pure recall problem because the result is not even in the whole result set?
Example: some users pointed out the inability of the system to weight words in the title properly. On enwiki, here are some queries where the relevant result is not in the top 20 but is in the whole result set (page 2 or more):
- history france => History of France
- france middle ages => France in the Middle Ages
- syrian civil war casualties => Casualties of the Syrian Civil War
- cities in the san francisco bay area => List of cities and towns in the San Francisco Bay Area
- new york university alumni => List of New York University alumni
And this one, which is imho very representative of the problem:
- legend film 2015 => Legend (2015 film)
In these cases our "recall problem" is in reality a ranking precision problem, where "recall features" (synonyms, spell corrections, phonetic matching...) would not really help (they could even make things worse). All these queries were reported on Phabricator or on this list. I'm not sure we can really find a score that helps us discover such problems; these scores are only useful for tuning the top-N ranking. We should maybe start to build a small set of queries that are each easily mapped to a system feature, more like a simple regression test. This is maybe what Paul suggests when he says that the set of queries you choose is not necessarily representative of the usage (search logs and clicks) but relevant to the system?
When you say "I was thinking maybe we could calculate an alternate idealized DCG which uses the highest rated documents for a query, rather than just resorting the result set by the relevance score?", do you mean a system that is able to say: if result Y is not returned in the top 10 for query X then it's a problem? A score that acts as a set of constraints, a kind of "must-have" results list for specific queries? If yes, I agree with you, and it is what I did naively when playing with your patch (see below).
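In that spirit the constraints could live next to the score as a tiny regression-style check, something like this (a sketch; search() stands in for whatever replays a query against the candidate configuration, and the constraint list just reuses examples from this thread):

    # Hypothetical "must-have" constraints: (query, expected title, max acceptable rank)
    CONSTRAINTS = [
        ("kennedy", "JFK", 10),
        ("legend film 2015", "Legend (2015 film)", 20),
        ("history france", "History of France", 20),
    ]


    def violated_constraints(search, constraints=CONSTRAINTS):
        """search(query) -> ordered list of result titles (assumed interface).
        Returns the constraints whose expected title is missing or ranked too low."""
        violations = []
        for query, expected, max_rank in constraints:
            if expected not in search(query)[:max_rank]:
                violations.append((query, expected, max_rank))
        return violations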
Experiments with your patch: I've set up BM25 and a custom query builder to specifically address the bad precision on words in the title (BM25 was needed because the default Lucene TF/IDF is very bad and does not allow fine-tuning: the score range is too big and noisy). I tried to manually constrain the system with the following queries on an enwiki index (http://en-suggesty.wmflabs.org/wiki/Special:Search):
- kennedy => JFK in the top 10 (feature: boost on incoming links/page views)
- sf novelist (and sf novelist*s*) => List of science fiction authors (a constraint not to overboost phrase or title)
- all the queries listed above concerning poor title matching
Manual tuning was extremely hard and misleading: the phrase rescore boost was initially 10 (the current default) and I had to decrease it to 0.2 (it took a lot of my time playing with various weight settings). Surprisingly, automatic tuning with Erik's patch reached nearly the same conclusion with a totally different query set; the preferred phrase boost was 0.3 (it took a few hours to compute). Paul's score started at 0.50 and maxed out at 0.58. Below is the result of 4 iterations where I gradually narrowed the search space:
- https://drive.google.com/drive/folders/0Bzo2vOqfrXhJMzFmcWthZ0Ytc2c
- x axis: weight of the phrase rescore => best 0.3
- y axis: weight of incoming_links => best 1.0
Note that my initial constraints were still respected (and even slightly better in some cases). The incoming links weight was way higher than I expected, but as Erik said it's probably because Paul's score prefers the current results, and the formula currently used on enwiki tends to overboost incoming links.
Then, because it was fun to play with, I tried to replace the number of incoming links with pageviews data as a query-independent factor (in a previous test the system seemed to prefer incoming links). I played with the weight (x axis) and k (k is a factor to re-shape the pageviews distribution). I was unable to isolate an optimal point, but the score maxed out at 0.61 (vs 0.58 with incoming links) with a very high weight (maybe more than 2):
- https://drive.google.com/file/d/0Bzo2vOqfrXhJdTFNUEZEanhzQVU/view?usp=sharin... (sorry, it's hard to read because k is very low; a log-scale graph would be more useful here).
Here I think we start to "over-optimize" the params against the Paul's-score query set, but again, surprisingly, my initial constraints are nearly respected (kennedy => JFK was out of the top 10 but still on the first page at #13)...
I ran the same optimization again with incoming links to double-check. Again I played with weight (x axis) and k; the score maxed out at 0.58:
- https://drive.google.com/file/d/0Bzo2vOqfrXhJUmMxV2V0OThZUE0/view?usp=sharin...
But with the optimal settings (k=10 and w=1.2) my constraints are broken: kennedy => JFK is no longer on the first page, so I stopped here.
With a set of sensible constraints combined into the loop, I tend to think that "semi-automatic" optimization is possible.
Note that the number of queries used in this experiment is way too low to draw any conclusions, but I think it shows that the *method* could work. With larger query sets and more constraints we could maybe conclude that pageviews are a better signal than incoming links?
Dwell time: would it be hard to graph dwell time vs. next-search similarity? This may give you an idea of the correct dwell time thresholds for your satisfaction metric. For query similarity, the Jaccard index is trivial (intersection count / union count); TF-IDF-weighted cosine similarity shouldn't be hard either.
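E.g. something like this (whitespace tokenization here is only a placeholder for whatever analysis chain you actually use):

    def jaccard(query_a, query_b):
        """Jaccard index of the two queries' token sets:
        |intersection| / |union|."""
        a = set(query_a.lower().split())
        b = set(query_b.lower().split())
        return len(a & b) / len(a | b) if a | b else 0.0


    # jaccard("legend film 2015", "legend 2015 movie") -> 0.5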
If you have enough data, the right dwell time will likely change based on various page factors {isDisambiguation, page length, page categories, etc.} and you could make an independent graph for each. On a disambiguation page you may want to merge in the dwell time of a click from that page. This is a trade-off which implies that disambiguation pages are ~0 cost.
User satisfaction metric (I think this is your search engine score?): you may want to include time-to-success and time-to-failure as additional factors, # clicks in session, etc. The strongest is likely a session success rate (did they eventually find what they wanted). This style of combined metric tends to evolve over time as you gain greater insight into what determines whether a user is satisfied.
Paul's metric: sounds a lot like DCG (part of nDCG); it seems to average over distinct users, which otherwise could cause skew in small search engines (heavy users would skew the metrics). [Note: I haven't seen his talk.]
Optimization: there are two main training set styles to use: (1) hand-judged results, or (2) click-based logs. There are many trade-offs. Hand-judged training sets are needed for sites with low traffic, and they nicely need less prep work. Click-based training sets require prep work, like a good metric to optimize towards (e.g., SAT, which is non-obvious to discern). Hand-judged sets often lack context (e.g., location of the user, freshness of documents, intent of the user, other results displayed, etc.) and instead try to give a global label to a query-result pair.
nDCG & recall: the normalization in nDCG is created by finding the optimal ordering of the results expected to be in your index. For example, your judgement set from enwiki for q=[legend film 2015] (https://en.wikipedia.org/w/index.php?search=legend+film+2015&title=Special%3ASearch&go=Go) may not include the correct result, Legend (2015 film) (https://en.wikipedia.org/wiki/Legend_(2015_film), at pos #25). To counter this, include results from Google (https://www.google.com/search?q=site%3Awikipedia.org+legend+film+2015), Bing (https://www.bing.com/search?q=site%3Awikipedia.org+legend+(2015+film)), DDG (https://duckduckgo.com/?q=site%3Awikipedia.org+legend+(2015+film)), Yandex (https://www.yandex.com/search/?text=site%3Awikipedia.org%20legend%20(2015%20film)), etc. where q=[site:wikipedia.org legend film 2015] (the expected result is at pos #1 in each); or judge the top 50 results from each engine. Or explicitly ask a judge "what are the best pages for this query" and include those candidate pages for the next judge.
Judgement platform: I prefer to display all of the results at once instead of one-by-one. This increases ratings per second at the cost of independence. Once you have enough judgements in, you may want to look at inter-judge agreement: do all judges rate a query-result pair a 4? Do some rate it a 1? That tends to imply poor instructions, though true disagreements happen too.
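A first pass at the agreement check can be very simple, e.g. flagging pairs whose grades are more than one step apart (exact agreement is crude; something like Cohen's or Fleiss' kappa would correct for chance agreement):

    from collections import defaultdict


    def disagreements(ratings, max_spread=1):
        """ratings: iterable of (judge, query, doc_id, grade) with grades 1-4.
        Returns {(query, doc_id): [grades]} for pairs where judges disagree by
        more than max_spread, i.e. candidates for reviewing the instructions."""
        by_pair = defaultdict(list)
        for _judge, query, doc_id, grade in ratings:
            by_pair[(query, doc_id)].append(grade)
        return {pair: grades for pair, grades in by_pair.items()
                if len(grades) > 1 and max(grades) - min(grades) > max_spread}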
--justin
On Mar 22, 2016 2:26 AM, "Justin Ormont" justin.ormont@gmail.com wrote:
Dwell time: would it be hard to graph dwell time vs. next-search similarity? This may give you an idea of the correct dwell time thresholds for your satisfaction metric. For query similarity, the Jaccard index is trivial (intersection count / union count); TF-IDF-weighted cosine similarity shouldn't be hard either.
Shouldn't be too hard; it looks like something that could be done fairly easily by pulling all the data into Python to calculate and graph. Our sessions are only 10 minutes currently and have an average of just under 2 searches per session, so I'm not sure how much data will remain after filtering down to sessions with a search after a click, but it isn't too hard to find out.
If you have enough data, the right dwell time will likely change based on various page factors {isDisambiguation, page length, page categories, etc.} and you could make an independent graph for each. On a disambiguation page you may want to merge in the dwell time of a click from that page. This is a trade-off which implies that disambiguation pages are ~0 cost.
I'm thinking it's unlikely we currently have enough data for this, partially because we were only collecting data for desktop full-text search. We expanded this to start collecting autocomplete data a week ago, and we have a goal for the next couple of months to expand to mobile web and apps, so hopefully in the future we will start to have more data, but it's hard to say yet how much that will be.
User satisfaction metric (I think this is your search engine score?): you may want to include time-to-success and time-to-failure as additional factors, # clicks in session, etc. The strongest is likely a session success rate (did they eventually find what they wanted). This style of combined metric tends to evolve over time as you gain greater insight into what determines whether a user is satisfied.
The engine score right now is just Paul's metric; for satisfaction we are using sessions that click through and have a dwell time of more than 10 seconds. We haven't really been comfortable calling this satisfaction, though, since we can't prove any of those users are satisfied, so it's called augmented clickthrough in our dashboards. We do need to come up with something we can properly label a satisfaction metric, and Mikhail is doing some work in that direction.
Mikhail will also be doing the first analysis of an A/B test with some of these more advanced metrics soon; the test (on the phrase boost weight) was turned off today after running for 8 days. I can't find the right ticket from my phone, but I think he will be looking at time to first click, searches per session, clicks per session, and a few others. It will certainly be interesting to figure out how to merge all this into a more comprehensive SAT metric than the simplistic strict dwell time used today.
Paul's metric: sounds a lot like DCG (part of nDCG); it seems to average over distinct users, which otherwise could cause skew in small search engines (heavy users would skew the metrics). [Note: I haven't seen his talk.]
It is a form of DCG as I understand it as well. The use of per-user data is twofold: one part is keeping heavy users from over-influencing the result; the other is using a binary relevance signal which treats a click-through by a user as relevant to all searches performed by that user. This should mean that if the user does two searches and only clicks a result in the second, the score will improve by moving that result up in the first result set. The default in RelForge currently is to use only clicks with >= 10s dwell time, but as mentioned above this is a very rough SAT metric.
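Concretely the labels end up looking roughly like this (a sketch of the behaviour described above, not the actual RelForge code):

    def relevant_docs_by_user(click_log, min_dwell_seconds=10.0):
        """click_log: iterable of (user, doc_id, dwell_seconds) click events.
        A click with enough dwell marks that document as relevant for *every*
        search by that user, which is how a click on the second search can
        still credit a result shown lower in the first result set."""
        relevant = {}
        for user, doc_id, dwell in click_log:
            if dwell >= min_dwell_seconds:
                relevant.setdefault(user, set()).add(doc_id)
        return relevant

That per-user relevant set is then what the discounted sum in the engine score iterates over.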
Optimization: there are two main training set styles to use: (1) hand-judged results, or (2) click-based logs. There are many trade-offs. Hand-judged training sets are needed for sites with low traffic, and they nicely need less prep work. Click-based training sets require prep work, like a good metric to optimize towards (e.g., SAT, which is non-obvious to discern). Hand-judged sets often lack context (e.g., location of the user, freshness of documents, intent of the user, other results displayed, etc.) and instead try to give a global label to a query-result pair.
Right now we are basically pursuing both types independently to see what works. My current idea is to optimize against one score and then verify by looking at other scores with the chosen configuration vs. the prod settings. Those problems with hand-judged content are becoming more obvious to me as I try out the judgement platform while getting it ready to use. There are queries where I don't know what should be relevant even when looking at results aggregated from a few search engines. I'm hoping that we can get multiple reviewers for each query and that the aggregate will be useful.
nDCG & recall: the normalization in nDCG is created by finding the optimal ordering of the results expected to be in your index. For example, your judgement set from enwiki for q=[legend film 2015] may not include the correct result, Legend (2015 film) (at pos #25). To counter this, include results from Google/Bing/DDG/etc. where q=[site:wikipedia.org legend film 2015] (the expected result is at pos #1 in each); or judge the top 50 results from each engine. Or explicitly ask a judge "what are the best pages for this query" and include those candidate pages for the next judge.
It's not a bad idea to ask the grader for other options, but I'm hoping it won't be necessary in the first iteration. I've integrated the idea of pulling results from multiple engines. I'm rather hopeful that on its own will be plenty initially, because the other engines often have much better results than our own.
Judgement platform: I prefer to display all of the results at once instead of one-by-one. This increases ratings per second at the cost of independence. Once you have enough judgements in, you may want to look at inter-judge agreement: do all judges rate a query-result pair a 4? Do some rate it a 1? That tends to imply poor instructions, though true disagreements happen too.
I hadn't considered what effect showing one at a time would have on ratings per second, but it makes plenty of sense. At this point we are probably better off optimizing for speed and making the independence trade-off, because we don't have a large number of graders to lean on. I'll have to switch up the way that part of the platform works sometime this week.