The title is mostly to get your attention; you know I like A/B testing. With that being said:
For a quarter and a bit we've been running A/B tests. Doing so has been intensely time-consuming for both engineering and analysis, and at times it's felt like we're pushing changes out just to test them, rather than because we have reason to believe there will be dramatic improvements.
These tests have produced, at best, mixed results. Many of the tests have not shown a substantial improvement in the metric we have been testing - the zero results rate (ZRR). Those that have shown an improvement have not been deployed further, because we cannot, from the ZRR alone, test the _utility_ of the results produced: for that we need to A/B test against clickthroughs, or a satisfaction metric.
So where do we go from here?
In my mind, the ideal is that we stop A/B testing against the zero results rate.
This doesn't mean we stop testing improvements: this means we build the relevance lab up and out and test the zero results rate against /that/. ZRR does not need user participation, it needs the participation of user *queries*: with the relevance lab we can consume user queries and test ideas against them at a fraction of the cost of a full A/B test.
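To make that concrete, here's a rough sketch of what "test the ZRR against a query corpus" could look like - purely illustrative, with run_search_a/run_search_b as hypothetical stand-ins for however the relevance lab actually exposes the two search configurations:

# Illustrative sketch only: run_search_a/run_search_b are hypothetical stand-ins
# for "the same query corpus run through two search configurations".

def zero_results_rate(queries, run_search):
    """Fraction of queries that return no results under a given search function."""
    return sum(1 for q in queries if not run_search(q)) / len(queries)

def compare_zrr(queries, run_search_a, run_search_b):
    zrr_a = zero_results_rate(queries, run_search_a)
    zrr_b = zero_results_rate(queries, run_search_b)
    print(f"ZRR baseline:  {zrr_a:.2%}")
    print(f"ZRR candidate: {zrr_b:.2%}")
    print(f"delta:         {zrr_b - zrr_a:+.2%}")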
Instead, we use the A/B tests for the other component: the utility component. If something passes the Relevance Lab ring of fire, we A/B test it against clickthroughs: this will be rarer than "every two weeks" and so we can afford to spend some time making sure the test is A+ scientifically, and all our ducks are in a row.
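Purely as an illustration of the kind of comparison "test it against clickthroughs" means (made-up counts, and a plain two-proportion z-test, which is cruder than the analysis we'd actually run):

# Illustrative only: invented counts, simple per-search two-proportion z-test.
# A real analysis would work at the session level and be more careful.
from math import sqrt, erf

def two_proportion_z(clicks_a, n_a, clicks_b, n_b):
    """z statistic and two-sided p-value for a difference in clickthrough rates."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))   # two-sided normal tail
    return z, p_value

# e.g. 8.4% vs 9.0% clickthrough on 50,000 searches per bucket:
z, p = two_proportion_z(4200, 50000, 4500, 50000)
print(f"z = {z:.2f}, p = {p:.4f}")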
The result will be not only better tests, but a better impact for users, because we will actually be able to deploy the improvements we have worked on - something that has thus far escaped us due to attention being focused on deploying More Tests rather than completely validating the ones we have already deployed.
Thoughts?
> The result will be not only better tests, but a better impact for users, because we will actually be able to deploy the improvements we have worked on
That is the hope and promise of the relevance lab. And it can test more than the ZRR.
We already have the ability to check the rates of change to the top *n* results (ignoring order or not). So, you can see if anything breaks into the top 5 results (ignoring shuffling among the top 5), or note any change at all in the top 5 (including reshuffling). Mechanically, you can see that, for example, a change only affects the top 5 results for 0.1% of your query corpus, so regardless of whether it is good or bad, it isn't super high impact.
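Roughly, the comparison being made is something like this (a sketch only - hypothetical data structures, not the actual relevance lab code):

# Sketch of the top-n comparison idea; not the actual relevance lab code.
# `baseline` and `candidate` map each query to its ranked list of result IDs.

def top_n_changed(baseline, candidate, n=5, ignore_order=True):
    """Fraction of queries whose top-n results differ between two runs."""
    changed = 0
    for query, base_results in baseline.items():
        a, b = base_results[:n], candidate[query][:n]
        if ignore_order:
            differs = set(a) != set(b)    # something broke into (or out of) the top n
        else:
            differs = a != b              # any change at all, including reshuffling
        changed += differs
    return changed / len(baseline)

# e.g. top_n_changed(baseline, candidate, n=5) == 0.001 means only 0.1% of the
# corpus sees a new page in its top 5.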
The relevance lab report also provides links to diffs of example queries that are affected by a change, so you can review them to get a sense of whether they are good or bad. It's subjective, but you can sometimes get a rough sense of things by looking at a few dozen randomly chosen examples. If you look at 25 examples and 90% of them are clearly worse because of the change, you know you need to fix something, even with the giant confidence interval such a small sample entails.
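To put a rough number on that: a 95% Wilson interval around "90% of 25 look worse" still runs from about 72% to 97%, which is wide but comfortably on the "mostly worse" side. A quick sketch of the arithmetic:

# Back-of-the-envelope 95% Wilson score interval for "90% of 25 examples look
# worse". Illustrative arithmetic only.
from math import sqrt

def wilson_interval(p_hat, n, z=1.96):
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

low, high = wilson_interval(0.9, 25)
print(f"95% interval: {low:.0%} to {high:.0%}")   # roughly 72% to 97%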
So, even without a gold standard corpus of graded search results, you can use the relevance lab to do some pre-testing of a change and get a sense of whether it's doing what you want on general queries (and not just the handful you were focused on trying to fix).
You can also test impact and effectiveness on a focused query corpus of rarer query types (e.g., queries of more than 10 words).
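Carving that kind of focused corpus out of a general one is cheap; a trivial sketch, assuming the corpus is just a list of query strings:

# Sketch: building a focused corpus of one rare query type (long queries),
# assuming the corpus is a plain list of query strings.
def long_queries(queries):
    """Queries of more than 10 words, as one example of a rare query type."""
    return [q for q in queries if len(q.split()) > 10]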
And adding other metrics is pretty straightforward, if anyone has ideas for other kinds of changes worth tracking. And, on the back burner, I have some ideas for improvements that look at other annotations on a query and how to incorporate/test those in the relevance lab.
So, I agree with you on the relevance lab side. I'm also looking forward to better user acceptance testing, whether through more complex click-stream metrics or micro surveys or whatever else works. We collectively suffer from the curse of knowledge when it comes to search—it's hard to know what users who don't spend a non-trivial portion of their professional lives contemplating search will really like/want/use.
—Trey
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
Oh wow, these are fantastic features!
--
Oliver Keyes
Count Logula
Wikimedia Foundation