The title is mostly to get your attention; you know I like A/B testing. With that being said:
For a quarter and a bit we've been running A/B tests. Doing so has been intensely time-consuming for both engineering and analysis, and at times it's felt like we're pushing changes out just to test them, rather than because we have reason to believe there will be dramatic improvements.
These tests have produced, at best, mixed results. Many of them have not shown a substantial improvement in the metric we have been testing - the zero results rate. Those that have shown an improvement have not been deployed further, because we cannot, from the ZRR alone, test the _utility_ of the results produced: for that we need to A/B test against clickthroughs, or a satisfaction metric.
So where do we go from here?
In my mind, the ideal is that we stop A/B testing against the zero results rate.
This doesn't mean we stop testing improvements: it means we build the relevance lab up and out and test the zero results rate against _that_. ZRR does not need user participation; it needs the participation of user _queries_. With the relevance lab we can consume user queries and test ideas against them at a fraction of the cost of a full A/B test.
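To make that concrete, here's a minimal sketch of what offline ZRR measurement against a sampled query log could look like. The query sample, the hit-count lookups, and the `zero_results_rate` helper are hypothetical stand-ins, not Relevance Lab code; the point is only that comparing two configurations this way needs queries, not live users.

```python
from typing import Callable, Iterable

def zero_results_rate(queries: Iterable[str], hit_count: Callable[[str], int]) -> float:
    """Fraction of queries for which a search configuration returns no results."""
    queries = list(queries)
    zero = sum(1 for q in queries if hit_count(q) == 0)
    return zero / len(queries)

if __name__ == "__main__":
    # Stand-in query sample and hit counts; in practice these would be a sampled
    # query log and calls into the two index configurations under test.
    sample = ["seatle", "barack obama", "zxqvw"]
    baseline = {"seatle": 0, "barack obama": 120, "zxqvw": 0}
    candidate = {"seatle": 4, "barack obama": 120, "zxqvw": 0}  # e.g. with a typo-tolerant change
    print("baseline ZRR: ", zero_results_rate(sample, baseline.get))
    print("candidate ZRR:", zero_results_rate(sample, candidate.get))
```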
Instead, we use the A/B tests for the other component: the utility component. If something passes the Relevance Lab ring of fire, we A/B test it against clickthroughs. This will happen more rarely than "every two weeks", so we can afford to spend some time making sure the test is A+ scientifically and all our ducks are in a row.
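For the clickthrough side, here's a hedged sketch of what the analysis might boil down to: a two-proportion z-test on clickthrough rates between the control and test buckets. The bucket counts below are invented, and the helper is illustrative rather than an existing analysis script.

```python
import math

def two_proportion_ztest(clicks_a: int, sessions_a: int, clicks_b: int, sessions_b: int):
    """Two-sided z-test for a difference in clickthrough rate between two buckets."""
    p_a = clicks_a / sessions_a
    p_b = clicks_b / sessions_b
    pooled = (clicks_a + clicks_b) / (sessions_a + sessions_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / sessions_a + 1 / sessions_b))
    z = (p_b - p_a) / se
    # Normal-approximation p-value via the error function.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return p_a, p_b, z, p_value

# Hypothetical bucket counts -- not real data.
ctr_a, ctr_b, z, p = two_proportion_ztest(clicks_a=4_200, sessions_a=50_000,
                                           clicks_b=4_550, sessions_b=50_000)
print(f"control CTR {ctr_a:.3%}, test CTR {ctr_b:.3%}, z={z:.2f}, p={p:.4f}")
```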
The result will be not only better tests but a better impact for users, because we will actually be able to deploy the improvements we have worked on - something that has so far escaped us because attention has been focused on deploying More Tests rather than fully validating the ones we have already deployed.
Thoughts?