The title is mostly to get your attention; you know I like A/B
testing. With that being said:
For a quarter and a bit we've been running A/B tests. Doing so has
been intensely time-consuming for both engineering and analysis, and
at times it's felt like we're pushing changes out just to test them,
rather than because we have reason to believe there will be dramatic
improvements.
These tests have produced, at best, mixed results. Many of them have
not shown a substantial improvement in the metric we have been
testing, the zero results rate (ZRR). Those that did show an
improvement have not been deployed further, because the ZRR alone
cannot tell us about the _utility_ of the results produced: for that
we need to A/B test against clickthroughs, or a satisfaction metric.
So where do we go from here?
In my mind, the ideal is that we stop A/B testing against the zero results rate.
This doesn't mean we stop testing improvements: this means we build
the relevance lab up and out and test the zero results rate against
/that/. The ZRR does not need user participation; it needs the
participation of user *queries*. With the relevance lab we can consume
user queries and test ideas against them at a fraction of the cost of
a full A/B test.
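To make that concrete, here's a minimal sketch of what an offline ZRR
comparison could look like. It assumes a hypothetical
run_query(query, config) helper that replays a logged query against a
given search configuration and returns a hit count; those names are
illustrative, not the actual relevance lab API.

    def zero_results_rate(queries, config, run_query):
        """Fraction of logged queries returning zero results under `config`."""
        zero_hits = sum(1 for q in queries if run_query(q, config) == 0)
        return zero_hits / len(queries)

    def compare_configs(queries, baseline, candidate, run_query):
        """Compare ZRR under a baseline and a candidate configuration."""
        before = zero_results_rate(queries, baseline, run_query)
        after = zero_results_rate(queries, candidate, run_query)
        print("baseline ZRR:  {:.2%}".format(before))
        print("candidate ZRR: {:.2%}".format(after))
        return after - before

Point something like that at a sample of logged queries and two search
configurations and you get a ZRR delta in minutes, with no users in the
loop.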
Instead, we use the A/B tests for the other component: the utility
component. If something passes the Relevance Lab ring of fire, we A/B
test it against clickthroughs: this will be rarer than "every two
weeks" and so we can afford to spend some time making sure the test is
A+ scientifically, and all our ducks are in a row.
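As a sketch of the kind of rigour I mean: one plain way to check
whether a candidate bucket's clickthrough rate genuinely differs from
control is a two-proportion z-test. The counts below are made up, and
the real test design (metric definition, power, duration) would be
worked out per experiment.

    from math import sqrt, erf

    def two_proportion_ztest(clicks_a, sessions_a, clicks_b, sessions_b):
        """Two-sided z-test for a difference in clickthrough rate."""
        p_a = clicks_a / sessions_a
        p_b = clicks_b / sessions_b
        pooled = (clicks_a + clicks_b) / (sessions_a + sessions_b)
        se = sqrt(pooled * (1 - pooled) * (1.0 / sessions_a + 1.0 / sessions_b))
        z = (p_a - p_b) / se
        # Two-sided p-value from the standard normal CDF.
        p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
        return z, p_value

    # Hypothetical clickthrough counts for the control and test buckets.
    z, p = two_proportion_ztest(4200, 50000, 4480, 50000)
    print("z = {:.2f}, p = {:.4f}".format(z, p))

Running fewer, better-planned tests gives us room for this kind of
care up front, rather than bolting the analysis on afterwards.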
The result will be not only better tests, but a greater impact for
users, because we will actually be able to deploy the improvements we
have worked on - something that has thus far escaped us because our
attention has been focused on deploying More Tests rather than fully
validating the ones we have already deployed.
Thoughts?
--
Oliver Keyes
Count Logula
Wikimedia Foundation