Hey all,
Several weeks ago we ran an A/B test to try to decrease the number of searches on Wikipedia that return zero results. The test consisted of a small config change that reduced the confidence our systems require before providing search results, along with a change to the smoothing algorithm used to improve the quality of the results that are now provided.
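For the curious, the knobs involved behave roughly like the confidence and smoothing settings on Elasticsearch's phrase suggester. A minimal sketch of that kind of query-level change, with made-up values (the actual config names and numbers aren't in this mail), looks like:

    # Sketch of a phrase-suggest query with the two knobs touched by the test.
    # Assumes Elasticsearch's phrase suggester; the index, field, confidence
    # value and smoothing parameters below are illustrative only.
    import json

    suggest_body = {
        "suggest": {
            "did_you_mean": {
                "text": "noble prize",
                "phrase": {
                    "field": "suggest",
                    # Lower confidence means suggestions are offered more readily
                    # (the Elasticsearch default is 1.0).
                    "confidence": 0.5,
                    # Swap the smoothing model used to score candidate phrases.
                    "smoothing": {"laplace": {"alpha": 0.3}},
                },
            }
        }
    }

    print(json.dumps(suggest_body, indent=2))

Again, those numbers are placeholders to make the shape of the change concrete, not the values we actually tested.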
Our intent was not only to reduce the zero results rate but also to prototype the actual process of A/B testing and identify issues we could fix to make future tests easier and more reliable - this was the first A/B test we had run.
5% of searches were registered as a control group, and an additional 5% were subjected to the reduced confidence and the smoothing algorithm change. An initial analysis over the first day's 7m events was run on 7 August, and a final analysis covering an entire week of data was completed yesterday. You can read the full results at https://github.com/wikimedia-research/SuggestItOff/blob/master/initial_analy... and https://github.com/wikimedia-research/SuggestItOff/blob/master/final_analysi...
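(If the 5%/5% split sounds abstract: the bucketing itself is simple. A toy sketch of deterministic assignment, with a hypothetical per-search identifier rather than our production code, would be:)

    import hashlib

    def ab_bucket(search_id: str) -> str:
        """Deterministically map a search to 'control', 'test' or 'none'.

        Hashes a per-search identifier into [0, 1] and carves off 5% for
        the control group and a further 5% for the test group. The
        identifier and percentages are illustrative, not our real setup.
        """
        digest = hashlib.sha1(search_id.encode("utf-8")).hexdigest()
        point = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
        if point < 0.05:
            return "control"
        if point < 0.10:
            return "test"
        return "none"

    # e.g. ab_bucket("request-1234") -> "control", "test" or "none"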
Based on what we've seen, we conclude that there is, at best, a negligible effect from this change - and it's hard to tell if there is any effect at all. Accordingly, we recommend that the default behaviour be used for all users and that the experiment be disabled.
This may sound like a failure, but it's actually not. For one thing, we've learned that the config variables here probably aren't our avenue for dramatic changes - the defaults are pretty sensible. If we're looking for dramatic changes in the rate, we should be applying dramatic changes to the system's behaviour.
In addition, we identified a lot of process issues we can fix for the next round of A/B tests, making it easier to analyse the data that comes in and making the results easier to rely on. These include:
1. Who the gods would destroy, they first bless with real-time analytics. The dramatic difference between the outcomes of the initial and final analyses speaks partly to the small size of the effect seen and to the power analysis issues mentioned below, but it's also a good reminder that a single day of data, however many datapoints it contains, is rarely the answer. User behaviour varies dramatically depending on the day of the week or the month of the year - a week should be seen as the minimum testing period for us to have any confidence in what we see.

2. Power analysis is a must-have. Our hypothesis for the negligible and wildly varying size of the effect is simply the amount of data we had; when you're looking at millions upon millions of events for a pair of options, you're going to see patterns - because with enough data stared at for long enough, pretty much anything can happen. That doesn't mean it's /real/. In the future we need to be setting our sample size using proper, a-priori power analysis - or switching to Bayesian methods where this sort of problem doesn't appear. (A sketch of what that calculation can look like follows below.)
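To make point 2 concrete, here's roughly what an a-priori power calculation for comparing two proportions looks like; the baseline zero-results rate and the minimum effect worth caring about are placeholders I've made up, not numbers from this experiment:

    # A-priori power analysis for comparing two proportions (e.g. the
    # zero-results rate in control vs. test). Numbers are placeholders.
    from statsmodels.stats.proportion import proportion_effectsize
    from statsmodels.stats.power import NormalIndPower

    baseline_rate = 0.30    # assumed zero-results rate in the control group
    minimum_change = 0.01   # smallest absolute drop we'd actually care about

    effect = proportion_effectsize(baseline_rate, baseline_rate - minimum_change)
    n_per_group = NormalIndPower().solve_power(
        effect_size=effect,
        alpha=0.05,               # tolerated false-positive rate
        power=0.8,                # chance of detecting the effect if it's real
        ratio=1.0,                # equal-sized control and test groups
        alternative="two-sided",
    )
    print(f"~{n_per_group:,.0f} searches needed in each bucket")

Run before the test, a calculation like this tells us how long to sample for; run after, it tells us whether an apparent effect is even detectable at the sample size we had.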
These are not new lessons to the org as a whole (at least, I hope not) but they are nice reminders, and I hope that sharing them allows us to start building up an org-wide understanding of how we A/B test and the costs of not doing things in a very deliberate way.
Thanks,
wikimedia-search@lists.wikimedia.org