Hey all,
Several weeks ago we ran an A/B test to try to decrease the number of searches on Wikipedia returning zero results. This consisted of a small config change that reduced the confidence needed for our systems to provide search results, along with a change to the smoothing algorithm used to improve the quality of the results now provided.
Our intent was not only to reduce the zero results rate but also to prototype the actual process of A/B testing and identify issues we could fix to make future tests easier and more reliable - this was the first A/B test we had run.
5% of searches were assigned to a control group, and an additional 5% were subjected to the reduced-confidence and smoothing-algorithm changes. An initial analysis over the first day's 7m events was run on 7 August, and a final analysis looking at an entire week of data was completed yesterday. You can read the full results at https://github.com/wikimedia-research/SuggestItOff/blob/master/initial_analy... and https://github.com/wikimedia-research/SuggestItOff/blob/master/final_analysi...
Based on what we've seen, we conclude that there is, at best, a negligible effect from this change - and it's hard to tell whether there is an effect at all. Accordingly, we recommend that the default behaviour be used for all users and that the experiment be disabled.
This may sound like a failure, but it's actually not. For one thing, we've learned that these config variables probably aren't our avenue for dramatic changes - the defaults are pretty sensible. If we're looking for dramatic changes in the zero results rate, we should be making dramatic changes to the system's behaviour.
In addition, we identified a lot of process issues we can fix for the next round of A/B tests, making it easier to analyse the data that comes in and making the results easier to rely on. These include:
1. Who the gods would destroy, they first bless with real-time analytics. The dramatic difference between the outcomes of the initial and final analyses speaks partly to the small size of the effect and to the power analysis issues mentioned in the next point, but it's also a good reminder that a single day of data, however many datapoints it contains, is rarely the answer. User behaviour varies dramatically depending on the day of the week or the month of the year - a week should be seen as the minimum testing period for us to have any confidence in what we see.

2. Power analysis is a must-have. Our hypothesis for the negligible and wildly varying size of the effect is simply the amount of data we had; when you're looking at millions upon millions of events for a pair of options, you're going to see patterns - because with enough data stared at for long enough, pretty much anything can happen. That doesn't mean it's /real/. In the future we need to set our sample size using a proper, a priori power analysis - or switch to Bayesian methods where this sort of problem doesn't appear. A rough sketch of what that sizing calculation might look like follows below.
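For the curious, here's a minimal sketch of the kind of a priori sizing calculation we have in mind, using Python and statsmodels; the baseline zero results rate and the minimum improvement worth detecting are illustrative placeholders, not figures from this test:

# A priori power analysis for comparing the zero results rate between a
# control bucket and a test bucket. The rates below are illustrative
# assumptions, not measured values.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

baseline_rate = 0.30          # assumed current zero results rate
minimum_detectable = 0.29     # smallest improvement we'd care about

# Cohen's h effect size for the difference between two proportions
effect_size = proportion_effectsize(baseline_rate, minimum_detectable)

# Solve for the sample size per bucket needed to detect that effect
analysis = NormalIndPower()
n_per_bucket = analysis.solve_power(
    effect_size=effect_size,
    alpha=0.05,   # tolerated false-positive rate
    power=0.8,    # probability of detecting a real effect of this size
    ratio=1.0,    # equal-sized control and test buckets
)
print("Searches needed per bucket: %d" % round(n_per_bucket))

Running something like this before the test starts tells us how long each bucket needs to collect data, rather than leaving us to eyeball patterns in however many events happen to arrive.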
These are not new lessons to the org as a whole (at least, I hope not) but they are nice reminders, and I hope that sharing them allows us to start building up an org-wide understanding of how we A/B test and the costs of not doing things in a very deliberate way.
Thanks,