Well, peeking is okay as long as you don't act on it:
“Peeking” at the data is OK as long as you can restrain yourself from
stopping an experiment before it has run its course. I know this goes
against something in human nature, so perhaps the best advice is: no
peeking!
It does take up time, though, and an analysis based only on data from the
morning of the deployment may not give a representative preview. It's still
fun to peek, though. ;)
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
On Mon, Aug 31, 2015 at 2:05 PM, Mikhail Popov <mpopov(a)wikimedia.org> wrote:
Hi all,
Last week we discussed our approach to A/B testing and we've decided to
have a week (at least) between tests.
A two-week-minimum cadence will give the analysis team enough time to
thoroughly think through the experimental design of each test, and give the
engineers enough time to implement it. That matters because some of the
changes we are planning to test are not trivial, and we don't want to rush a
test out and realize halfway through that we should have been tracking
something we're not.
We are also going to move away from doing initial analyses (analyses of the
data from the morning of a launch), for practical and scientific reasons.
Practical in the sense that we've been putting time and effort into
preliminary results that are not at all representative of the final results,
while pushing other work onto the back burner. Scientific in the sense that
peeking at the data mid-experiment is bad science:
*Repeated significance testing always increases the rate of false
positives, that is, you’ll think many insignificant results are significant
(but not the other way around). The problem will be present if you ever
find yourself “peeking” at the data and stopping an experiment that seems
to be giving a significant result. The more you peek, the more your
significance levels will be off. For example, if you peek at an ongoing
experiment ten times, then what you think is 1% significance is actually
just 5% significance.* – Evan Miller, How Not To Run An A/B Test
<http://www.evanmiller.org/how-not-to-run-an-ab-test.html>
In statistics, this is known as the multiple comparisons problem: the more
tests you perform, the more likely you are to see an effect where there is
none.
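If you want to see this for yourself, here is a rough simulation sketch in
Python (the conversion rate, sample sizes, peek schedule, and nominal alpha
below are made-up illustration values, not anything from our tests). It runs
A/A "experiments" with no true difference between groups and checks an interim
two-proportion z-test at several points along the way:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)

def peeked_aa_test(n_total=10_000, n_peeks=10, alpha=0.01, p=0.1):
    """One A/A 'experiment' (no true difference between groups).

    Returns True if any interim two-proportion z-test crosses the
    nominal significance threshold, i.e. we would have stopped early
    and declared a (spurious) winner.
    """
    a = rng.random(n_total) < p  # control conversions
    b = rng.random(n_total) < p  # treatment conversions, same true rate
    z_crit = norm.ppf(1 - alpha / 2)
    checkpoints = np.linspace(n_total / n_peeks, n_total, n_peeks, dtype=int)
    for n in checkpoints:
        pa, pb = a[:n].mean(), b[:n].mean()
        pooled = (a[:n].sum() + b[:n].sum()) / (2 * n)
        se = np.sqrt(2 * pooled * (1 - pooled) / n)
        if se > 0 and abs(pa - pb) / se > z_crit:
            return True  # stopped early on a spurious "significant" result
    return False

runs = 2_000
fp = sum(peeked_aa_test() for _ in range(runs))
print(f"False positive rate with peeking: {fp / runs:.3f}")

With ten peeks at a nominal 1% level, the fraction of A/A runs flagged as
"significant" at some point comes out noticeably above 1%, which is exactly
the inflation Miller describes.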
Going forward, we are going to wait until we have collected all the data
before analyzing it.
Cheers,
Mikhail, Junior Swifty
Discovery // The Swifties
_______________________________________________
Wikimedia-search mailing list
Wikimedia-search(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikimedia-search