Well, peeking is okay as long as you don't act on it:
“Peeking” at the data is OK as long as you can restrain yourself from
stopping an experiment before it has run its course. I know this goes
against something in human nature, so perhaps the best advice is: no
peeking!
It does take up time, though, and an analysis based only on data from the
morning of the deployment may not give a representative preview. It's still
fun to peek, though. ;)
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
On Mon, Aug 31, 2015 at 2:05 PM, Mikhail Popov <mpopov(a)wikimedia.org> wrote:
Hi all,
Last week we discussed our approach to A/B testing and we've decided to
have a week (at least) between tests.
A two-week-minimum cadence will give the analysis team enough time to
thoroughly think through the experimental design of each test, and give the
engineers enough time to implement it. That matters because some of the
changes we are planning to test are not trivial, and we don't want to rush a
test out and realize halfway through that we should have been tracking
something we're not.
We are also going to move away from doing initial analyses (analyses of the
data from the morning of a launch), for practical and scientific reasons.
Practical in the sense that we've been putting time and effort into
preliminary results that are not at all representative of the final results,
while pushing other work onto the back burner. Scientific in the sense that
peeking at the data mid-experiment is bad science:
*Repeated significance testing always increases the rate of false
positives, that is, you’ll think many insignificant results are significant
(but not the other way around). The problem will be present if you ever
find yourself “peeking” at the data and stopping an experiment that seems
to be giving a significant result. The more you peek, the more your
significance levels will be off. For example, if you peek at an ongoing
experiment ten times, then what you think is 1% significance is actually
just 5% significance.* – Evan Miller, How Not To Run An A/B Test
<http://www.evanmiller.org/how-not-to-run-an-ab-test.html>
In statistics, this is known as the multiple comparisons problem: the more
tests you perform, the more likely you are to see an effect where there is
none.
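If you want to see this for yourself, here is a rough simulation sketch in
Python (the conversion rate, sample sizes, peek schedule, and nominal alpha
below are made-up illustration values, not anything from our tests). It runs
A/A "experiments" with no true difference between groups and checks an interim
two-proportion z-test at several points along the way:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)

def peeked_aa_test(n_total=10_000, n_peeks=10, alpha=0.01, p=0.1):
    """One A/A 'experiment' (no true difference between groups).

    Returns True if any interim two-proportion z-test crosses the
    nominal significance threshold, i.e. we would have stopped early
    and declared a (spurious) winner.
    """
    a = rng.random(n_total) < p  # control conversions
    b = rng.random(n_total) < p  # treatment conversions, same true rate
    z_crit = norm.ppf(1 - alpha / 2)
    checkpoints = np.linspace(n_total / n_peeks, n_total, n_peeks, dtype=int)
    for n in checkpoints:
        pa, pb = a[:n].mean(), b[:n].mean()
        pooled = (a[:n].sum() + b[:n].sum()) / (2 * n)
        se = np.sqrt(2 * pooled * (1 - pooled) / n)
        if se > 0 and abs(pa - pb) / se > z_crit:
            return True  # stopped early on a spurious "significant" result
    return False

runs = 2_000
fp = sum(peeked_aa_test() for _ in range(runs))
print(f"False positive rate with peeking: {fp / runs:.3f}")

With ten peeks at a nominal 1% level, the fraction of A/A runs flagged as
"significant" at some point comes out noticeably above 1%, which is exactly
the inflation Miller describes.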
Going forward, we are going to wait until we have collected all the data
before analyzing it.
Cheers,
Mikhail, Junior Swifty
Discovery // The Swifties
_______________________________________________
Wikimedia-search mailing list
Wikimedia-search(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikimedia-search