Yes! Thanks for putting this all into words. I'm really bad at putting things in writing, so I appreciate this even more.

On Thursday, December 10, 2015, Kevin Smith <ksmith@wikimedia.org> wrote:

Excellent summary. Please make sure this is on wiki as well.

Thanks

Kevin

On Dec 10, 2015 8:05 AM, "Oliver Keyes" <okeyes@wikimedia.org> wrote:
Totally unrelated to my previous email, I promise. This is just me
writing down my thinking on how A/B testing works, and how it applies
to the portal (www.wikipedia.org) experiments and the schema we have
deployed there.

A/B testing is a common way of identifying if a proposed change to a
piece of software is actually an improvement or not: it consists of
taking a sample of users and dividing them into two groups, the "A"
and "B" groups (hence the name). One group is consistently given the
experimental change (the "test" group). One group is consistently
given the default experience (the "control" group). Users are
pseudorandomly sorted into the two groups, so that the groups are
roughly equal in size and makeup.
The end outcome for both groups is compared, and the change is
successful if users in the test group are statistically significantly
more likely to experience a better outcome than the users in the
control group.
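
As a rough illustration of what "statistically significantly more
likely" can mean in practice, here is a minimal sketch of a two-sided
two-proportion z-test in Python. It is one common way of comparing two
groups, not necessarily the test we use for the portal; the function
name, its parameters and the idea of a binary "good outcome" per user
are all assumptions made for the example.

```python
import math

def two_proportion_z_test(control_successes, control_n, test_successes, test_n):
    """Hypothetical helper: compare the success rate of two groups.

    Returns (z, p_value) for a two-sided two-proportion z-test.
    Assumes each user either had a "good outcome" or did not, and that
    the pooled rate is strictly between 0 and 1 (no guard for that here).
    """
    p_control = control_successes / control_n
    p_test = test_successes / test_n
    # Pooled rate under the null hypothesis that both groups behave the same.
    pooled = (control_successes + test_successes) / (control_n + test_n)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / control_n + 1 / test_n))
    z = (p_test - p_control) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

A positive z with a small p-value would suggest the test group did
better than control; what counts as "small" is a threshold to pick
before the test starts.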

When we put together the schema for the Portal we did it after months
of experimenting with the Cirrus A/B tests, which means that we tried
to structure it to take into account the lessons we learned there. We
discovered that things were simpler the more fields you had, and that
maintaining a base population who were not participating in any tests
was ideal for dashboarding. Accordingly, the schema tracks every KPI we
care about for the portal and contains a "cohort" field that indicates
if someone is in the "A" group, the "B" group, or no group whatsoever
- with the idea that most users at any one time would be in /no/ group
and we could rely on that population for dashboarding! That way we can
handle everything with one schema.

So the things to remember when setting up Portal tests:

1. The test and control groups should be even.
2. The test and control groups should (together) make up a very small
chunk of the total people getting the logging. 10% combined, say.
3. The test and control groups should both be represented with "cohort"
values, with nothing (to produce a MySQL NULL) for the rest of the
population.

That way we can both test and dashboard simultaneously.
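
To make the three rules concrete, here is a minimal Python sketch of
how a session could be bucketed. It is not the portal's actual sampling
code: the function name, the use of a session identifier, and the
choice of SHA-256 as the pseudorandom source are all assumptions for
illustration. Only the 10% combined figure and the "A" / "B" / NULL
cohort values come from the list above.

```python
import hashlib
from typing import Optional

def assign_cohort(session_id: str, test_fraction: float = 0.10) -> Optional[str]:
    """Hypothetical helper: return "A", "B", or None for a session.

    None stands in for an empty "cohort" field (a MySQL NULL), i.e. the
    base population used for dashboarding.
    """
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Turn the first 8 bytes of the hash into a number between 0 and 1.
    # The same session_id always lands in the same bucket, so a user is
    # consistently given the same experience.
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket >= test_fraction:
        return None  # not in any test
    # Split the sampled slice in half so the two groups are even.
    return "A" if bucket < test_fraction / 2 else "B"
```

With the default of 0.10, roughly 5% of sessions end up in "A", 5% in
"B", and the remaining 90% carry no cohort value.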

--
Oliver Keyes
Count Logula
Wikimedia Foundation

_______________________________________________
discovery mailing list
discovery@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/discovery