Last week we discussed our approach to A/B testing and decided to
have two weeks (at least) between tests.
A two-week-minimum cadence will give the analysis team enough time to
thoroughly think about the experimental design of each test, as well as
give the engineers enough time to implement it. This is great because some
of the changes we are planning to test are not trivial, and we don't want to
rush a test out and realize halfway through that we should have been
tracking something we're not.
We are also going to move away from doing initial analyses (analysis of the
data from the morning of a launch) for practical and scientific reasons.
Practical in the sense that we've been putting time and effort into getting
preliminary results that are not at all representative of the final results,
while putting other work on the back burner. Scientific in the sense that
peeking at the data mid-experiment is bad science:
*Repeated significance testing always increases the rate of false
positives, that is, you’ll think many insignificant results are significant
(but not the other way around). The problem will be present if you ever
find yourself “peeking” at the data and stopping an experiment that seems
to be giving a significant result. The more you peek, the more your
significance levels will be off. For example, if you peek at an ongoing
experiment ten times, then what you think is 1% significance is actually
just 5% significance.* – Evan Miller, How Not To Run An A/B Test
In statistics, this is known as the multiple comparisons problem: the more
tests you perform, the more likely you are to see something where there is
nothing.
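If you want to see the effect rather than take the quote on faith, here is a
small simulation sketch I put together (made-up rates, sample sizes, and
peeking schedule, not our data): it runs A/A tests where the two buckets are
identical, peeks once a day with a two-proportion z-test, and stops at the
first "significant" result. The share of runs that stop early is the false
positive rate, and it comes out well above the nominal 5%.

# Illustrative simulation only (not our analysis code): how daily peeking
# inflates the false positive rate when there is NO real difference.
import numpy as np

rng = np.random.default_rng(0)
n_experiments = 2000   # simulated A/A tests; both buckets share one true rate
n_per_day = 1000       # observations per bucket per day (made-up number)
days = 10              # we "peek" at the end of every day
true_rate = 0.3        # made-up underlying click/conversion rate

false_positives = 0
for _ in range(n_experiments):
    a = rng.binomial(1, true_rate, size=(days, n_per_day))
    b = rng.binomial(1, true_rate, size=(days, n_per_day))
    for day in range(1, days + 1):
        # two-proportion z-test on everything accumulated so far
        xa, xb = a[:day].sum(), b[:day].sum()
        n = day * n_per_day
        p = (xa + xb) / (2 * n)              # pooled proportion
        se = np.sqrt(2 * p * (1 - p) / n)    # standard error of the difference
        z = abs(xa / n - xb / n) / se
        if z > 1.96:                         # "significant" at the 5% level
            false_positives += 1
            break                            # the peeker stops the test here

print(false_positives / n_experiments)       # noticeably larger than 0.05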
Going forward, we are going to wait until we have collected all the data
before analyzing it.
Mikhail, Junior Swifty
Discovery // The Swifties
As a reminder, the entire Discovery department will be participating in the
Gerrit Cleanup Day on Wednesday, September 23 (forwarded announcement below).
As it gets closer, we can make sure everyone has a plan for how to use the
day most productively.
Agile Coach, Wikimedia Foundation
---------- Forwarded message ----------
From: Andre Klapper <aklapper(a)wikimedia.org>
Date: Mon, Aug 31, 2015 at 3:24 PM
Subject: [Engineering] Save the date: Gerrit Cleanup Day: Wed, Sep 23
To: Development and Operations Engineers <engineering(a)lists.wikimedia.org>
I'm happy to announce a Gerrit Cleanup Day on Wed, September 23.
It's an experiment to reduce Wikimedia's code review backlog, which hampers
the growth of our long-term code contributor base.
All dev/eng teams are expected to join and use the day primarily to review
recently submitted open Gerrit changesets that have not yet received a
review, focusing on volunteer contributions.
Please save the date! The event is also in the "WMF Engineering" Google
Calendar, so you can copy it to your personal calendar.
https://phabricator.wikimedia.org/T88531 provides more information,
steps, and links. Note that it is still a work in progress.
Your questions and feedback are welcome.
Andre Klapper | Wikimedia Bugwrangler
This is a gentle reminder that if your task doesn't exist as a Phabricator
ticket, then the person/team who can get it done doesn't know there's an
actual need for it.
Filing tickets lets us avoid situations like "Person A said they think
Person B is working on Thing X" while B has no idea X was even a thing.
While doing CR for
I came to have serious doubts about this approach.
In brief, it attempts to track user satisfaction with search results by
measuring how long people stay on pages. It does that by appending
fromsearch=1 to links for 0.5% of users. However, this results in page
views being uncached, increasing HTML load time by a factor of 4-5
and, consequently, kicking even short pages' first paint outside the
comfort zone of 1 second - and that's measured from the office, with a ping
of 2-3 ms to ulsfo. My concern is that as a result we're trying to measure
the very metric we're screwing with, leaving the experiment compromised.
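To illustrate the caching point for anyone less familiar with it - a toy
sketch only, not our actual Varnish setup: a cache keyed on the full URL
treats the tagged link as a brand-new object, so every fromsearch=1 request
goes back to the slow backend path.

# Toy illustration only - a URL-keyed cache misses on the tagged links.
cache = {}

def render_slowly(url):
    # stands in for a full backend parse/render, the expensive path
    return "<html>...</html>"

def fetch(url):
    if url in cache:
        return "HIT"
    cache[url] = render_slowly(url)
    return "MISS"

print(fetch("/wiki/Foo"))                # MISS - first request warms the cache
print(fetch("/wiki/Foo"))                # HIT
print(fetch("/wiki/Foo?fromsearch=1"))   # MISS again: different cache key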
Can we come up with a less intrusive way of measuring this, or alter the
requirements of the experiment?
Max Semenik ([[User:MaxSem]])
I've re-run my "big" wiki zero result rate numbers to see what has changed
in the last month. The results are here:
Since I was only looking at the big 52 wikis (100K+ articles), the zero
results rate is under 20% (good news), but it hasn't gone down in a month.
I looked very briefly at the full text zero results rate for the rest of
the wikis for yesterday. That zero results rate was 56.6%! Lots of hits
from wikidata and itwiki-things! There were some DOI queries, but none of
the other usual suspects.
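For anyone new to the metric, here's a minimal sketch of how a zero results
rate is computed - toy data and made-up column names, not the actual search
logs:

import pandas as pd

# toy stand-in for a day of search requests (column names are made up)
searches = pd.DataFrame({
    "wiki":        ["enwiki", "enwiki", "itwiki", "wikidatawiki"],
    "num_results": [12, 0, 0, 3],
})

# overall zero results rate: share of searches that returned nothing
print((searches["num_results"] == 0).mean())

# per-wiki breakdown, analogous to the per-wiki numbers reported above
print(searches.groupby("wiki")["num_results"].apply(lambda s: (s == 0).mean()))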
Software Engineer, Discovery
FYI just in case it's of interest and hasn't shown up on the team's radar yet:
Quote from the abstract:
"This paper discusses expressivity and accuracy of the By-Example
Structured (BESt) Query paradigm implemented on the SWiPE system
through the Wikipedia interface. We define an experimental setting
based on the natural language questions made available by the QALD-4
challenge, in which we compare SWiPE against Xser, a state-of-the-art
Question Answering system, and plain keyword search provided by the
Wikipedia Search Engine. The experiments show that SWiPE outperforms
the results provided by Wikipedia, and it also performs sensibly
better than Xser, obtaining an overall 85% of totally correct answers
vs. 68% of Xser."
(For context, there's an earlier paper describing a previous version of that
SWiPE - "Search Wikipedia by example" - project:
IRC (Freenode): HaeB
Several weeks ago we ran an A/B test to try and decrease the number of
searches on Wikipedia returning zero results. This consisted of a
small config change that reduced the confidence needed for our systems
to provide search results, along with a change to the smoothing
algorithm used to bump the quality of the results now provided.
Our intent was not only to reduce the zero results rate but also to
prototype the actual process of A/B testing and identify issues we
could fix to make future tests easier and more reliable - this was the
first A/B test we had run.
5% of searches were registered as a control group, and an additional
5% subject to the reduced confidence and smoothing algorithm change.
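For concreteness, here is one common way that kind of bucketing is done - a
sketch only, with a made-up identifier, and not necessarily how our sampling
is actually implemented:

import hashlib

def bucket(identifier: str) -> str:
    # hash a stable identifier (e.g. a session token) into [0, 1)
    h = int(hashlib.sha256(identifier.encode()).hexdigest(), 16)
    u = (h % 10_000) / 10_000
    if u < 0.05:
        return "control"   # default behaviour, but logged for comparison
    if u < 0.10:
        return "test"      # reduced confidence + new smoothing algorithm
    return "rest"          # everyone else, untouched and unlogged

print(bucket("session-abc123"))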
An initial analysis over the first day's 7m events was run on 7
August, and a final analysis looking at an entire week of data was
completed yesterday. You can read the full results at
Based on what we've seen we conclude that there is, at best, a
negligible effect from this change - and it's hard to tell if there
even is an effect at all. Accordingly, we recommend the default
behaviour be used for all users, and the experiment disabled.
This may sound like a failure, but it's actually not. For one thing,
we've learned that the config variables here probably aren't our avenue
for dramatic changes - the defaults are pretty sensible. If we're
looking for dramatic changes in the rate, we should be applying
dramatic changes in the system's behaviour.
In addition, we identified a lot of process issues we can fix for the
next round of A/B tests, making it easier to analyse the data that
comes in and making the results easier to rely on. These include:
1. Who the gods would destroy, they first bless with real-time
analytics. The dramatic difference between the outcome of the initial
and final analysis speaks partly to the small size of the effect seen
and the power analysis issues mentioned below, but it's also a good
reminder that a single day of data, however many datapoints it
contains, is rarely the answer. User behaviour varies dramatically
depending on the day of the week or the month of the year - a week
should be seen as the minimum testing period for us to have any
confidence in what we see.
2. Power analysis is a must-have. Our hypothesis for the negligible
and totally varying size of the effect is simply the amount of data we
had; when you're looking at millions upon millions of events for a
pair of options, you're going to see patterns - because with enough
data stared at for long enough, pretty much anything can happen. That
doesn't mean it's /real/. In the future we need to be setting our
sample size using proper, a priori power analysis (see the sketch below) - or
switching to Bayesian methods where this sort of problem doesn't appear.
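As a sketch of what that a priori step could look like - placeholder rates
and thresholds, assuming a two-proportion comparison like the zero results
rate:

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# how many searches per bucket to reliably detect a drop in the
# zero results rate from 20% to 19%? (placeholder numbers)
effect = proportion_effectsize(0.20, 0.19)   # Cohen's h for the two rates
n = NormalIndPower().solve_power(
    effect_size=effect,
    alpha=0.05,            # tolerated false positive rate
    power=0.80,            # chance of detecting the effect if it is real
    alternative="two-sided",
)
print(round(n))            # required sample size per bucket

Run before launch, that number tells us how long a test actually has to run,
instead of guessing from how the numbers look mid-flight.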
These are not new lessons to the org as a whole (at least, I hope not)
but they are nice reminders, and I hope that sharing them allows us to
start building up an org-wide understanding of how we A/B test and the
costs of not doing things in a very deliberate way.
On Tue, Aug 25, 2015 at 7:58 AM, Oliver Keyes <okeyes(a)wikimedia.org> wrote:
> So it's a comparison of two search systems, neither of which we use?
It's a comparison of three search systems, one of which is our
in-built search, which they claim is inferior to their system in that
setting ("The experiments show that SWiPE outperforms the results
provided by Wikipedia").
IRC (Freenode): HaeB
Management is requesting that foundation engineering teams consider
tracking how much time we are spending on "maintenance" vs. "new work". As
a first step in that direction, we have been asked to categorize our work
into those two buckets. This should not require a lot of effort and, for
most of you (non-leads), won't require any effort at all, at least for now.
But I wanted you to be aware of it.
You'll start to see phab tasks tagged with one of two new tags, which will
probably be #WorkTypeMaintenance and #WorkTypeNewFunctionality. You can
ignore them in your day-to-day work.
The idea is to experiment with this approach for a while, and then evaluate
whether it should be continued, modified, or dropped.
Agile Coach, Wikimedia Foundation