I understand that we are shifting to a "minimum 2-week" cadence, but I'm
not sure exactly what that means. Reading Mikhail's email, it sounds like
we plan to run each test for one week, and then have one week "off" to
analyze those results and to prepare for the following test. Is that true?
Regardless of those details, would it be helpful to have a "recipe" for
each test? To know that on Day T-7, we would be thinking about X, and by
Day T-4, we had better have Y in place. And then to expect Z by day T+8.
Basically, to document all the little steps that might be necessary or
optional before, during, and after a test.
If that seems helpful, I can create a phab task to create and populate a
wiki page with that kind of information. Obviously the population of that
page would have to be a group effort, with input from product, engineering,
analysis, and possibly others.
Agile Coach, Wikimedia Foundation
Cross-posting from wikitech-l. Please reply there.
---------- Forwarded message ----------
From: Dan Garry <dgarry(a)wikimedia.org>
Date: 1 September 2015 at 20:43
Subject: Discovery Department A/B testing an alternative to prefix search
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
*tl;dr: Discovery Department to run A/B test
<https://phabricator.wikimedia.org/T111078> comparing new search suggester
to prefix search, to see if it can reduce zero results rate.*
As I'm sure you're all aware, the search box at the top right of every page
on desktop uses prefix search to generate its results. The main reason for
this is that prefix search is incredibly fast and performant; that search
box sees a lot of traffic, and it's important to keep it scalable.
However, we know that there are numerous problems with prefix search.
Prefix searches are prone to give you no results; if you make even a slight
typo, then you won't get the result you want. And thus a complex system of
manually curated redirects was born to try to alleviate this navigation
issue. Wouldn't it be nice if we could work towards a solution that doesn't
require the manual curation of redirects, thus freeing up Wikimedians to do
other more meaningful tasks? And make search a bit better in the process,
too? That's a long term goal of mine... emphasis on the long. ;-)
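
To make the typo problem concrete, here is a minimal sketch in Python. It is not the Discovery suggester or the real prefix search code: the title list is made up, and the "fuzzy" matcher is just a stand-in built on the standard library's difflib, used only to show how a strict prefix match returns zero results on a small typo while a typo-tolerant match does not.

```python
from difflib import SequenceMatcher

# Hypothetical page titles, for illustration only.
TITLES = ["Albert Einstein", "Alberta", "Albania", "Algebra"]


def prefix_search(query, titles):
    """Return titles that start with the query (case-insensitive)."""
    q = query.lower()
    return [t for t in titles if t.lower().startswith(q)]


def fuzzy_prefix_search(query, titles, cutoff=0.8):
    """Typo-tolerant stand-in: compare the query against the same-length
    head of each title and keep titles whose similarity reaches the cutoff.
    This is an assumption for the demo, not how the new API works."""
    q = query.lower()
    hits = []
    for title in titles:
        head = title.lower()[:len(q)]
        if SequenceMatcher(None, q, head).ratio() >= cutoff:
            hits.append(title)
    return hits


exact = prefix_search("Albert E", TITLES)          # correct spelling
typo = prefix_search("Albret E", TITLES)           # transposed letters
tolerant = fuzzy_prefix_search("Albret E", TITLES)

print(exact)     # the correctly spelled query finds the page
print(typo)      # the typo yields an empty list: a "zero results" query
print(tolerant)  # the tolerant matcher still finds the intended page
```

The redirects mentioned above exist to paper over exactly the second case, one misspelling at a time.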
The Q1 2015-16 (Jul - Sep 2015) goal of the Search Team in the Discovery
Department is to reduce the zero results rate.
Amongst other things, we've been working to build an alternative to prefix
search <https://phabricator.wikimedia.org/T105746>. Documentation on the
API is pretty light right now because we're scrambling to get it up and
running (but there's a task for that!).
An initial version of the suggestion API is now in production on enwiki and
dewiki, but is currently not being used for anything. Our initial tests
<https://phabricator.wikimedia.org/T109729> of the API show that it's
incredibly promising for reducing the zero results rate. But we need more
data.
We're planning on running an A/B test on whether this API is better at
reducing zero results. We're targeting beginning on Tuesday 8th September,
for two weeks. This is documented in T111078.
A very important note here is that we currently have no way of
quantitatively measuring result relevance (although we're working on it
<https://phabricator.wikimedia.org/T109482>), so this test will be highly
limited in scope, only measuring the zero results rate. Given the limits of
this, even seeing massive success in this test is not enough to deploy this
API as a full replacement of prefix search; we'd need additional data. But,
that's not stopping us from gathering initial data from this test.
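
As a rough illustration of what "only measuring the zero results rate" could look like once the data is in, here is a sketch of a two-proportion z-test in Python. The email does not say which analysis method will be used, so the choice of test is an assumption, and the counts below are invented purely for the example.

```python
from math import sqrt
from statistics import NormalDist


def compare_zero_results_rates(zero_a, total_a, zero_b, total_b):
    """Two-sided two-proportion z-test on zero results rates.

    zero_x  = number of queries in bucket x that returned no results
    total_x = number of queries in bucket x
    Returns (rate_a, rate_b, z, p_value).
    """
    rate_a = zero_a / total_a
    rate_b = zero_b / total_b
    # Pooled rate under the null hypothesis that both buckets share one rate.
    pooled = (zero_a + zero_b) / (total_a + total_b)
    se = sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (rate_a - rate_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return rate_a, rate_b, z, p_value


# Made-up counts: bucket A = prefix search, bucket B = new suggester.
rate_a, rate_b, z, p_value = compare_zero_results_rates(3000, 10000, 2500, 10000)
print(f"prefix: {rate_a:.1%}  suggester: {rate_b:.1%}  z={z:.2f}  p={p_value:.3g}")
```

Note that a significant drop here would say nothing about whether the extra results are relevant, which is the limitation described above.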
As always, if you have any questions, let me know.
Note: The API is actually live on all wikis, but we only built the search
indices for enwiki and dewiki since they're our biggest content wikis and
this is an early test. Attempting to use the API on any other wiki will get
you a cirrus backend error.
Lead Product Manager, Discovery
Just as an FYI, next Thursday Discovery's UX sub-team will start having
weekly meetings, to groom the backlog and plan work for the week.
For now, these will include Moiz and Dan, with Tomasz and Wes optional. As
additional UX folks are hired, we'll add them, and we will also consider
bringing in other people as needed. This is reflected on the process
As a reminder, the UX sub-team has its own phabricator sprint board.
It's not being used heavily yet, but that may change over the next few
Agile Coach, Wikimedia Foundation
Last week we discussed our approach to A/B testing and we've decided to
have a week (at least) between tests.
A two-week-minimum cadence will give the analysis team enough time to
thoroughly think about the experimental design of each test, as well as
give the engineers enough time to implement it. Which is great because some
of the changes we are planning to test are not trivial and we don't want to
rush a test out and realize halfway through that we should have been
tracking something we're not.
We are also going to move away from doing initial analyses (analysis of the
data from the morning of a launch) for practical and scientific reasons.
Practical in the sense that we've been putting time and effort into getting
preliminary results that are not at all representative of the final results,
while putting other work on the back burner. Scientific in the sense that
peeking at the data mid-experiment is bad science:
*Repeated significance testing always increases the rate of false
positives, that is, you’ll think many insignificant results are significant
(but not the other way around). The problem will be present if you ever
find yourself “peeking” at the data and stopping an experiment that seems
to be giving a significant result. The more you peek, the more your
significance levels will be off. For example, if you peek at an ongoing
experiment ten times, then what you think is 1% significance is actually
just 5% significance.* – Evan Miller, How Not To Run An A/B Test
In statistics, this is known as the multiple comparisons problem. The more
tests you perform, the more likely you are to see something where there is
nothing.
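
For anyone who wants to see the peeking effect for themselves, here is a small simulation sketch in Python. It runs many A/A experiments (both groups drawn from the same distribution, so any "significant" result is a false positive) and compares two policies: stopping as soon as any interim look is significant, versus testing once at the end. The sample sizes, number of looks, and the simple z-test with known variance are all assumptions chosen to keep the example short.

```python
import random
from math import sqrt
from statistics import NormalDist


def false_positive_rates(n_experiments=2000, n_peeks=10, batch=50,
                         alpha=0.05, seed=0):
    """Simulate A/A tests and return (rate_with_peeking, rate_final_only)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    peeking_hits = 0
    final_hits = 0
    for _experiment in range(n_experiments):
        sum_a = sum_b = 0.0
        n = 0
        crossed = False
        significant = False
        for _peek in range(n_peeks):
            # Collect another batch for each group; there is no true difference.
            for _obs in range(batch):
                sum_a += rng.gauss(0, 1)
                sum_b += rng.gauss(0, 1)
            n += batch
            # z-test for a difference in means with known unit variance.
            z = (sum_a / n - sum_b / n) / sqrt(2 / n)
            significant = abs(z) > z_crit
            crossed = crossed or significant
        peeking_hits += crossed      # "significant" at any of the looks
        final_hits += significant    # significant at the last look only
    return peeking_hits / n_experiments, final_hits / n_experiments


with_peeking, final_only = false_positive_rates()
print(f"false positives when peeking:     {with_peeking:.1%}")
print(f"false positives, final look only: {final_only:.1%}")
```

The final-look rate should land close to the nominal alpha, while the peeking rate should come out noticeably higher.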
Going forward, we are going to wait until we have collected all the data
before analyzing it.
Mikhail, Junior Swifty
Discovery // The Swifties