The idea of A/B tests is to isolate the effect of a single change. You're
not going to get perfect data all of the time, and you'll likely need to
rerun experiments with a narrower focus until you can be confident your
results are accurate, but this is definitely doable in live deployments.
I used editing as an example, but you're right that it's difficult to get
reliable metrics for a lot of editing actions (though it should be a bit
easier in VE). That's of course why I gave a search example previously,
which is much easier to isolate. In fact, most reader-based tests should be
pretty reliable, since the reader feature set is much smaller and the number
of readers is massive. This topic is about skin changes, btw ;).
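To make "isolate" concrete, here's a rough sketch (Python, with made-up
names; nothing here describes an existing WMF system) of how a reader-facing
test like the search one can be run: hash a per-session token into a bucket
so the only difference between the groups is the feature under test, then
tag every event with that bucket so the metric can be compared afterwards.

    import hashlib

    EXPERIMENT = "search-suggestions-v2"   # hypothetical experiment name
    BUCKETS = ["control", "treatment"]

    def assign_bucket(session_token: str) -> str:
        # Hashing makes the assignment deterministic: the same session
        # always lands in the same bucket, so the two groups differ only
        # in the feature being tested.
        digest = hashlib.sha256(
            f"{EXPERIMENT}:{session_token}".encode()).hexdigest()
        return BUCKETS[int(digest, 16) % len(BUCKETS)]

    def log_search_event(session_token: str, clicked_result: bool) -> None:
        # Tag every event with the bucket so click-through rates can be
        # compared between control and treatment later.
        print({"experiment": EXPERIMENT,
               "bucket": assign_bucket(session_token),
               "clicked": clicked_result})

    log_search_event("reader-session-1234", clicked_result=True)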
> Knowing that something occurs isn't the same as knowing why. Take the
> English Wikipedia editor decline. There has been a lot of good research on
> this subject, and we have confidently identified a set of factors that are
> likely contributors. Some of these can be directly measured: the decreased
> retention rate of newcomers; the effect of early, negative experiences on
> newcomer retention; a measurable increase over time in phenomena (like
> reverts, warnings, new article deletions) that likely cause those negative
> experiences. But none of us who have studied the editor decline believe that
> these are the only factors. And many community members who have read our
> research don't even accept our premises, let alone our findings.
>
The best way to solve a complex problem is to first understand the problem
(which you've done through research), then to break it down into small,
actionable parts (you've already mentioned them), and then to tackle each
part by proposing solutions, implementing them in a testable way, and
checking whether the results are positive.
The results of changing the warnings gave pretty strong indications that
the new messages moved the retention numbers in a positive direction, right?
Why shouldn't we trust the data there? And if the data wasn't good enough,
is there any way to make the methodology more accurate?
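To be concrete about what "good enough" would mean: the usual check is
whether the retention difference between the old and new warnings is bigger
than chance alone would explain. A minimal sketch, with invented numbers
since I don't have the real ones in front of me:

    from statistics import NormalDist

    def retention_difference(retained_a, total_a, retained_b, total_b):
        # Two-proportion z-test: how likely is a gap this large by chance?
        p_a, p_b = retained_a / total_a, retained_b / total_b
        pooled = (retained_a + retained_b) / (total_a + total_b)
        se = (pooled * (1 - pooled) * (1 / total_a + 1 / total_b)) ** 0.5
        z = (p_b - p_a) / se
        p_value = 2 * (1 - NormalDist().cdf(abs(z)))
        return p_a, p_b, p_value

    # Invented numbers: 1,200 newcomers saw the old warnings, 1,180 the new.
    print(retention_difference(96, 1200, 131, 1180))
    # -> roughly (0.080, 0.111, 0.010); a p-value that low means the
    #    improvement is very unlikely to be noise.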
Until the very, very recent past there wasn't even the ability to measure
the simplest of things. There are no real-time or near-real-time
measurements. There are no health dashboards for vital community metrics.
There's no experimentation framework, and since there's no experimentation
framework there are no run-time controls for product managers to run A/B
tests of feature-flagged features. There are very few analytics events in
MediaWiki.
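To illustrate what I mean by "run-time controls", something as basic as the
sketch below would already be a step forward: a flag a product manager can
flip or ramp up without a deploy, plus an event emitted on every exposure.
The flag name, percentage, and event shape are all made up; this is not
MediaWiki code, which is exactly the point.

    import hashlib, json, time

    # Made-up flag store; in a real system this would live outside the code
    # so it can be changed at run time without a deploy.
    FLAGS = {"new-typography": {"enabled": True, "rollout_percent": 5}}

    def feature_enabled(flag_name: str, session_token: str) -> bool:
        flag = FLAGS.get(flag_name)
        if not flag or not flag["enabled"]:
            return False
        # Hash the session token so the ramp-up is sticky per user.
        digest = hashlib.sha256(
            f"{flag_name}:{session_token}".encode()).hexdigest()
        return int(digest, 16) % 100 < flag["rollout_percent"]

    def emit_event(name: str, **fields) -> None:
        # Stand-in for an analytics event pipeline.
        print(json.dumps({"event": name, "ts": time.time(), **fields}))

    if feature_enabled("new-typography", "reader-session-1234"):
        emit_event("feature_exposure", feature="new-typography")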
I don't want to sound negative, and I understand why all of this is the
case: analytics was poorly resourced, ignored, and managed into the ground
until pretty recently. But Wikimedia isn't at the level of most early
startups when it comes to analytics.
Wikimedia does have (and has historically had) excellent researchers who
have been doing amazing work with insanely little data and infrastructure.
I was on that project (the Usability Initiative) as an ops engineer; I was
hired for it, in fact. I remember that project well and I wouldn't call it a
major success. It did succeed in changing the default skin to something
slightly more modern than Monobook, but that was the only truly successful
part of the entire project. I think Vector is the only surviving code from
it. The vast majority of the features built for Vector didn't make it
permanently into the skin; mostly what stayed around was its "look and
feel".
The community was a lot more accepting of change then, but it was still a
pretty massive battle. The PM of that project nearly worked herself to death.
> Whether or not we (WMF) think it is fair that we have to listen to "vocal
minorities" (Ryan's words), these voices often represent and influence the
sentiments of the broader, less vocal, contributor base in important ways.
And we won't be able to get people to accept our conclusions, however
rigorously we demonstrate them or carefully we couch them in scientific
trappings, if they think we're fundamentally incapable of building something
worthwhile, or deploying it responsibly.
Yeah. Obviously it's necessary not to ship broken or very buggy code, but
that's a different story. It's also a lot easier to know whether your code
is broken when you A/B test it before it ships: breakage should be
noticeable from the metrics, and if it isn't, the metrics aren't good
enough.
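By "noticeable from the metrics" I mean something as simple as a guardrail
check while the test is live: alongside the metric you're trying to move,
watch a couple of health metrics in the treatment bucket and bail out if
they regress. The metric names and thresholds below are invented, just to
show the shape of it:

    # Invented metric names and thresholds, purely to illustrate the idea.
    GUARDRAILS = {
        "client_error_rate": 0.01,  # abort if >1% of pageviews hit a JS error
        "edit_abandon_rate": 0.35,  # abort if >35% of started edits stall
    }

    def should_abort(treatment_metrics: dict) -> bool:
        # True if any guardrail metric in the treatment bucket exceeds its
        # limit.
        return any(
            treatment_metrics.get(name, 0.0) > limit
            for name, limit in GUARDRAILS.items()
        )

    print(should_abort({"client_error_rate": 0.004,
                        "edit_abandon_rate": 0.41}))  # True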
> We can't run our product development like "every non-enterprise software
> company worth a damn" (Steven's words), and that shouldn't be our goal. We
> aren't a start-up (most of which fail) that can focus all our resources on
> one radical new idea. We aren't a tech giant like Google or Facebook, that
> can churn out a bunch of different beta products, throw them at a wall and
> see what sticks.
What's your proposal that's somehow better than what most other sites on
the internet are doing? Maybe you can't do exactly what they're doing due to
lack of resources, but you can at least do the basics.
> And we're not a commercial community-driven site like Quora or Yelp, which
> can constantly monkey with their interface and feature set in order to
> maximize ad revenue or try out any old half-baked strategy to monetize its
> content. There's a fundamental difference between Wikimedia and Quora. In
> Quora's case, a for-profit company built a platform and invited people to
> use it. In Wikimedia's case, a bunch of volunteers created a platform,
> filled it with content, and then a non-profit company was created to support
> that platform, content, and community.
>
I don't understand how you can say this. This is exactly how fundraising at
WMF works, and it's been shown to work incredibly well. WMF is most likely
the most effective organization in the world at large-scale, small-donation
fundraising. It got that way by constantly testing changes to see what
performs better, using almost exactly the methodology I'm describing. Why
can't we bring a little bit of that awesomeness into the rest of the
engineering organization?