Jonathan Morgan <jmorgan@...> writes:
> On Mon, Jul 27, 2015 at 11:02 AM, Ryan Lane wrote:
>> For instance, if a change negatively affects an editor's workflow, it
>> should be reflected in data like "avg/p95/p99 time for x action to
>> occur", where x is some normal editor workflow.
> That is indeed one way you can provide evidence of correlation; but in
> deployments (which are, at best, quasi-experiments), you seldom get
> results that are as unequivocal as the example you're presenting here.
> And quantifying the influence of a single causal factor (such as the
> impact of a particular UI change on time-on-task for this or that
> editing workflow) is much harder.
The idea of A/B tests is to try to isolate things. You won't get perfect
data all of the time, and you'll likely need to rerun experiments with a
narrower focus until you're confident your tests are accurate, but this
is definitely doable in live deployments.
I used editing as an example, but you're right that it's difficult to get
reliable metrics for a lot of editing actions (though it should be a bit
easier in VE). That's of course why I gave a search example previously,
which is much easier to isolate. In fact, most reader-based tests should
be pretty reliable, since the reader feature set is much smaller and the
number of readers is massive. This topic is about skin changes, btw ;).
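The "avg/p95/p99 time for x action" metric quoted above is cheap to
compute once the timings are logged. A minimal sketch, with invented
timings and a simple nearest-rank percentile (this is illustrative, not
any real Wikimedia instrumentation):

```python
# Sketch of the "avg/p95/p99 time for x action" metric. Timings are
# invented; percentile() uses the simple nearest-rank method.
import statistics

def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = round(pct / 100 * len(ordered)) - 1
    return ordered[max(0, min(len(ordered) - 1, rank))]

def summarize(bucket, task_seconds):
    """avg/p95/p99 summary for one bucket of an A/B test."""
    return {
        "bucket": bucket,
        "avg": statistics.mean(task_seconds),
        "p95": percentile(task_seconds, 95),
        "p99": percentile(task_seconds, 99),
    }

# Hypothetical seconds-to-complete for some editor workflow, per bucket.
a_times = [4.1, 3.9, 4.4, 5.0, 4.2, 12.0]  # control UI
b_times = [3.2, 3.1, 3.5, 3.9, 3.3, 11.5]  # changed UI
report = [summarize("A", a_times), summarize("B", b_times)]
```

Comparing the tails (p95/p99) matters because a UI change can leave the
average flat while badly hurting the slowest workflows.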
> Knowing that something occurs isn't the same as knowing why. Take the
> English Wikipedia editor decline. There has been a lot of good research
> on this subject, and we have confidently identified a set of factors
> that are likely contributors. Some of these can be directly measured:
> the decreased retention rate of newcomers; the effect of early,
> negative experiences on newcomer retention; a measurable increase over
> time in phenomena (like reverts, warnings, new article deletions) that
> likely cause those negative experiences. But none of us who have
> studied the editor decline believe that these are the only factors.
> And many community members who have read our research don't even
> accept our premises, let alone our findings.
The best way to solve a complex problem is to first understand it (which
you've done through research), then break it down into small, actionable
parts (you've already mentioned them), then tackle each part by proposing
solutions, implementing them in a testable way, and seeing whether the
results are positive.
The results of changing the warnings gave pretty strong indications that
the new messages moved the retention numbers in a positive direction,
right? Why shouldn't we trust the data there? If the data wasn't good
enough, is there any way to make it more accurate?
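For a retention change like that, a two-proportion z-test is a standard
way to check whether the movement is more than noise. A sketch with
invented counts (the real warning-test numbers aren't in this thread):

```python
# Sketch: is a retention-rate movement more than noise? Two-proportion
# z-test with invented counts, not the actual warning-message data.
import math

def two_proportion_z(kept_a, total_a, kept_b, total_b):
    """z statistic under H0: both buckets have the same retention rate."""
    p_a, p_b = kept_a / total_a, kept_b / total_b
    pooled = (kept_a + kept_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    return (p_b - p_a) / se

# Old warnings: 400 of 5000 newcomers retained; new warnings: 480 of 5000.
z = two_proportion_z(400, 5000, 480, 5000)
significant = abs(z) > 1.96  # two-sided, ~95% confidence
```

If z clears the threshold you still want to sanity-check the
instrumentation, but "the data wasn't good enough" then becomes a claim
you can argue about concretely.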
> I'm not at all afraid of sounding pedantic here (or of writing a
> long-ass wall of text), because I think that many WMF and former-WMF
> participants in this discussion are glossing over important stuff: Yes,
> we need a more evidence-based product design process. But we also need
> a more collaborative, transparent, and iterative deployment process.
> Having solid research and data on the front end of your product
> lifecycle is important, but it's not some kind of magic bullet and is
> no substitute for community involvement in product design (throughout
> the lifecycle).
> We have an excellent Research & Data team. The best one we've ever had
> at WMF. Pound-for-pound, they're as good as or better than the Data
> Science teams at Google or Facebook. None of them would ever claim, as
> you seem to here, that all you need to build good products are
> well-formed hypotheses and access to buckets of log data.
Until the very, very recent past there wasn't even the ability to measure
the simplest of things. There are no real-time or near-real-time
measurements. There are no health dashboards for vital community metrics.
There's no experimentation framework, and without one there are no
run-time controls for product managers to run A/B tests of
feature-flagged features. There are very few analytics events in
MediaWiki.
I don't want to sound negative, and I understand why all of this is the
case, since analytics was poorly resourced, ignored, and managed into the
ground until pretty recently; but Wikimedia isn't at the level of most
early startups when it comes to analytics. Wikimedia does have (and has
historically had) excellent researchers who have been doing amazing work
with insanely small amounts of data and resources.
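The experimentation-framework gap is concrete but not exotic: the
run-time control I mean is basically deterministic bucketing on a hash of
the user ID, so a given user always lands in the same variant of a
feature-flagged test. A hypothetical sketch (flag names and percentages
are made up):

```python
# Hypothetical run-time control for feature-flagged A/B tests: bucketing
# is a deterministic hash of (flag, user), so a user always sees one
# variant, and a PM can dial the percentage up or down without a deploy.
import hashlib

FLAGS = {"new-skin": 10}  # made-up flag name -> rollout percentage

def in_test_bucket(user_id, flag, flags=FLAGS):
    """True if this user falls inside the flag's rollout percentage."""
    pct = flags.get(flag, 0)
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < pct
```

Hashing flag and user together keeps buckets independent across
experiments; storing the flag table somewhere editable at run time is
what makes this a PM-facing control rather than a code change.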
> I had a great conversation with Liam Wyatt at Wikimania (cc'ing him, in
> case he doesn't follow this list). We talked about strategies for
> deploying new products on Wikimedia projects: what works, what doesn't.
> He held up the design/deployment process for Vector as an example of
> good process, one that we should (re)adopt.
> Vector was created based on extensive user research and community
> consultation. Then WMF made a beta, and invited people across projects
> to opt in and try it out on prototype wikis. The product team set
> public criteria for when it would release the product as default across
> production projects: retention of 80% of the beta users who had opted
> in, after a certain amount of time. When a beta tester opted out, they
> were sent a survey to find out why. The product team attempted to
> triage the issues reported in these surveys, address them in the next
> iteration, or (if they couldn't/wouldn't fix them) at least publicly
> acknowledge the feedback. Then they created a phased deployment
> schedule, and stuck to it.
I was on that project (the Usability Initiative) as an ops engineer. I
was hired for it, in fact. I remember that project well, and I wouldn't
call it a major success. It succeeded in changing the default skin to
something slightly more modern than Monobook, but that was the only truly
successful part of the entire project. I think Vector is the only
surviving code from it. The vast majority of Vector features didn't make
it permanently into the Vector skin; mostly what stayed around was the
skin's "look and feel".
The community was a lot more accepting of change then, but it was still a
pretty massive battle. The PM of that project nearly worked herself to
death.
> Whether or not we (WMF) think it is fair that we have to listen to
> "vocal minorities" (Ryan's words), these voices often represent and
> influence the sentiments of the broader, less vocal contributor base in
> important ways. And we won't be able to get people to accept our
> conclusions, however rigorously we demonstrate them or carefully we
> couch them in scientific trappings, if they think we're fundamentally
> incapable of building something worthwhile, or deploying it
> responsibly.
Yeah. Obviously it's necessary not to ship broken or very buggy code, but
that's a different story. It's also a lot easier to know whether your
code is broken when you A/B test it before it's fully shipped: it should
be noticeable from the metrics, or the metrics aren't good enough.
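A guardrail on an error-rate metric is one concrete way an A/B test
catches "broken" before full shipment; the tolerance and rates below are
invented purely for illustration:

```python
# Invented guardrail: block a rollout if the test bucket's error rate is
# more than 10% (relative) worse than control. Rates below are made up.
def blocks_rollout(control_rate, treatment_rate, tolerance=0.10):
    """True if the treatment error rate regresses past the tolerance."""
    return treatment_rate > control_rate * (1 + tolerance)

# Client-side error rate per pageview in each bucket of a skin test.
control_errors_per_view = 0.020
treatment_errors_per_view = 0.035
ship_blocked = blocks_rollout(control_errors_per_view,
                              treatment_errors_per_view)
```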
> We can't run our product development like "every non-enterprise
> software company worth a damn" (Steven's words), and that shouldn't be
> our goal. We aren't a start-up (most of which fail) that can focus all
> our resources on one radical new idea. We aren't a tech giant like
> Google or Facebook, that can churn out a bunch of different beta
> products, throw them at a wall and see what sticks.
What's your proposal that's somehow better than what most of the rest of the
sites on the internet are doing? Maybe you can't do exactly what they're
doing due to lack of resources, but you can at least do the basics.
> And we're not a commercial community-driven site like Quora or Yelp,
> which can constantly monkey with their interface and feature set in
> order to maximize ad revenue or try out any old half-baked strategy to
> monetize their content. There's a fundamental difference between
> Wikimedia and Quora. In Quora's case, a for-profit company built a
> platform and invited people to use it. In Wikimedia's case, a bunch of
> volunteers created a platform, filled it with content, and then a
> non-profit company was created to support that platform, content, and
> community.
I don't understand how you can say this. This is exactly how fundraising
at WMF works, and it's been shown to be incredibly effective. WMF is most
likely the most effective organization in the world at large-scale
small-donation fundraising. It's this way because it constantly tests
changes to see what's more effective, using almost exactly the
methodology I'm describing. Why can't we bring a little bit of this
awesomeness into the rest of the organization?