Jonathan Morgan <jmorgan@...> writes:
On Mon, Jul 27, 2015 at 11:02 AM, Ryan Lane
rlane32-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org wrote:
For instance, if a change negatively affects an editor's workflow, it should be reflected in data like "avg/p95/p99 time for x action to occur", where x is some normal editor workflow.
That is indeed one way you can provide evidence of correlation; but in
live deployments (which are, at best, quasi-experiments), you seldom get results that are as unequivocal as the example you're presenting here. And quantifying the influence of a single causal factor (such as the impact of a particular UI change on time-on-task for this or that editing workflow) is even harder.
The idea of A/B tests is to try to isolate things. You're not going to get perfect data all of the time and you'll likely need to retry experiments with more focus until you can be assured your tests are accurate, but this is definitely doable in live deployments.
I used editing as an example, but you're right that it's difficult to get reliable metrics for a lot of editing actions (though it should be a bit easier in VE). That's of course why I gave a search example previously, which is much easier to isolate. In fact, most reader-based tests should be pretty reliable, since the reader feature set is much smaller and the number of readers is massive. This topic is about skin changes, btw ;).
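To make the "avg/p95/p99 time for x action" idea concrete, here's a rough sketch of how the two buckets of such a test could be compared. Everything in it is made up for illustration (the data is synthetic, the function names aren't from any existing Wikimedia tooling), and using a Mann-Whitney test instead of a t-test is just my assumption that timing data is long-tailed:

import numpy as np
from scipy import stats

def summarize(samples_ms):
    """Avg/p95/p99 of a set of action durations, in milliseconds."""
    a = np.asarray(samples_ms, dtype=float)
    return {"n": a.size, "avg": a.mean(),
            "p95": np.percentile(a, 95), "p99": np.percentile(a, 99)}

def compare_buckets(control_ms, treatment_ms):
    """Summarize both buckets and test whether the distributions differ."""
    result = {"control": summarize(control_ms),
              "treatment": summarize(treatment_ms)}
    # Mann-Whitney rather than a t-test: timing data is rarely normal.
    _, p_value = stats.mannwhitneyu(control_ms, treatment_ms,
                                    alternative="two-sided")
    result["p_value"] = p_value
    return result

# Synthetic example: the treatment bucket is roughly 10% slower.
rng = np.random.default_rng(0)
control = rng.lognormal(mean=6.0, sigma=0.5, size=5000)    # ~400 ms median
treatment = rng.lognormal(mean=6.1, sigma=0.5, size=5000)
print(compare_buckets(control, treatment))

If a skin change pushed p95 or p99 up in the treatment bucket with a small p-value, that's exactly the kind of regression signal I'm talking about.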
Knowing that something occurs isn't the same as knowing why. Take the
English Wikipedia editor decline. There has been a lot of good research on this subject, and we have confidently identified a set of factors that are likely contributors. Some of these can be directly measured: the decreased retention rate of newcomers; the effect of early, negative experiences on newcomer retention; a measurable increase over time in phenomena (like reverts, warnings, new article deletions) that likely cause those negative experiences. But none of us who have studied the editor decline believe that these are the only factors. And many community members who have read our research don't even accept our premises, let alone our findings.
The best way to solve a complex problem is to first understand it (which you've done through research), then break it down into small, actionable parts (you've already mentioned them), then tackle each part by proposing solutions, implementing them in a testable way, and seeing whether the results are positive.
The results of changing the warnings gave pretty strong indications that the new messages moved the retention numbers in a positive way, right? Why shouldn't we trust the data there? If the data wasn't good enough, is there any way to make it more accurate?
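To be clear about what I mean by trusting (or distrusting) the data, here's a minimal sketch of the kind of check involved. The counts are invented and this is a plain two-proportion z-test, not necessarily the methodology the actual study used:

from math import sqrt
from scipy.stats import norm

def retention_ztest(retained_a, total_a, retained_b, total_b):
    """Two-sided z-test for a difference in newcomer retention rates."""
    p_a = retained_a / total_a
    p_b = retained_b / total_b
    pooled = (retained_a + retained_b) / (total_a + total_b)
    se = sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_b - p_a) / se
    return p_a, p_b, z, 2 * norm.sf(abs(z))

# Invented numbers: 1,100 of 20,000 newcomers retained under the old
# warning templates vs. 1,300 of 20,000 under the new ones.
p_old, p_new, z, p = retention_ztest(1100, 20000, 1300, 20000)
print(f"old={p_old:.1%} new={p_new:.1%} z={z:.2f} p={p:.4f}")

If the observed difference clears a test like this, the reasonable default is to act on it; if it doesn't, that tells you where to tighten the instrumentation or grow the sample.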
I'm not at all afraid of sounding pedantic here (or of writing a long-ass
wall of text), because I think that many WMF and former-WMF participants in this discussion are glossing over important stuff: Yes, we need a more evidence-based product design process. But we also need a more collaborative, transparent, and iterative deployment process. Having solid research and data on the front-end of your product lifecycle is important, but it's not some kind of magic bullet and is no substitute for community involvement in product design (through the lifecycle).
We have an excellent Research & Data team. The best one we've ever had at
WMF. Pound-for-pound, they're as good as or better than the Data Science teams at Google or Facebook. None of them would ever claim, as you seem to here, that all you need to build good products are well-formed hypotheses and access to buckets of log data.
Until the very, very recent past there wasn't even the ability to measure the simplest of things. There are no real-time or near-real-time measurements. There are no health dashboards for vital community metrics. There's no experimentation framework, and since there's no experimentation framework, there are no run-time controls for product managers to run A/B tests of feature-flagged features. There are very few analytics events in MediaWiki.
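And the bucketing piece of an experimentation framework doesn't have to be a huge project. Purely as a hypothetical sketch (this is not an existing MediaWiki or WMF API, and "vector-refresh" is an invented flag name), a feature flag whose rollout percentage a product manager could change at run time might look like:

import hashlib

class FeatureFlag:
    def __init__(self, name, treatment_percent):
        self.name = name
        # 0-100; adjustable at run time, no deploy needed.
        self.treatment_percent = treatment_percent

    def bucket(self, user_id):
        """Deterministically assign a user to 'treatment' or 'control'."""
        key = f"{self.name}:{user_id}".encode("utf-8")
        # Hash to a slot in [0, 100); slots below the threshold get treatment.
        slot = int(hashlib.sha256(key).hexdigest(), 16) % 100
        return "treatment" if slot < self.treatment_percent else "control"

new_skin = FeatureFlag("vector-refresh", treatment_percent=5)
print(new_skin.bucket(12345), new_skin.bucket(67890))

Hashing on the flag name plus the user ID keeps assignment stable across requests, and every analytics event just needs to carry the bucket label so metrics like the ones above can be split by variant.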
I don't want to sound negative, and I understand why all of this is the case: analytics was poorly resourced, ignored, and managed into the ground until pretty recently. But Wikimedia isn't at the level of most early-stage startups when it comes to analytics.
Wikimedia does have (and has historically had) excellent researchers who have been doing amazing work with insanely small amounts of data and infrastructure.
I had a great conversation with Liam Wyatt at Wikimania (cc'ing him, in
case he doesn't follow this list). We talked about strategies for deploying new products on Wikimedia projects: what works, what doesn't. He held up the design/deployment process for Vector as an example of good process, one that we should (re)adopt.
Vector was created based on extensive user research and community
consultation[1]. Then WMF made a beta, and invited people across projects to opt-in and try it out on prototype wikis[2]. The product team set public criteria for when it would release the product as default across production projects: retention of 80% of the Beta users who had opted in, after a certain amount of time. When a beta tester opted out, they were sent a survey to find out why[3]. The product team attempted to triage the issues reported in these surveys, address them in the next iteration, or (if they couldn't/wouldn't fix them), at least publicly acknowledge the feedback. Then they created a phased deployment schedule, and stuck to it[4].
I was on that project (the Usability Initiative) as an ops engineer. I was hired for it, in fact. I remember that project well, and I wouldn't call it a major success. It was successful in that it changed the default skin to something slightly more modern than Monobook, but that was the only truly successful part of the entire project. I think Vector is the only surviving code from it. The vast majority of the features built during the project didn't make it permanently into the Vector skin; mostly what stayed around was the "look and feel" of the skin.
The community was a lot more accepting of change then, but it was still a pretty massive battle. The PM of that project nearly worked herself to death.
Whether or not we (WMF) think it is fair that we have to listen to "vocal
minorities" (Ryan's words), these voices often represent and influence the sentiments of the broader, less vocal, contributor base in important ways. And we won't be able to get people to accept our conclusions, however rigorously we demonstrate them or carefully we couch them in scientific trappings, if they think we're fundamentally incapable of building something worthwhile, or deploying it responsibly.
Yeah. Obviously it's necessary not to ship broken or very buggy code, but that's a different story. It's also a lot easier to know whether your code is broken when you A/B test it before it ships: it should be noticeable from the metrics, or the metrics aren't good enough.
We can't run our product development like "every non-enterprise software
company worth a damn" (Steven's words), and that shouldn't be our goal. We aren't a start-up (most of which fail) that can focus all our resources on one radical new idea. We aren't a tech giant like Google or Facebook, that can churn out a bunch of different beta products, throw them at a wall and see what sticks.
What's your proposal that's somehow better than what most other sites on the internet are doing? Maybe you can't do exactly what they're doing due to lack of resources, but you can at least do the basics.
And we're not a commercial community-driven site like Quora or Yelp, which
can constantly monkey with its interface and feature set in order to maximize ad revenue or try out any old half-baked strategy to monetize its content. There's a fundamental difference between Wikimedia and Quora. In Quora's case, a for-profit company built a platform and invited people to use it. In Wikimedia's case, a bunch of volunteers created a platform, filled it with content, and then a non-profit company was created to support that platform, content, and community.
I don't understand how you can say this. This is exactly how fundraising at WMF works and it's been shown to be incredibly effective. WMF is most likely the most effective organization in the world at large-scale small donations. It's this way because it constantly tests changes to see what's more effective. It does this using almost exactly the methodology I'm describing. Why can't we bring a little bit of this awesomeness into the rest of the engineering organization?
- Ryan