On Mon, Jul 27, 2015 at 11:02 AM, Ryan Lane <rlane32@gmail.com> wrote:

> For instance, if a change negatively affects an editor's workflow, it should
> be reflected in data like "avg/p95/p99 time for x action to occur", where x
> is some normal editor workflow.


That is indeed one way you can provide evidence of correlation; but in live deployments (which are, at best, quasi-experiments), you seldom get results that are as unequivocal as the example you're presenting here.  And quantifying the influence of a single causal factor (such as the impact of a particular UI change on time-on-task for this or that editing workflow) is even harder.
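To make that concrete, here's a minimal sketch (Python, with made-up timings and a naive nearest-rank percentile; the field names and numbers are purely hypothetical, not real Wikimedia instrumentation) of the kind of avg/p95/p99 before/after comparison Ryan describes. The point isn't the arithmetic; it's that even a clean-looking difference between the two distributions doesn't, by itself, isolate the UI change as the cause.

```python
# Hypothetical sketch: avg/p95/p99 for a single editor action, before vs. after a change.
import statistics

def summarize(durations_ms):
    """Return avg/p95/p99 (nearest-rank) for a list of action durations in ms."""
    ordered = sorted(durations_ms)
    p95 = ordered[int(0.95 * (len(ordered) - 1))]
    p99 = ordered[int(0.99 * (len(ordered) - 1))]
    return {"avg": statistics.mean(ordered), "p95": p95, "p99": p99}

# Made-up timings (ms) for a "save edit" workflow before and after a UI change.
before = [850, 900, 1200, 950, 3100, 880, 910, 990, 1020, 870]
after  = [860, 940, 1250, 970, 3400, 890, 930, 1010, 1080, 900]

print("before:", summarize(before))
print("after: ", summarize(after))
# Even if "after" looks worse, concurrent deployments, seasonality, and shifts in
# who is still editing all move these distributions in a live, quasi-experimental
# setting -- the numbers show *that* something changed, not *why*.
```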

Knowing that something occurs isn't the same as knowing why. Take the English Wikipedia editor decline. There has been a lot of good research on this subject, and we have confidently identified a set of factors that are likely contributors. Some of these can be directly measured: the decreased retention rate of newcomers; the effect of early, negative experiences on newcomer retention; a measurable increase over time in phenomena (like reverts, warnings, new article deletions) that likely cause those negative experiences. But none of us who have studied the editor decline believe that these are the only factors. And many community members who have read our research don't even accept our premises, let alone our findings.

I'm not at all afraid of sounding pedantic here (or of writing a long-ass wall of text), because I think that many WMF and former-WMF participants in this discussion are glossing over important stuff: Yes, we need a more evidence-based product design process. But we also need a more collaborative, transparent, and iterative deployment process. Having solid research and data at the front end of your product lifecycle is important, but it's not some kind of magic bullet, and it's no substitute for community involvement in product design (throughout the lifecycle).

We have an excellent Research & Data team. The best one we've ever had at WMF. Pound-for-pound, they're as good as or better than the Data Science teams at Google or Facebook. None of them would ever claim, as you seem to here, that all you need to build good products are well-formed hypotheses and access to buckets of log data. 

I had a great conversation with Liam Wyatt at Wikimania (cc'ing him, in case he doesn't follow this list). We talked about strategies for deploying new products on Wikimedia projects: what works, what doesn't. He held up the design/deployment process for Vector as an example of good process, one that we should (re)adopt. 

Vector was created based on extensive user research and community consultation[1]. Then WMF made a beta and invited people across projects to opt in and try it out on prototype wikis[2]. The product team set public criteria for when it would release the product as the default across production projects: retention of 80% of the beta users who had opted in, after a certain amount of time. When a beta tester opted out, they were sent a survey to find out why[3]. The product team attempted to triage the issues reported in these surveys, address them in the next iteration, or (if they couldn't or wouldn't fix them) at least publicly acknowledge the feedback. Then they created a phased deployment schedule and stuck to it[4].

This was, according to Liam (who's been around the movement a lot longer than most of us at WMF), a successful strategy. It built trust and engaged volunteers as both evangelists and co-designers. I am personally very eager to hear what other community members who were around at the time thought of the process, and whether there are other examples of good WMF product deployments that we could crib from as we reassess our current process. From what I've seen, we still follow many good practices in our product deployments, but we follow them haphazardly and inconsistently.

Whether or not we (WMF) think it is fair that we have to listen to "vocal minorities" (Ryan's words), these voices often represent and influence the sentiments of the broader, less vocal contributor base in important ways. And we won't be able to get people to accept our conclusions, however rigorously we demonstrate them or however carefully we couch them in scientific trappings, if they think we're fundamentally incapable of building something worthwhile or deploying it responsibly.

We can't run our product development like "every non-enterprise software company worth a damn" (Steven's words), and that shouldn't be our goal. We aren't a start-up (most of which fail) that can pour all of its resources into one radical new idea. We aren't a tech giant like Google or Facebook that can churn out a bunch of different beta products, throw them at a wall, and see what sticks.

And we're not a commercial community-driven site like Quora or Yelp, which can constantly monkey with its interface and feature set to maximize ad revenue, or try out any old half-baked strategy to monetize its content. There's a fundamental difference between Wikimedia and Quora. In Quora's case, a for-profit company built a platform and invited people to use it. In Wikimedia's case, a bunch of volunteers created a platform, filled it with content, and then a non-profit company was created to support that platform, content, and community.

Our biggest opportunity to innovate, as a company, is in our design process. We have a dedicated, multi-talented, active community of contributors. Those of us who are getting paid should be working on strategies for leveraging that community to make better products, rather than trying to come up with new ways to perform end runs around it.

Jonathan