Responses inline!
On Mon, Jul 27, 2015 at 10:52 PM, Ryan Lane <rlane32@gmail.com> wrote:
The idea of A/B tests is to try to isolate things. You're not going to get perfect data all of the time and you'll likely need to retry experiments with more focus until you can be assured your tests are accurate, but this is definitely doable in live deployments.
I used editing as an example, but you're right that it's difficult to get reliable metrics for a lot of editing actions (though it should be a bit easier in VE). That's of course why I gave a search example previously, which is much easier to isolate. In fact, most reader-based tests should be pretty reliable, since the reader feature set is much smaller and the number of readers is massive. This topic is about skin changes, btw ;).
We started out talking about skin changes, but we've been in meta-discussion-land for a few days now. That's not surprising (we touched on probably the biggest perennial conflicts between WMF and the editing community). What's surprising to me is that, so far, this time the discussion has been both frank and relatively proactive. So I want to ride this wave as far as it takes us.
A/B tests are great, and we should use them more often for reader-facing UI. But a new default skin isn't just reader-facing; it's everyone-facing. Making things easier, more engaging, or more delightful for non-editors isn't going to do us much good if it makes things harder, less engaging, or less delightful for editors.
There are definitely products that are primarily reader-facing. But most of our products (and certainly the default skin) have a substantial impact on the editing experience as well. Earlier, you said the editor community "should be worked around when changes are meant to affect readers and those changes don't directly negatively affect editor metrics." I'd counter that: a) there is no single editor metric, or set of metrics, that we can use to fully determine the impact of a given design change on the editing experience of Wikipedia; and b) even if there were such metrics, it would be highly counterproductive for WMF to say to editors, "we don't care about your experiences, just your aggregate performance." Also, dickish.
Because I see two issues at play here, and I think they are inextricably linked: We need to be more evidence-driven, and we need more, not less, community involvement in our design process.
If we don't become more evidence-driven (which requires updates to both our processes and our infrastructure), we will always struggle to build products that meet the needs of our users (readers, editors, third-party MediaWiki peeps).
But *whether or not we become more evidence-driven*, we will always struggle to get the products we build implemented, if our most powerful user group doesn't currently trust us to act in their best interest. Or even our own.
Knowing that something occurs isn't the same as knowing why. Take the English Wikipedia editor decline. There has been a lot of good research on this subject, and we have confidently identified a set of factors that are likely contributors. Some of these can be directly measured: the decreased retention rate of newcomers; the effect of early, negative experiences on newcomer retention; a measurable increase over time in phenomena (like reverts, warnings, new article deletions) that likely cause those negative experiences. But none of us who have studied the editor decline believe that these are the only factors. And many community members who have read our research don't even accept our premises, let alone our findings.
The best way to solve a complex problem is to first understand the problem (which you've done through research), then break it down into small, actionable parts (you've already mentioned them), then tackle each part by proposing solutions, implementing them in a testable way, and seeing whether the results are positive.
The results of changing the warnings gave pretty strong indications that the new messages moved the retention numbers in a positive direction, right? Why shouldn't we trust the data there? If the data wasn't good enough, is there any way to improve the methodology to make them more accurate?
The data were good :) Actually, Snuggle and the Teahouse both came out of this line of research. These two products share several features that Winter (and most of our major products) don't:
1. They are permanently opt-in: no person has to use them, and no Wikimedia project has to adopt them.
2. They add functionality, rather than replacing it.
3. They are incrementalist approaches to addressing a major issue identified through careful front-end research.
4. They were designed in collaboration (not just consultation) with editors.
5. They are powered (to this day) by dedicated volunteers who are invested in their success.
6. They were cheap to build, and are cheap to maintain.
Some of these features probably limit their overall impact. But they virtually assure their long-term sustainability, which means they can keep on addressing the newcomer retention problem, even after the grants/dissertations that supported their development are gone. FWIW, many other new editor engagement products have had to be scuttled after the product team that developed them (and championed them) was disbanded, or the Foundation's priorities changed.
I'm not suggesting that this design approach offers a template for how to make people <3 VE or whatever, but there are lessons here about how to do evidence-based design well, and about the advantages of getting core contributors to feel invested in what you build.
Until the very, very recent past there wasn't even the ability to measure the simplest of things. There are no real-time or near-real-time measurements. There are no health dashboards for vital community metrics. There's no experimentation framework, and without one there are no run-time controls for product managers to run A/B tests of feature-flagged features. There are very few analytics events in MediaWiki.
I don't want to sound negative, and I understand why all of this is the case: analytics was poorly resourced, ignored, and managed into the ground until pretty recently. But Wikimedia isn't at the level of most early startups when it comes to analytics.
Wikimedia does have (and has historically had) excellent researchers who have been doing amazing work with insanely small amounts of data and infrastructure.
I didn't think you were dissing the researchers; sorry if it came off that way. My point was that our research & data team know that a) A/B tests alone aren't usually sufficient to justify major design changes, and b) good science won't convince anyone if they already mistrust or dislike you. Leila and Aaron, for example, have had to invest a lot of time explaining, contextualizing, and defending their research, and trying to (re)build trust so that people will give it a fair hearing.
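For concreteness, the kind of run-time experiment control Ryan describes above might look something like the sketch below. Everything here is hypothetical (the experiment names, the ramp percentages, the logging call); it is not an existing MediaWiki or Wikimedia API, just an illustration of deterministic bucketing plus an analytics event.

```python
import hashlib

# Hypothetical sketch only: none of these names correspond to a real
# MediaWiki API. It shows the shape of a run-time control for a
# feature-flagged A/B test.
EXPERIMENTS = {
    # experiment name -> fraction of traffic that gets the flagged feature;
    # a product manager could adjust this at run time, without a deploy.
    "winter-skin": 0.05,
}

def bucket(user_token: str, experiment: str) -> str:
    """Deterministically assign a user token to 'treatment' or 'control'."""
    digest = hashlib.sha256(f"{experiment}:{user_token}".encode()).hexdigest()
    fraction = int(digest[:8], 16) / 0xFFFFFFFF
    return "treatment" if fraction < EXPERIMENTS[experiment] else "control"

def log_event(user_token: str, experiment: str, action: str) -> None:
    """Stand-in for an analytics event (think EventLogging-style schema)."""
    print({"experiment": experiment,
           "bucket": bucket(user_token, experiment),
           "action": action})

log_event("reader-session-123", "winter-skin", "search-result-click")
```

The specifics don't matter; the point is that deterministic assignment, run-time ramping, and event capture are the minimum pieces an experimentation framework needs before "just A/B test it" is actionable.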
I was on that project (the Usability Initiative) as an ops engineer; I was hired for it, in fact. I remember that project well, and I wouldn't call it a major success. It was successful in that it changed the default skin to something slightly more modern than Monobook, but that was the only truly successful part of the entire project. I think Vector is the only surviving code from it. The vast majority of Vector features didn't make it permanently into the Vector skin. Mostly what stayed around was the "look and feel" of the skin.
The community was a lot more accepting of change then, but it was still a pretty massive battle. The PM of that project nearly worked herself to death.
Right! It's way harder now. All of us whose jobs require us to interact with community members around product design have to fight that battle. There's a lot of mistrust: we're perceived by many as being incompetent and/or acting in bad faith vis-à-vis the core contributors to Wikimedia projects. It really sucks sometimes.
But we, as an organization (if not as individuals), bear a good deal of responsibility for the state we're in. A lot of it stems from the way we have designed and deployed products in the past. Fixing that requires more than more research and better testing infrastructure. And perpetuating the meme that the community is afraid of change and that's why we can't have nice things... certainly doesn't help.
Whether or not we (WMF) think it is fair that we have to listen to "vocal minorities" (Ryan's words), these voices often represent and influence the sentiments of the broader, less vocal contributor base in important ways. And we won't be able to get people to accept our conclusions, however rigorously we demonstrate them or however carefully we couch them in scientific trappings, if they think we're fundamentally incapable of building something worthwhile, or deploying it responsibly.
Yeah. Obviously it's necessary to not ship broken or very buggy code, but that's a different story. It's also a lot easier to know if your code is broken when you A/B test it before it's shipped. It should be noticeable from the metrics, or the metrics aren't good enough.
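As a concrete (and entirely made-up) illustration of "noticeable from the metrics": a standard check is to compare a single guardrail metric between the control and treatment buckets, for example with a two-proportion z-test. The metric and the numbers below are invented.

```python
from math import sqrt
from statistics import NormalDist

# Minimal sketch: compare a hypothetical click-through rate between the
# control and treatment buckets of an A/B test. All numbers are made up.
def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Return the z statistic and two-sided p-value for p_a vs p_b."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Control: 4.8% of 100k sessions clicked a result; treatment: 4.1% of 100k.
z, p = two_proportion_z(4800, 100_000, 4100, 100_000)
print(f"z = {z:.2f}, p = {p:.4f}")  # a drop this large is clearly detectable
```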
We can't run our product development like "every non-enterprise software company worth a damn" (Steven's words), and that shouldn't be our goal. We aren't a start-up (most of which fail) that can focus all our resources on one radical new idea. We aren't a tech giant like Google or Facebook that can churn out a bunch of different beta products, throw them at a wall, and see what sticks.
What's your proposal that's somehow better than what most of the rest of the sites on the internet are doing? Maybe you can't do exactly what they're doing due to lack of resources, but you can at least do the basics.
My proposal is that we should follow a more participatory design process. Better tools and research are necessary, but insufficient. And the "consulting" model that Quora uses isn't appropriate to Wikimedia.
It sounds to me like you and Steven think that we can build faster and better if we distance ourselves more from the community--abstracting their experience as metrics, and limiting their participation to consultation. But I don't think that what's slowing us down is our efforts to work with communities around what we deploy, where we deploy it, and when. I think what slows us down is that we constantly say that we're open and collaborative, but often fail to be open and collaborative when it matters most. This engenders mistrust, which makes it harder for us to experiment, delays deployments, results in buggier, less usable, and less useful products, and virtually guarantees that many of our core users are going to defer or actively resist adopting what we build.
In order to dig ourselves out, let's pursue a two-pronged strategy: a) evidence-driven product development: using quantitative and qualitative research to decide what to build and how to build it; and b) a transparent, iterative, and participatory process: telling people what we intend to build, when and under what circumstances we intend to deploy it, and consistently addressing the feedback we get from people at every stage, in good faith.
We won't ever succeed with a) if we don't show that we can implement b) consistently.
And we're not a commercial community-driven site like Quora or Yelp, which can constantly monkey with its interface and feature set in order to maximize ad revenue or try out any old half-baked strategy to monetize its content. There's a fundamental difference between Wikimedia and Quora. In Quora's case, a for-profit company built a platform and invited people to use it. In Wikimedia's case, a bunch of volunteers created a platform, filled it with content, and then a non-profit company was created to support that platform, content, and community.
I don't understand how you can say this. This is exactly how fundraising at WMF works, and it's been shown to be incredibly effective. WMF is most likely the most effective organization in the world at large-scale, small-donation fundraising. It got that way because it constantly tests changes to see what's more effective, using almost exactly the methodology I'm describing. Why can't we bring a little bit of this awesomeness into the rest of the engineering organization?
Fundraising is great! I love fundraising. And not just because they pay my salary--they have great research and an enviable testing infrastructure. But tracking the performance of banners that drive monetary contributions is a fundamentally different task from tracking the performance (<-- not sure that word even applies) of a whole new default UI that fundamentally changes the way both casual readers and dedicated editors interact with Wikipedia. Fundraising products, and the process by which we design and evaluate them, aren't representative of our big software products like Mobile site/apps, Content Translation, VE, Flow, etc.
That's why I'm pushing on your "we can make it work through A/B testing" thesis around deploying something as radical and complex as Winter, as opposed to iterating on Vector. A whole new skin affects everyone's experience of the site in complex and multifaceted ways; there's no single (or even primary) metric of performance. And we can't expect to short-cut the design process or short-circuit community involvement. The only way out is through.
Jonathan