New subject: Privacy commercial value and what I perceive as a pitfall

10 Sep 2011


Dear all,
We've talked several times about resource and value, and a clear divide has
emerged between those who see the valuable resources at stake here as
programmer time and IT resource, and my view that the valuable resource we
have is our editors' time and patience. Mostly I've expressed that in terms
of throttling spam - we don't know how many surveys and how much overlap in
surveys our editors will accept before they disable email, add research
survey sites to the spam filter or start blocking researchers even if we've
authorised them. In my view nobody wins if we wait until "the tragedy of the
commons" has struck and all researchers have permanently lost access to a
large proportion of our editors.
But there is another aspect where I think we may have been talking at cross
purposes, and that's in our perception of the commercial value of research
access to our community, and the motives of the researchers who have
approached us. Wikimedia is a long-established top-ten website and one of
the most famous examples of crowdsourcing and online communities. Most of
the other successful websites wouldn't dream of allowing a competitor or
potential competitor to conduct such research on their community - major
websites are worth billions, so an insight from research on another
community could be incredibly valuable. Our position is different: we are
open to the re-use of our data for commercial purposes per CC-by-SA, and a
permissive approach to research is compatible with that. I haven't asked
what commercial sponsors, if any, have funded the work of the various
researchers who approach us, and I'd be happy for that to continue, provided
we keep three safeguards:
I Open licensing. Anyone who wants to broadcast research surveys to our
editing community needs to agree that the anonymised results of those
surveys will be available under CC-by-SA, and not just as a statistical
digest but as the actual dataset, so that variables can be cross-tabulated.
But I can
live with the researcher(s) also having a copy of the data under a different
copyright if they are narrowcasting to a small group of editors rather than
broadcasting to a large group.
II Timeliness. The CC-by-SA anonymised dataset needs to be published pretty
much as soon as it could be, and not kept back until after the researcher
has published their analysis of it.
III Transparency. The nightmare scenario to me would be if a top thousand
website or aspirant:
1. Sponsors some academics to do research in an area where it is having
   difficulty or wants to improve its own online community.
2. Sponsors Wikimedia (most of our money comes from individuals, but
   sometimes a company gives us a few thousand dollars).
3. Its sponsored researcher has private discussions with some or all of
   us, and gets dispensation not to release part or all of the data they
   collect in a way that would enable the sponsor's competitors to get the
   same benefit from it.
4. Either it attributes part of its subsequent turnaround to "insights
   achieved via research sponsored on Wikimedia", or someone independently
   links the three previous points and accuses the WMF of selling research
   access to its editorship, and selling it cheaply.
So far the only argument I've seen for confidentiality is that researchers
don't want the data subjects to have a preview of the questions as that
could skew the results. I'd accept that as reasonable, if a bit tenuous -
the chance of there being a significant overlap between this list and any
conceivable research sample is low. But it could be resolved by holding the
discussion on an email thread that isn't made public until after the
surveys have been sent out.
Regards,
WSC