Dear all,
We've talked several times about resource and value, and a clear divide has emerged between those who see the valuable resources at stake as programmer time and IT resource, and my view that the valuable resource we have is our editors' time and patience. Mostly I've expressed that in terms of throttling spam - we don't know how many surveys, and how much overlap between surveys, our editors will accept before they disable email, add research survey sites to their spam filters or start blocking researchers even if we've authorised them. In my view nobody wins if we wait until "the tragedy of the commons" has struck and all researchers have permanently lost access to a large proportion of our editors.
But there is another aspect where I think we may have been talking at cross purposes, and that's in our perception of the commercial value of research access to our community, and the motives of the researchers who have approached us. Wikimedia is a long-established top ten website and one of the most famous examples of crowdsourcing and online communities. Most other successful websites wouldn't dream of allowing a competitor or potential competitor to conduct such research on their community - major websites are worth billions, so an insight from research on another community could be incredibly valuable. Our position is different: we are open to the re-use of our data for commercial purposes per CC-by-SA, and a permissive approach to research is compatible with that. I haven't asked which commercial sponsors, if any, have funded the work of the various researchers who approach us, and I'd be happy for that to continue, provided we keep three safeguards:
I Open licensing. Anyone who wants to broadcast research surveys to our editing community needs to agree that the anonymised results of those surveys will be available under cc-by-sa, and not just a statistical digest but the actual dataset so that variables can be cross tabbed. But I can live with the researcher(s) also having a copy of the data under a different copyright if they are narrowcasting to a small group of editors rather than broadcasting to a large group.
II Timeliness. The cc-by-sa anonymised dataset needs to be published pretty much as soon as it could be, and not kept back until after the researcher has published their analysis of it.
III Transparency. The nightmare scenario to me would be if a top thousand website or aspirant:
1. Sponsors some Academics to do research in an area where they are having difficulty or want to improve their own online community.
2. Sponsors Wikimedia (most of our money comes from individuals, but sometimes a company gives us a few thousand dollars)
3. Their sponsored researcher has private discussions with some or all of us, and gets dispensation not to release part or all of the data they collect in a way that would enable their sponsor's competitors to get the same benefit of it.
4. Either they attribute part of their subsequent turnaround to "insights achieved via research sponsored on Wikimedia", or someone independently links the three previous points and accuses the WMF of selling research access to its editorship, and selling it cheaply.
So far the only argument I've seen for confidentiality is that researchers don't want the data subjects to have a preview of the questions as that could skew the results. I'd accept that as reasonable, if a bit tenuous - the chance of there being a significant overlap between this list and any conceivable research sample is low. But it could be resolved by holding the discussion on an Email thread that doesn't get posted until after the surveys are posted.
Regards
WSC
I Open licensing. Anyone who wants to broadcast research surveys to our editing community needs to agree that the anonymised results of those surveys will be available under cc-by-sa, and not just a statistical digest but the actual dataset so that variables can be cross tabbed. But I can live with the researcher(s) also having a copy of the data under a different copyright if they are narrowcasting to a small group of editors rather than broadcasting to a large group.
Now it seems to me that the disagreement between you and Aaron during the last RCom meeting was a misunderstanding. I would agree (and, hopefully, Aaron would agree too) that the RAW results of the survey should be cc-by-sa licenced. On the other hand, the results of the research itself (by this I mean the analysis and conclusions the researchers make from the raw results, and/or eventually manuscripts) should be open access, but I would not require the cc-by-sa for the manuscript.
II Timeliness. The cc-by-sa anonymised dataset needs to be published pretty much as soon as it could be, and not kept back until after the researcher has published their analysis of it.
Maybe some fixed period would be reasonable? Let us say one month after the end of the survey? This is enough to analyze the data without fearing competition from other researchers.
III Transparency. The nightmare scenario to me would be if a top thousand website or aspirant:
1. Sponsors some Academics to do research in an area where they are having difficulty or want to improve their own online community.
2. Sponsors Wikimedia (most of our money comes from individuals, but sometimes a company gives us a few thousand dollars)
3. Their sponsored researcher has private discussions with some or all of us, and gets dispensation not to release part or all of the data they collect in a way that would enable their sponsor's competitors to get the same benefit of it.
4. Either they attribute part of their subsequent turnaround to "insights achieved via research sponsored on Wikimedia", or someone independently links the three previous points and accuses the WMF of selling research access to its editorship, and selling it cheaply.
So far the only argument I've seen for confidentiality is that researchers don't want the data subjects to have a preview of the questions as that could skew the results. I'd accept that as reasonable, if a bit tenuous - the chance of there being a significant overlap between this list and any conceivable research sample is low. But it could be resolved by holding the discussion on an Email thread that doesn't get posted until after the surveys are posted.
I think pooling to share the questions (as we discussed at the meeting) would be possible without actually disclosing the questions to the public - we just need to find a proper way to do it, perhaps via OTRS. I think asking about sponsorship is quite possible and reasonable.
Cheers
Yaroslav
I have a couple of notes to add.
*Publishing datasets:* Anonymization can take time and should take careful thought. Rushing publication of a dataset is probably not a good idea in that respect. Also, I feel that it is unreasonable to expect researchers to publish their dataset before their work has been accepted for publication. In scientific publishing, being first is everything. It is possible to spend substantial amounts of time (years even) working on a project only to receive no credit after being "scooped". I'm very wary of proposing draconian restrictions such as these without involving the larger researcher community.
Surveys in particular often contain private data that must be anonymized before release. If we are considering such anonymized data to be the "RAW results" of a survey, I would fully agree that this should be the version published, as opposed to some limited subset with questions removed for reasons other than preserving anonymity.
For datasets, I find licenses like cc-by-sa too restrictive, in that they require derivative works to be equally open. I'd much rather let the person who uses a dataset decide for themselves how to license their own work. It's important to make the dataset widely available, but I think that specifying future licensing is unnecessarily restrictive. For example, IANAL, but I imagine that if someone wanted to publish a book with a plot of a cc-by-sa dataset (derivative work), they'd have to license the book cc-by-sa.
*Publishing the manuscript:* As far as the manuscript goes, I still feel confident that requiring researchers to give up rights for modification/derivative works could never work. However, I'm open to the idea of requiring researchers to give up distribution rights to a trusted party (e.g. the WMF, arXiv.org, etc.) to ensure that the manuscript remains freely available to the community. This is possible under the current licensing structure employed by ACM (with which I am most familiar) if the manuscript to be distributed was created prior to submission to ACM (i.e. pre-editing/pre-review/pre-print). See arXiv.org as an example of a mass collection of preprint science.
-Aaron
On Fri, Sep 9, 2011 at 3:41 PM, Yaroslav M. Blanter putevod@mccme.ru wrote:
RCom-l mailing list RCom-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/rcom-l