Pursuant to prior discussions about the need for a research
policy on Wikipedia, WikiProject Research is drafting a
policy regarding the recruitment of Wikipedia users to
participate in studies.
At this time, we have a proposed policy, and an accompanying
group that would facilitate recruitment of subjects in much
the same way that the Bot Approvals Group approves bots.
The policy proposal can be found at:
http://en.wikipedia.org/wiki/Wikipedia:Research
The Subject Recruitment Approvals Group mentioned in the proposal
is being described at:
http://en.wikipedia.org/wiki/Wikipedia:Subject_Recruitment_Approvals_Group
Before we move forward with seeking approval from the Wikipedia
community, we would like additional input about the proposal,
and would welcome additional help improving it.
Also, please consider participating in WikiProject Research at:
http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Research
--
Bryan Song
GroupLens Research
University of Minnesota
How would you anonymize the data? This is very difficult. If a user is
pseudonymized with a random identifier, it is not difficult to
re-identify the user by triangulation. This is particularly the case if
the user is a Wikipedian: the user will often read his/her own user talk
page and the pages s/he edits.
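To make the concern concrete, here is a minimal sketch of how a random identifier could be linked back to an account. The log format, the identifiers, the page titles and the `guess_accounts` helper are all made up for illustration; nothing here describes real Wikimedia data.

```python
from collections import defaultdict

# Hypothetical pseudonymized click log: (random_id, page_title).
# Every identifier and title below is invented for illustration.
LOG = [
    ("u83f1", "User_talk:ExampleEditor"),
    ("u83f1", "Archaeology"),
    ("u83f1", "User_talk:ExampleEditor"),
    ("u83f1", "User:ExampleEditor"),
    ("k2a90", "Main_Page"),
    ("k2a90", "User_talk:SomeoneElse"),
    ("k2a90", "Copenhagen"),
]

def guess_accounts(log, min_visits=2):
    """Map each random ID to user names whose user pages it views repeatedly."""
    counts = defaultdict(lambda: defaultdict(int))
    for pseudo_id, title in log:
        namespace, _, rest = title.partition(":")
        if namespace in ("User", "User_talk") and rest:
            # Count subpages such as User:Name/sandbox under the base name.
            counts[pseudo_id][rest.split("/")[0]] += 1
    return {
        pseudo_id: sorted(name for name, n in names.items() if n >= min_visits)
        for pseudo_id, names in counts.items()
    }

guesses = guess_accounts(LOG)
print(guesses)
```

Repeated visits to one user talk page are only a heuristic, but combined with the list of pages the same identifier reads, they can narrow the candidates quickly.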
Readings:
https://en.wikipedia.org/wiki/AOL_search_data_leak
https://en.wikipedia.org/wiki/Differential_privacy#Netflix_Prize
best regards
Finn Årup Nielsen
On 29-12-2014 at 04:53, Ditty Mathew wrote:
> The exact user information is not needed. The anonymized data is enough.
> What exactly we need is the navigation path of Wikipedia readers.
>
> with regards
>
> Ditty
>
> On Sun, Dec 28, 2014 at 9:46 PM, Oliver Keyes <okeyes(a)wikimedia.org
> <mailto:okeyes@wikimedia.org>> wrote:
>
> Afraid not. First, we do not have some of those datapoints; we do
> not currently have unique user IDs. And, second, it would be a
> tremendous ethical violation for us to release that data that we
> /do/ have (IP addresses, for example).
>
> On 28 December 2014 at 21:00, Ditty Mathew <dittyvkm(a)gmail.com
> <mailto:dittyvkm@gmail.com>> wrote:
>
> Hi,
>
> Is the reader's click log data (it should contain user id/ip,
> article title, timestamp) available for Wikipedia?
>
> with regards
>
> Ditty
>
> _______________________________________________
> Wiki-research-l mailing list
> Wiki-research-l(a)lists.wikimedia.org
> <mailto:Wiki-research-l@lists.wikimedia.org>
> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>
>
>
>
> --
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation
>
Dear Ditty,
If this is of any help, researchers at Stanford had studied the article
navigation behavior of users when the article being searched for is known.
The data and relevant publications can be found here:
http://snap.stanford.edu/data/wikispeedia.html
Cheers
Srijan
I’m glad to announce the release of an open-licensed corpus with 1.5M records from the Article Feedback v5 pilot.
http://dx.doi.org/10.6084/m9.figshare.1277784
Thanks to everyone who helped make this happen, Fabrice in particular for shepherding this through.
Dario
—
This dataset contains the entire corpus of feedback submitted on the English, French and German Wikipedia during the Article Feedback v.5 pilot (AFT). [1] The Wikimedia Foundation ran the Article Feedback pilot for a year between March 2013 and March 2014. During the pilot, 1,549,842 feedback messages were collected across the three languages.
All feedback messages and their metadata (as described in this schema [2]) are available in this dataset, with the exception of messages that have been oversighted and/or deleted by the end of the pilot.
The corpus is released [3] under the following license:
• CC BY SA 3.0 for feedback messages
• CC0 for the associated metadata
Results from the pilot are discussed in: Halfaker, A., Keyes, O. and Taraborelli, D. (2013). Making peripheral participation legitimate: Reader engagement experiments in Wikipedia. CSCW ’13 Proceedings of the 2013 Conference on Computer Supported Cooperative Work. [4][5]
[1] https://www.mediawiki.org/wiki/Article_feedback/Version_5
[2] https://www.mediawiki.org/wiki/Article_feedback/Version_5/Technical_Design_…
[3] https://wikimediafoundation.org/wiki/Feedback_data#Article_Feedback
[4] http://dx.doi.org/10.1145/2441776.2441872
[5] http://nitens.org/docs/cscw13.pdf
Hi
On the question of location of disputes I wrote a blog post a few years ago:
"Auray et al. identify several factors which contribute to conflictuality, such as the number of participants, the location of disputes, and the identity choices of participants. The larger the number of contributors, the more likely discussion is; the threshold number seems to be eight. When there are more than ten participants, discussion increasingly moves to the talk pages of users, and is more likely to degenerate into insults. A surefire indicator of fights are references to policy pages. These can be statistically measured: research by Kriplean and Beschastnikh has shown that pages with more than 250 posts had 51% of the links towards policy pages.
There are two main types of articles where conflicts erupt: first, the usual suspects are topics with burning current affairs value involving inter-ethnic or inter-faith conflicts; second, “scientific” categories with low academic legitimacy such as homeopathy and chiropraxy are strong conflict zones. Suspected “sock-puppetry” (fake identity) is also a source of conflict; an attenuated version of this being the lack of regard for people who have not registered on the site and instead just use an IP address: more than half of the text inserted by “IPs” is deleted, and they are more likely to be present in semi-protected articles which is where disputes and insults typically occur. IPs are also more likely to insult others, so there are suspicions that IPs are registered users who use “socks” to engage in insulting behaviour which they would not dare to do under their registered identities."
http://blog.p2pfoundation.net/wikipedia-and-conflict/2009/07/07
cheers
Mathieu
________________________________________
From: wiki-research-l-bounces(a)lists.wikimedia.org <wiki-research-l-bounces(a)lists.wikimedia.org> on behalf of wiki-research-l-request(a)lists.wikimedia.org <wiki-research-l-request(a)lists.wikimedia.org>
Sent: Tuesday, December 16, 2014 23:01
To: wiki-research-l(a)lists.wikimedia.org
Subject: Wiki-research-l Digest, Vol 112, Issue 24
Today's Topics:
1. Re: commentary on Wikipedia's community behaviour (Aaron gets
a quote) (mjn)
----------------------------------------------------------------------
Message: 1
Date: Tue, 16 Dec 2014 05:28:30 +0100
From: mjn <mjn(a)anadrome.org>
To: Research into Wikimedia content and communities
<wiki-research-l(a)lists.wikimedia.org>
Subject: Re: [Wiki-research-l] commentary on Wikipedia's community
behaviour (Aaron gets a quote)
Message-ID: <87k31si55a.fsf(a)mjn.anadrome.org>
Content-Type: text/plain; charset=utf-8
Perhaps it depends on what part of the encyclopedia? Has anyone
attempted to characterize how the editing environment varies with
different subject matter? I often run across descriptions that don't
comport with either my experience, or that of people I've interviewed,
but it's hard to tell precisely why. I've encountered quite different
beliefs about what the en.wikipedia community is like, even among people
who to me seem to otherwise have a similar background.
Entirely anecdotally, areas of interest seem to be one correlated
factor. For example, writing an article on an archaeological site (one
thing I've mentored new editors in doing) is by and large trouble-free
and friendly, in my experience. But some other areas are not. I haven't
attempted to characterize that factor in any detail.
-Mark
WereSpielChequers <werespielchequers(a)gmail.com> writes:
> We have problems, I don't dispute that. But "ugly and bitter as 4chan"? That has to be an exaggeration.
>
> Regards
>
> Jonathan Cardy
>
>
>> On 13 Dec 2014, at 01:03, Andrew Lih <andrew.lih(a)gmail.com> wrote:
>>
>> I certainly hope you're right Sydney. What a horrible mess.
>>
>>
>>> On Fri, Dec 12, 2014 at 5:53 PM, Sydney Poore <sydney.poore(a)gmail.com> wrote:
>>> I think feminists, especially those who take an interest in STEM, will pass this article around.
>>>
>>> Sydney
>>>
>>>> On Dec 12, 2014 5:35 PM, "Andrew Lih" <andrew.lih(a)gmail.com> wrote:
>>>> It's a good piece, but honestly I think only the dedicated tech reader will make it through the entire story. There's a lot of jargon and insider intrigue such that I could imagine most people never making past the typewriter barf of "BLP, AGF, NOR" :)
>>>>
>>>>
>>>>> On Fri, Dec 12, 2014 at 5:26 PM, Dariusz Jemielniak <darekj(a)alk.edu.pl> wrote:
>>>>> While I agree that the article is overly negative (likely because of the individual experience), I think it still points to an important problem. I don't perceive this article as really problematic in terms of image. Maybe naively, I imagine that people will not stop donating because the community is not ideal.
>>>>>
>>>>> pundit
>>>>>
>>>>>> On Fri, Dec 12, 2014 at 11:16 PM, Kerry Raymond <kerry.raymond(a)gmail.com> wrote:
>>>>>> There’s a saying that everyone likes to eat sausages but nobody likes to know how they are made. It is not good to have negative publicity like that during the annual donation campaign (irrespective of the motivations of the journalist and/or the rights/wrongs of the issue being reported, neither of which I intend to debate here). As a donation-funded organisation, public perception matters a lot.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Kerry
>>>>>>
>>>>>>
>>>>>>
>>>>>> From: Jonathan Morgan [mailto:jmorgan@wikimedia.org]
>>>>>> Sent: Saturday, 13 December 2014 6:43 AM
>>>>>> To: Research into Wikimedia content and communities
>>>>>> Cc: Kerry Raymond
>>>>>> Subject: Re: [Wiki-research-l] commentary on Wikipedia's community behaviour (Aaron gets a quote)
>>>>>>
>>>>>>
>>>>>>
>>>>>> I mostly agree. On one hand, it's always nice to see a detailed description of how wiki-sausage gets made in a major venue. On the other, this journalist clearly has a personal axe to grind, and used his bully pulpit to grind it in public.
>>>>>>
>>>>>>
>>>>>>
>>>>>> - J
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Dec 12, 2014 at 1:39 AM, Federico Leva (Nemo) <nemowiki(a)gmail.com> wrote:
>>>>>>
>>>>>> 1000th addition to the inconsequential rant genre.
>>>>>>
>>>>>> Nemo
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Jonathan T. Morgan
>>>>>>
>>>>>> Community Research Lead
>>>>>>
>>>>>> Wikimedia Foundation
>>>>>>
>>>>>> User:Jmorgan (WMF)
>>>>>>
>>>>>> jmorgan(a)wikimedia.org
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> __________________________
>>>>> prof. dr hab. Dariusz Jemielniak
>>>>> head of the Department of International Management
>>>>> and of the CROW research center
>>>>> Akademia Leona Koźmińskiego
>>>>> http://www.crow.alk.edu.pl
>>>>>
>>>>> member of the Academy of Young Scholars of the Polish Academy of Sciences
>>>>> member of the Science Policy Committee of MNiSW
>>>>>
>>>>> The world's first ethnography of Wikipedia, "Common Knowledge? An Ethnography of Wikipedia" (2014, Stanford University Press), written by me, has been published http://www.sup.org/book.cgi?id=24010
>>>>>
>>>>> Reviews
>>>>> Forbes: http://www.forbes.com/fdc/welcome_mjx.shtml
>>>>> Pacific Standard: http://www.psmag.com/navigation/books-and-culture/killed-wikipedia-93777/
>>>>> Motherboard: http://motherboard.vice.com/read/an-ethnography-of-wikipedia
>>>>> The Wikipedian: http://thewikipedian.net/2014/10/10/dariusz-jemielniak-common-knowledge
This month’s Research showcase will be held tomorrow, Thursday, Dec. 18th at 3PM PST (2300 UTC). As usual, the event will be recorded and publicly streamed on YouTube (link <https://www.youtube.com/watch?v=xPO8XhmeUAU>). We’ll hold a discussion and take questions from the Wikimedia Research IRC channel (#wikimedia-research <http://webchat.freenode.net/?channels=wikimedia-research> on freenode).
Looking forward to seeing you there.
Dario
——
This month:
Mobile Madness: The Changing Face of Wikimedia Readers
By Oliver Keyes <https://www.mediawiki.org/wiki/User:Ironholds>
A dive into the data we have around readership that investigates the rising popularity of the mobile web, countries and projects that are racing ahead of the pack, and what changes in user behaviour we can expect to see as mobile grows.
Global Disease Monitoring and Forecasting with Wikipedia
By Reid Priedhorsky <http://www.lanl.gov/expertise/profiles/view/reid-priedhorsky> (Los Alamos National Laboratory)
Infectious disease is a leading threat to public health, economic stability, and other key social structures. Efforts to mitigate these impacts depend on accurate and timely monitoring to measure the risk and progress of disease. Traditional, biologically-focused monitoring techniques are accurate but costly and slow; in response, new techniques based on social internet data, such as social media and search queries, are emerging. These efforts are promising, but important challenges in the areas of scientific peer review, breadth of diseases and countries, and forecasting hamper their operational usefulness. We examine a freely available, open data source for this use: access logs from the online encyclopedia Wikipedia. Using linear models, language as a proxy for location, and a systematic yet simple article selection procedure, we tested 14 location-disease combinations and demonstrate that these data feasibly support an approach that overcomes these challenges. Specifically, our proof-of-concept yields models with r² up to 0.92, forecasting value up to the 28 days tested, and several pairs of models similar enough to suggest that transferring models from one location to another without re-training is feasible. Based on these preliminary results, we close with a research agenda designed to overcome these challenges and produce a disease monitoring and forecasting system that is significantly more effective, robust, and globally comprehensive than the current state of the art.
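For readers unfamiliar with the kind of model the abstract mentions, the following is a rough sketch of a one-variable linear fit and its r². The numbers are toy values invented for illustration, not data from the study, and the helper names `fit_line` and `r_squared` are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def r_squared(x, y, slope, intercept):
    """Share of the variance in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Toy weekly values, invented for illustration only.
article_views = [1200, 1500, 2100, 2600, 3400, 3900]
reported_cases = [14, 17, 25, 29, 41, 44]

slope, intercept = fit_line(article_views, reported_cases)
print(slope, intercept, r_squared(article_views, reported_cases, slope, intercept))
```

Forecasting in this framing would amount to fitting case counts against views from some number of days earlier; the lag and the article selection are the parts the talk itself addresses.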
http://www.slate.com/articles/technology/bitwise/2014/12/wikipedia_editing_disputes_the_crowdsourced_encyclopedia_has_become_a_rancorous.single.html
This is the predicted fallout of the recent ArbCom case in relation to
civility (although there's a rather longer and more tortuous history to it).
Kerry
Hey all,
Not sure if this would be interesting to researchers or community members,
but: you might remember a paper Stuart and Aaron did a while ago about
measuring edit sessions -
http://www-users.cs.umn.edu/~halfak/publications/Using_Edit_Sessions_to_Mea…
To me it's really interesting, because it's (as much as anything else) a
new metric for measuring participation, and a metric we can extract
additional metrics from (e.g., session length).
As part of some related work on /reader/ sessions, I wrote a pile of code
to handle session reconstruction. I've generalised it (it doesn't care if
you've got reader timestamps, editor timestamps, or Best Buy receipt
timestamps) and thrown it up at https://github.com/Ironholds/reconstructr .
I figure it could be useful to any researchers or community members looking
into sessions.
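For anyone who wants the gist without installing anything, here is a rough Python sketch of the general idea of session reconstruction. It is not the reconstructr code or its interface; the function names are hypothetical, and the one-hour inactivity threshold is an assumption chosen for illustration.

```python
def reconstruct_sessions(timestamps, threshold=3600):
    """Split Unix timestamps (in seconds) into sessions.

    A new session starts whenever the gap to the previous event
    exceeds `threshold` seconds (an assumed cutoff of one hour).
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= threshold:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions

def session_lengths(sessions):
    """Elapsed seconds between the first and last event of each session."""
    return [s[-1] - s[0] for s in sessions]

events = [0, 120, 300, 9000, 9060, 20000]
sessions = reconstruct_sessions(events)
print(sessions)                   # [[0, 120, 300], [9000, 9060], [20000]]
print(session_lengths(sessions))  # [300, 60, 0]
```

Because the input is just a list of numbers, the same function works for any kind of event, which is the point of keeping it general.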
Thanks,
--
Oliver Keyes
Research Analyst
Wikimedia Foundation
Hello Researchers,
I've been playing with Recent Changes Stream Interface
<https://wikitech.wikimedia.org/wiki/RCStream> recently, and have started
trying to use the API's "*action=compare*" to look at every diff of every
wiki in real time. The goal is to produce real-time analytics on the
content that's being added or deleted. The only problem is that it will
really hammer the API with lots of reads since it doesn't have a batch
interface. Can I spawn multiple network threads and do 10+ reads per second
forever without the API complaining? Can I warn someone about this and get
a special exemption for research purposes?
The other thing to do would be to use "*action=query*" to get the revisions
in batches and do the diffing myself, but then I'm not guaranteed to be
diffing in the same way that the site is.
What techniques would you recommend?
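To illustrate the second option, here is a minimal sketch of batching revision IDs into "action=query" requests and diffing locally. The endpoint URL, the batch size of 50, the `rvprop` values and both helper names are assumptions made for illustration, and the local diff uses Python's difflib, so it will not match the site's own diff output. No request is actually sent.

```python
import difflib
from urllib.parse import urlencode

# Assumed endpoint, for illustration; each wiki has its own api.php.
API = "https://en.wikipedia.org/w/api.php"

def batched_query_urls(rev_ids, batch_size=50):
    """Build one action=query URL per batch of revision IDs."""
    urls = []
    for i in range(0, len(rev_ids), batch_size):
        chunk = rev_ids[i:i + batch_size]
        params = {
            "action": "query",
            "prop": "revisions",
            "revids": "|".join(str(r) for r in chunk),
            "rvprop": "ids|content",
            "format": "json",
        }
        urls.append(API + "?" + urlencode(params))
    return urls

def local_diff(old_text, new_text):
    """Return (added_lines, removed_lines) using difflib, not the site's differ."""
    added, removed = [], []
    for line in difflib.ndiff(old_text.splitlines(), new_text.splitlines()):
        if line.startswith("+ "):
            added.append(line[2:])
        elif line.startswith("- "):
            removed.append(line[2:])
    return added, removed

urls = batched_query_urls(list(range(1, 121)))
print(len(urls))
print(local_diff("alpha\nbeta", "alpha\ngamma\ndelta"))
```

With batching, 120 revisions cost three requests instead of 120, at the price of diffs that may differ in detail from what the site shows.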
Make a great day,
Max Klein ‽ http://notconfusing.com/