*TL;DR*
We want to get researchers in a room to experiment with infrastructure for
making open data science easier. We're focusing on three infrastructural
strategies: (1) improving metadata and indexing for open online community
datasets, (2) building an online querying service that makes processing,
joining, and extracting subsets of data easier, and (3) defining a protocol
for reporting research methods that will make studies easier to replicate
and extend.
*Title:* Breaking into new Data-Spaces: Infrastructure for Open Community
Science
*Date:* February 27, 2016
*Application deadline:* December 31, 2015
*Conference website:* http://cscw.acm.org/2016/program/workshops.php#WP-10
*Apply/info:*
https://meta.wikimedia.org/wiki/Research:Breaking_into_new_Data-Spaces
*Participants announced:* January 15, 2016
We encourage you to apply
<https://wikimedia.qualtrics.com/SE/?SID=SV_2bCdc2BGBGAWwmx> to a CSCW 2016
<http://cscw.acm.org/2016/> workshop focused on advancing your ability to
work with datasets from online communities. We will experiment with
documentation protocols and technologies that are designed to make the
process of “breaking into” a new dataset more tractable for researchers
studying open online communities.
*Who can participate*
Anyone who builds, manages, studies, or is interested in studying open
online communities can apply. Fill out our application form and tell us a
bit about your relevant interests and experience.
*Organizers*
Aaron Halfaker, Jonathan Morgan, Yuvaraj Pandian - Wikimedia Foundation
Elizabeth Thiry - Boundless
Kristen Schuster, A.J. Million, Sean Goggins - University of Missouri
William Rand - University of Maryland
David Laniado - Eurecat
*Abstract*
Despite being easily accessible, open online community (OOC) data can be
difficult to use effectively. To access and analyze large amounts
of data, researchers must first become familiar with the meaning of data
values. Then they must find a way to obtain and process the datasets to
extract their desired vectors of behavior and content. This process is
fraught with problems that are solved (with great difficulty) over and
over again by each research team/lab that breaks into datasets for a new
OOC.
In this workshop, we'll experiment with documentation protocols and
technologies that are designed to make the process of “breaking into” a new
dataset more tractable for researchers studying open online communities.
This workshop’s purpose is to bring researchers together to test these
systems and to identify problems and missed opportunities, supporting
further iteration. Participants will also have the opportunity to use
state-of-the-art documentation and technologies to break into a new
collection of datasets. This workshop is a direct result of calls to
action, from past CSCW workshops and related conferences, to build
infrastructure for data sharing between researchers.
For more information and to apply see:
https://meta.wikimedia.org/wiki/Research:Breaking_into_new_Data-Spaces
Hi Everyone,
The next Research Showcase will be live-streamed this Wednesday, November
18, 2015 at 11:30 (PST).
YouTube stream: http://www.youtube.com/watch?v=kXCI6whgdUA
As usual, you can join the conversation on IRC at #wikimedia-research. And,
you can watch our past research showcases here
<https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase#Archive>.
We look forward to seeing you!
Kind regards,
Sarah R. Rodlund
Project Coordinator-Engineering, Wikimedia Foundation
srodlund@wikimedia.org
This month:
*Impact, Characteristics, and Detection of Wikipedia Hoaxes*
By Srijan Kumar
False information on Wikipedia raises concerns about its credibility. One
way in which false information may be presented on Wikipedia is in the form
of hoax articles, i.e. articles containing fabricated facts about
nonexistent entities or events. In this talk, we study false information on
Wikipedia by focusing on the hoax articles that have been created
throughout its history. First, we assess the real-world impact of hoax
articles by measuring how long they survive before being debunked, how many
pageviews they receive, and how heavily they are referred to by documents
on the Web. We find that, while most hoaxes are detected quickly and have
little impact on Wikipedia, a small number of hoaxes survive long and are
well cited across the Web. Second, we characterize the nature of successful
hoaxes by comparing them to legitimate articles and to failed hoaxes that
were discovered shortly after being created. We find characteristic
differences in terms of article structure and content, embeddedness into
the rest of Wikipedia, and features of the editor who created the hoax.
Third, we successfully apply our findings to address a series of
classification tasks, most notably to determine whether a given article is
a hoax. Finally, we describe and evaluate a task in which humans
distinguish hoaxes from non-hoaxes. We find that humans are not
particularly good at the task and that our automated classifier outperforms
them by a wide margin.
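To make the classification setup concrete, here is a minimal sketch of the
kind of feature-based classifier the abstract describes. This is not the
authors' code: the feature names, example values, and model choice are
hypothetical stand-ins for the three feature families named above (article
structure/content, embeddedness in the rest of Wikipedia, and creator
characteristics).

    # A minimal sketch of a feature-based hoax classifier; NOT the
    # authors' code. Feature names and values are hypothetical.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # One row per article. Hypothetical columns, following the three
    # feature families in the abstract:
    #   structure/content: text_length, n_wiki_links, n_references
    #   embeddedness:      n_inlinks_from_other_articles
    #   creator:           creator_edit_count, creator_account_age_days
    X = np.array([
        [1200,  3,  0,  0,   5,   2],   # thin, isolated, new account
        [5400, 45, 12, 30, 900, 700],   # well-embedded, veteran editor
    ])
    y = np.array([1, 0])  # 1 = hoax, 0 = legitimate

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X, y)  # in practice: many labeled articles, not two
    print(clf.predict([[800, 1, 0, 0, 2, 1]]))  # score a new article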
Agree. A great step forward for all of us who do outreach. Many thanks to
everyone who made this happen :-)
--
James Heilman
MD, CCFP-EM, Wikipedian
The Wikipedia Open Textbook of Medicine
www.opentextbookofmedicine.com
As of July 2015 I am a board member of the Wikimedia Foundation
My emails, however, do not represent the official position of the WMF
Shall do! I'm already linking in the internal documentation :)
On 17 November 2015 at 21:11, Madhumitha Viswanathan
<mviswanathan@wikimedia.org> wrote:
> Woot! Nice :) Would be cool to link to the API docs from your README too.
>
> On Tue, Nov 17, 2015 at 5:54 PM, Oliver Keyes <okeyes@wikimedia.org> wrote:
>>
>> Hey!
>>
>> As y'all may have seen, we have a new pageviews API, with much finer
>> granularity and better recall than the existing data. Since I had
>> advance notice of the release, I was able to put together an R client
>> already - you can get it at https://github.com/Ironholds/pageviews if
>> R is your language of choice, and it'll be up on CRAN shortly.
>>
>> Thanks,
>>
>> --
>> Oliver Keyes
>> Count Logula
>> Wikimedia Foundation
>>
>
> --
> --Madhu :)
--
Oliver Keyes
Count Logula
Wikimedia Foundation
Hey!
As y'all may have seen, we have a new pageviews API, with much finer
granularity and better recall than the existing data. Since I had
advance notice of the release, I was able to put together an R client
already - you can get it at https://github.com/Ironholds/pageviews if
R is your language of choice, and it'll be up on CRAN shortly.
Thanks,
--
Oliver Keyes
Count Logula
Wikimedia Foundation
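For anyone not using R: the new pageviews API is also reachable as a plain
REST endpoint, so a few lines of Python work too. A minimal sketch; the
article and date range below are arbitrary examples.

    # A minimal sketch of querying the new pageviews API directly over
    # REST (the same data the R client exposes). Article and date range
    # are arbitrary examples.
    import requests

    BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews"
    url = (BASE + "/per-article/en.wikipedia/all-access/all-agents"
                  "/Albert_Einstein/daily/20151101/20151117")

    # A descriptive User-Agent is polite (and expected) for Wikimedia APIs.
    resp = requests.get(url, headers={"User-Agent": "pageviews-example/0.1"})
    resp.raise_for_status()

    for item in resp.json()["items"]:
        print(item["timestamp"], item["views"])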
+research
Fascinating. Thanks for sharing this, Nemo. And for setting those arrogant
Stackers straight ;)
For anyone else interested: Nemo was able to answer this question because
StackExchange has a Quarry <http://quarry.wmflabs.org/>-like public query
interface of its own. You should go play with it right now:
http://data.stackexchange.com/
Jonathan
On Fri, Nov 13, 2015 at 10:56 AM, Federico Leva (Nemo) <nemowiki@gmail.com>
wrote:
> Some information at
> https://meta.stackexchange.com/questions/269334/how-many-active-users-contr…
>
> TL;DR: not really, and definitely not StackOverflow alone (~14k). But
> perhaps the whole StackExchange has more than the English Wikipedia alone.
>
> Nemo
>
--
Jonathan T. Morgan
Senior Design Researcher
Wikimedia Foundation
User:Jmorgan (WMF) <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>
Hi all,
I'm writing this email on the public list hoping that the discussion may
be of interest to more people.
I am working with a student on scientific citations on Wikipedia and,
very simply put, we would like to use the pageview dataset to have a
rough measure of how many times a paper was viewed thanks to
Wikipedia.[*]
The full dataset is, as of now, ~ 4.7TB in size.
I have two questions:
* if we download this dataset, a first estimate suggests ~ 30 days of
continuous download (assuming an average download speed of ~ 2MB/s,
which is what we measured while downloading one month of data (~ 64GB);
see the back-of-envelope sketch below). Here at my university (Trento,
Italy), downloads of this kind have to be reported to the IT
department. I was wondering whether this would be useful information
for the WMF, too.
* given the estimate above, I was wondering if it is possible to
obtain this data over FedEx bandwidth (see [1]), i.e. by shipping a
physical disk. I know that in some fields (e.g. neuroscience) this
is the standard way to exchange big datasets (on the order of TBs).
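A quick back-of-envelope check of that first estimate, using only the
figures above (4.7TB total at a sustained ~ 2MB/s, decimal units):

    # Back-of-envelope check of the download-time estimate above.
    # Figures taken from the email: 4.7 TB total at a sustained ~2 MB/s.
    DATASET_TB = 4.7
    SPEED_MB_PER_S = 2.0

    total_mb = DATASET_TB * 1_000_000    # 1 TB = 10^6 MB (decimal units)
    seconds = total_mb / SPEED_MB_PER_S
    days = seconds / 86_400              # 86,400 seconds per day

    print("~%.0f days of continuous download" % days)  # ~27 days

That lands close to the ~ 30 days quoted once speed variation and overhead
are allowed for.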
Thanks in advance for your help.
Cristian
[*] I know these are pageviews and not unique visitors; furthermore,
there is no guarantee that viewing a citation means anything. I am
approaching this data the same way "impressions" versus
"click-throughs" are treated in the online advertising world.
[1] https://what-if.xkcd.com/31/