Cross-posting this call to action from the Analytics list. The Data
Lake data sets may be of interest to some tool builders here. This is
not a "real time" data set that would be good for patrolling
workflows, but it might be an interesting source of data for deeper
analysis of how articles have changed over time. Take a look at the
various links to Wikitech for more details on what data is in the
collection and how it is prepared.
If you have more questions, I would encourage you to subscribe to the
<analytics(a)lists.wikimedia.org> list and discuss them there, so that
the good answers from Leila and others also reach the larger Analytics
and Research communities that this data set is initially aimed at
serving.
Bryan
---------- Forwarded message ---------
From: Leila Zia <leila(a)wikimedia.org>
Date: Tue, Aug 27, 2019 at 9:47 AM
Subject: [Analytics] [Input requested] Data Lake Edit release input request
To: A mailing list for the Analytics Team at WMF and everybody who has
an interest in Wikipedia and analytics.
<analytics(a)lists.wikimedia.org>
[apologies for cross-posting]
In a nutshell:
We are asking for your input to help us learn how to release the
historical edit data of Wikimedia projects in a more efficient way.
Please provide your feedback via
https://docs.google.com/forms/d/e/1FAIpQLScc15eSeFrVvAh_ydpX_1p0v6-WSx2qe3E…
by 2019-09-03.
******
Dear researchers and analytics users,
The Analytics team at Wikimedia Foundation [1] has been working on
building a data lake [2] for Wikimedia edits [3] to enable the
research and analysis of Wikimedia's edit data in a more efficient
way. This data is a history of activity on Wikimedia projects that is
as complete and research-friendly as possible. Each edit carries its
context, such as whether it was reverted, on the same line as the edit
itself, so you can focus more on what you want to find out instead of
writing code to wrestle the data. Each line of the data released will
include
the following and more (see full specification [3a], [3b], [3c]):
* editor edit count, groups, blocks, bot status, name, current and
historical (time of edit)
* seconds since this editor's last edit
* page context, current and historical (namespace, seconds since last
revision, etc.)
* seconds to identity revert or deletion, if applicable
* revision tags (mobile edit, ve edit, etc.)
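To illustrate what "context in the same line" could mean in practice,
here is a minimal sketch. The column names, the tab-separated layout,
and the sample rows are all hypothetical and made up for illustration;
the real field names and release format are described in the
specification pages [3a], [3b], [3c].

```python
import csv
import io

# Hypothetical column names and sample rows, for illustration only.
# The real schema is documented in the specification pages.
SAMPLE = (
    "editor_name\teditor_is_bot\tpage_namespace\tseconds_to_revert\ttags\n"
    "Alice\tfalse\t0\t\tmobile edit\n"
    "Bob\tfalse\t0\t120\tve edit\n"
    "ExampleBot\ttrue\t0\t\t\n"
    "Carol\tfalse\t1\t45\t\n"
)


def quick_reverts(lines, max_seconds=300):
    """Return (editor, seconds) for non-bot edits reverted within max_seconds.

    Each row already carries editor and revert context, so no join
    against separate user or revert tables is needed.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    hits = []
    for row in reader:
        if row["editor_is_bot"] == "true":
            continue
        # Assumption: an empty field means the edit was not reverted.
        if row["seconds_to_revert"] == "":
            continue
        seconds = int(row["seconds_to_revert"])
        if seconds <= max_seconds:
            hits.append((row["editor_name"], seconds))
    return hits


result = quick_reverts(io.StringIO(SAMPLE))
print(result)
```

With separate edit, user, and revert tables, the same question would
need joins before any filtering could start.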
The first instance of this data will be released in the coming months.
To make this release as useful as possible for you all, the users of
the data, the team needs to hear your thoughts on how to slice and
dice the data at publishing time. You can provide your input at
https://docs.google.com/forms/d/e/1FAIpQLScc15eSeFrVvAh_ydpX_1p0v6-WSx2qe3E…
.
Please provide your input to this survey no later than 2019-09-03.
Best,
Leila
[1]
https://wikitech.wikimedia.org/wiki/Analytics
[2]
https://en.wikipedia.org/wiki/Data_lake
[3]
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits
[3a]
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_his…
[3b]
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_use…
[3c]
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_pag…
--
Leila Zia
Principal Research Scientist, Head of Research
Wikimedia Foundation
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
--
Bryan Davis Technical Engagement Wikimedia Foundation
Principal Software Engineer Boise, ID USA
[[m:User:BDavis_(WMF)]] irc: bd808