Cross-posting this call to action from the Analytics list. The Data Lake data sets may be of interest to some tool builders here. This is not a "real time" data set suited to patrolling workflows, but it could be an interesting source of data for deeper analysis of how articles have changed over time. Take a look at the various links to Wikitech for more details on what data is in the collection and how it is prepared.
If you have more questions, I encourage you to subscribe to the analytics@lists.wikimedia.org list and discuss them there, so that the good answers from Leila and others reach the larger Analytics and Research communities that this data set is initially aimed at serving.
Bryan
---------- Forwarded message ---------
From: Leila Zia <leila@wikimedia.org>
Date: Tue, Aug 27, 2019 at 9:47 AM
Subject: [Analytics] [Input requested] Data Lake Edit release input request
To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics. <analytics@lists.wikimedia.org>
[apologies for cross-posting]
In a nutshell: We are asking for your input to help us learn how to release the historical edit data of Wikimedia projects in a more efficient way. Please provide your feedback via https://docs.google.com/forms/d/e/1FAIpQLScc15eSeFrVvAh_ydpX_1p0v6-WSx2qe3Ec... by 2019-09-03.
******
Dear researchers and analytics users,
The Analytics team at Wikimedia Foundation [1] has been working on building a data lake [2] for Wikimedia edits [3] to enable the research and analysis of Wikimedia's edit data in a more efficient way. This data is a history of activity on Wikimedia projects as complete and research-friendly as possible. Edits have context, such as whether they were reverted, in the same line as the edit itself. So you can focus more on what you want to find out instead of writing code to wrestle the data. Each line of the data released will include the following and more (see full specification [3a], [3b], [3c]):
* editor edit count, groups, blocks, bot status, name, current and historical (time of edit)
* seconds since this editor's last edit
* page context, current and historical (namespace, seconds since last revision, etc.)
* seconds to identity revert or deletion, if applicable
* revision tags (mobile edit, ve edit, etc.)
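To illustrate what "context in the same line as the edit" makes possible, here is a minimal Python sketch. The field names (editor_is_bot, editor_edit_count, seconds_to_revert, tags) and the sample rows are hypothetical, chosen only to mirror the list above; they are not the published schema, so please check the specification pages [3a], [3b], [3c] for the real column names before adapting it.

```python
from collections import Counter

# Hypothetical rows: field names and values are illustrative assumptions,
# not the published schema and not real data.
SAMPLE_EDITS = [
    {"editor_is_bot": False, "editor_edit_count": 3,
     "seconds_to_revert": 120, "tags": ["mobile edit"]},
    {"editor_is_bot": False, "editor_edit_count": 4500,
     "seconds_to_revert": None, "tags": ["ve edit"]},
    {"editor_is_bot": True, "editor_edit_count": 90000,
     "seconds_to_revert": None, "tags": []},
    {"editor_is_bot": False, "editor_edit_count": 12,
     "seconds_to_revert": 3 * 86400, "tags": ["mobile edit"]},
]


def quick_revert_rate(edits, max_seconds=3600):
    """Share of non-bot edits that were reverted within max_seconds."""
    human = [e for e in edits if not e["editor_is_bot"]]
    if not human:
        return 0.0
    quick = [
        e for e in human
        if e["seconds_to_revert"] is not None
        and e["seconds_to_revert"] <= max_seconds
    ]
    return len(quick) / len(human)


def tag_counts(edits):
    """Count how often each revision tag appears across the edits."""
    counts = Counter()
    for e in edits:
        counts.update(e["tags"])
    return counts


if __name__ == "__main__":
    print(quick_revert_rate(SAMPLE_EDITS))
    print(tag_counts(SAMPLE_EDITS))
```

Because the revert timing and bot status sit on the same row as the edit, a question like "how many human edits were reverted within an hour" needs no joins against separate revert or user tables; that is the convenience the denormalized layout is meant to provide.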
The first instance of this data will be released in the coming months. To make this release as useful as possible for you, the users of the data, the team needs to hear your thoughts on how to slice and dice the data at publishing time. You can provide your input at https://docs.google.com/forms/d/e/1FAIpQLScc15eSeFrVvAh_ydpX_1p0v6-WSx2qe3Ec... .
Please provide your input to this survey no later than 2019-09-03.
Best,
Leila
[1] https://wikitech.wikimedia.org/wiki/Analytics
[2] https://en.wikipedia.org/wiki/Data_lake
[3] https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits
[3a] https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_hist...
[3b] https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_user...
[3c] https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_page...
--
Leila Zia
Principal Research Scientist, Head of Research
Wikimedia Foundation
_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics