+1 to Leila. Really good suggestions re. making the dataset cite-able and providing an in-depth discussion of how it was produced. That's a lot of work, but it could produce a bunch of additional value.
Thanks for working on this, A-team. I wish I could transport it back to the past so I could use it to finish my dissertation faster!
On Tue, Feb 11, 2020 at 3:30 PM Giovanni Luca Ciampaglia glc3@mail.usf.edu wrote:
Hi Joseph,
Thanks a lot for creating and sharing such a valuable resource. I went through the schema and from what I understand there is no information about page-to-page links, correct? Are there any resources that would provide such historical data?
Best,
*Giovanni Luca Ciampaglia* ∙ glciampaglia.com
Assistant Professor
Computer Science and Engineering https://www.usf.edu/engineering/cse/ ∙ University of South Florida https://www.usf.edu/
*Due to Florida’s broad open records law, email to or from university employees is public record, available to the public and the media upon request.*
On Mon, Feb 10, 2020 at 11:28 AM Joseph Allemandou <jallemandou@wikimedia.org> wrote:
Hi Analytics People,
The Wikimedia Analytics Team is pleased to announce the release of the most complete dataset we have to date to analyze content and contributor metadata: Mediawiki History [1] [2].
Data is in TSV format, usually released monthly around the 3rd of the month, and every new release contains the full history of metadata.
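As a rough illustration of working with tab-separated data like this, here is a minimal Python sketch that counts rows by the value in one column. It is only a sketch under assumptions: the helper name `count_by_column`, the column position, and the sample rows are made up for illustration and do not reflect the real schema, which is documented in the readme [1]. It also assumes the file has no header row.

```python
import csv
import io
from collections import Counter


def count_by_column(tsv_file, column_index):
    """Count rows of a header-less TSV stream by the value in one column.

    `column_index` is a hypothetical position; check the dataset readme
    for the actual field order before using this on real data.
    """
    # QUOTE_NONE keeps quote characters inside fields as literal text.
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter()
    for row in reader:
        # Skip rows too short to contain the requested column.
        if column_index < len(row):
            counts[row[column_index]] += 1
    return counts


# Made-up sample rows standing in for a real dump file.
sample = io.StringIO(
    "revision\tcreate\t2020-01-01\n"
    "page\tcreate\t2020-01-02\n"
    "revision\tcreate\t2020-01-03\n"
)
counts = count_by_column(sample, 0)
print(counts)
```

For a real file, the same function accepts any text file object. If the download is bzip2-compressed (an assumption to verify against the readme), `bz2.open(path, "rt", encoding="utf-8", newline="")` from the standard library returns a suitable one, and reading row by row avoids loading the full history into memory.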
The dataset contains an enhanced [3] and historified [4] version of user, page and revision metadata and serves as the basis for the Wikistats API on edits, users and pages [5] [6].
We hope you will have as much fun playing with the data as we have building it, and we're eager to hear from you [7], whether for issues, ideas or usage of the data.
Analytically yours,
--
Joseph Allemandou (joal) (he / him)
Sr Data Engineer
Wikimedia Foundation
[1] https://dumps.wikimedia.org/other/mediawiki_history/readme.html
[2] https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_hist...
[3] Many pre-computed fields are present in the dataset, from edit counts by user and page to revert and reverted information, as well as time between events.
[4] Historical usernames and page titles (as well as user groups and blocks), as accurate as possible, are available in addition to current values, and are provided in a denormalized way on every event of the dataset.
[5] https://wikitech.wikimedia.org/wiki/Analytics/AQS/Wikistats_2
[6] https://wikimedia.org/api/rest_v1/
[7] https://phabricator.wikimedia.org/maniphest/task/edit/?title=Mediawiki%20His...
Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l