Hello everybody,
Within the context of the Knowledge Integrity program https://research.wikimedia.org/knowledge-integrity.html, the Research Team (together with our formal collaborators https://www.mediawiki.org/wiki/Wikimedia_Research/Formal_collaborations) has been working on releasing relevant datasets in this area.
Recently we have published the following datasets:
- Tracking Knowledge Propagation Across Wikipedia Languages: A dataset of inter-language knowledge propagation in Wikipedia. Covering all 309 language editions and 33M articles, the dataset aims to track the full propagation history of Wikipedia concepts and to enable follow-up research on predictive models of that propagation. For this purpose, we align all Wikipedia articles in a language-agnostic manner according to the concept they cover, their topic, and the timestamp of each article's creation, which results in 13M propagation instances (paper https://arxiv.org/abs/2103.16613, dataset https://zenodo.org/record/4433137, code https://github.com/rodolfovalentim/wikipedia-content-propagation, meta https://meta.wikimedia.org/wiki/Research:Exploration_on_content_propagation_across_Wikimedia_projects ).
- Wiki-Reliability: A Large Scale Dataset for Content Reliability on (English) Wikipedia: We selected the 10 most popular reliability-related templates on English Wikipedia and proposed an effective method to label almost 1M samples of Wikipedia article revisions as positive or negative with respect to each template. Each positive/negative example in the dataset comes with the full article text and 20 features from the revision's metadata (paper https://arxiv.org/abs/2105.04117, dataset https://figshare.com/articles/dataset/Wiki-Reliability_A_Large_Scale_Dataset_for_Content_Reliability_on_Wikipedia/14113799, code https://github.com/kay-wong/Wiki-Reliability/, meta https://meta.wikimedia.org/wiki/Research:Wiki-Reliability:_A_Large_Scale_Dataset_for_Content_Reliability_on_Wikipedia ).
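For readers who want an intuition for the alignment step in the knowledge propagation dataset, here is a minimal sketch, assuming each article can be reduced to a concept identifier, a language code, and a creation timestamp. The record layout, the function name, and the sample values are all hypothetical; they do not reflect the dataset's actual schema or the method used in the paper.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical records: (concept_id, language, article creation time).
# These values are invented for illustration, not taken from the dataset.
ARTICLES = [
    ("concept-A", "en", "2004-03-01T12:00:00"),
    ("concept-A", "es", "2006-07-15T08:30:00"),
    ("concept-A", "pt", "2005-01-20T17:45:00"),
    ("concept-B", "de", "2010-05-05T09:00:00"),
]


def propagation_sequences(articles):
    """Group articles by concept and order each group by creation time."""
    by_concept = defaultdict(list)
    for concept_id, language, created in articles:
        by_concept[concept_id].append((datetime.fromisoformat(created), language))
    # Sorting the (timestamp, language) pairs puts the earliest article first.
    return {
        concept: [language for _, language in sorted(entries)]
        for concept, entries in by_concept.items()
    }


sequences = propagation_sequences(ARTICLES)
for concept, languages in sequences.items():
    print(concept, " -> ".join(languages))
```

Each resulting list gives the order in which a concept appeared across language editions, which is the kind of sequence a predictive model of propagation could be trained on.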
We hope that these datasets can be used by the research community to keep working on understanding and modeling knowledge integrity in Wikipedia.
We are currently expanding both datasets. For knowledge propagation, we are characterizing the different types of cascades and building new prediction models. For Wiki-Reliability, we are extending the dataset to more languages.
If you have any questions about these datasets or related projects, please feel free to contact me.
Best,
--
Diego Sáez Trumper
Senior Research Scientist
Wikimedia Foundation.
wiki-research-l@lists.wikimedia.org