Two new datasets for understanding Knowledge Integrity in Wikipedia - Wiki-research-l

7 Jun 2021

Hello everybody,

Within the context of the Knowledge Integrity program
<https://research.wikimedia.org/knowledge-integrity.html>, the Research
Team (and our formal collaborators
<https://www.mediawiki.org/wiki/Wikimedia_Research/Formal_collaborations>)
has been working on releasing relevant datasets in this area.

Recently we have published the following datasets:

   - Tracking Knowledge Propagation Across Wikipedia Languages: A dataset of
     inter-language knowledge propagation in Wikipedia. Covering all 309
     language editions and 33M articles, the dataset aims to track the full
     propagation history of Wikipedia concepts and to enable follow-up
     research on building predictive models of that propagation. For this
     purpose, we align all Wikipedia articles in a language-agnostic manner
     according to the concept they cover, their topic, and the creation
     timestamp of each article, which results in 13M propagation instances.
     (paper <https://arxiv.org/abs/2103.16613>, dataset
     <https://zenodo.org/record/4433137>, code
     <https://github.com/rodolfovalentim/wikipedia-content-propagation>, meta
     <https://meta.wikimedia.org/wiki/Research:Exploration_on_content_propagation_across_Wikimedia_projects>)

   - Wiki-Reliability: A Large Scale Dataset for Content Reliability on
     (English) Wikipedia: We selected the 10 most popular reliability-related
     templates on English Wikipedia and proposed an effective method to label
     almost 1M samples of Wikipedia article revisions as positive or negative
     with respect to each template. Each positive/negative example in the
     dataset comes with the full article text and 20 features from the
     revision's metadata. (paper <https://arxiv.org/abs/2105.04117>, dataset
     <https://figshare.com/articles/dataset/Wiki-Reliability_A_Large_Scale_Dataset_for_Content_Reliability_on_Wikipedia/14113799>,
     code <https://github.com/kay-wong/Wiki-Reliability/>, meta
     <https://meta.wikimedia.org/wiki/Research:Wiki-Reliability:_A_Large_Scale_Dataset_for_Content_Reliability_on_Wikipedia>)
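
For readers who want a feel for the alignment step in the knowledge
propagation dataset, here is a minimal Python sketch. It groups article
records by the concept they cover and orders each group by creation
timestamp, so that each concept yields an ordered sequence of language
editions. The record layout (concept_id, lang, created) and the toy data are
hypothetical and do not reflect the released dataset's schema. Treating one
ordered sequence per concept as a cascade is a simplifying assumption, and
the topic dimension is left out entirely.

```python
from collections import defaultdict
from datetime import datetime


def build_cascades(articles):
    """Group article records by concept and order each group by creation time.

    `articles` is an iterable of (concept_id, lang, created_iso) tuples.
    These field names are hypothetical, chosen only for illustration.
    Returns a dict mapping concept_id to the list of language codes in the
    order in which the corresponding articles were created.
    """
    by_concept = defaultdict(list)
    for concept_id, lang, created_iso in articles:
        created = datetime.fromisoformat(created_iso)
        by_concept[concept_id].append((created, lang))

    cascades = {}
    for concept_id, entries in by_concept.items():
        entries.sort()  # earliest creation first
        cascades[concept_id] = [lang for _, lang in entries]
    return cascades


# Toy records, invented for illustration only.
sample = [
    ("Q1", "en", "2004-03-01T00:00:00"),
    ("Q1", "es", "2005-07-15T00:00:00"),
    ("Q1", "de", "2004-11-20T00:00:00"),
    ("Q2", "pt", "2010-01-05T00:00:00"),
]
cascades = build_cascades(sample)
```

With the toy records above, the concept "Q1" appears first in en, then de,
then es. The actual pipeline is in the code repository linked in the first
item.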

We hope that these datasets can be used by the research community to keep
working on understanding and modeling knowledge integrity in Wikipedia.

Currently we are working on expanding both datasets. For knowledge
propagation, we are characterizing the different types of cascades and
generating new prediction models. For the Wiki-Reliability dataset, we are
extending coverage to more languages.

If you have any questions about these datasets or related projects, please
feel free to contact me.

Best,

--

Diego Sáez Trumper

Senior Research Scientist

Wikimedia Foundation.