Hello everybody,
Within the context of the Knowledge Integrity program
<https://research.wikimedia.org/knowledge-integrity.html>, the Research
Team (and our formal collaborators
<https://www.mediawiki.org/wiki/Wikimedia_Research/Formal_collaborations>)
has been working on releasing relevant datasets in this area.
Recently we have published the following datasets:
-
Tracking Knowledge Propagation Across Wikipedia Languages: A dataset of
inter-language knowledge propagation in Wikipedia. Covering all 309
language editions and 33M articles, the dataset aims to track the full
propagation history of Wikipedia concepts and to enable follow-up research on
building predictive models of this propagation. For this purpose, we align all the
Wikipedia articles in a language-agnostic manner according to the concept
they cover, their topic, and the timestamp of each article creation, which
results in 13M propagation instances. (paper
<https://arxiv.org/abs/2103.16613>, dataset
<https://zenodo.org/record/4433137>, code
<https://github.com/rodolfovalentim/wikipedia-content-propagation>, meta
<https://meta.wikimedia.org/wiki/Research:Exploration_on_content_propagation_across_Wikimedia_projects>
)
-
Wiki-Reliability: A Large Scale Dataset for Content Reliability on
(English) Wikipedia: We selected the 10 most popular reliability-related
templates on English Wikipedia and proposed an effective method to label
almost 1M samples of Wikipedia article revisions as positive or negative
with respect to each template. Each positive/negative example in the
dataset comes with the full article text and 20 features from the
revision's metadata (paper <https://arxiv.org/abs/2105.04117>, dataset
<https://figshare.com/articles/dataset/Wiki-Reliability_A_Large_Scale_Dataset_for_Content_Reliability_on_Wikipedia/14113799>,
code <https://github.com/kay-wong/Wiki-Reliability/>, meta
<https://meta.wikimedia.org/wiki/Research:Wiki-Reliability:_A_Large_Scale_Dataset_for_Content_Reliability_on_Wikipedia>
).
We hope that these datasets can be used by the research community to keep
working on understanding and modeling knowledge integrity in Wikipedia.
We are currently working on expanding both datasets. For knowledge
propagation, we are characterizing the different types of cascades and
developing new prediction models. For the Wiki-Reliability dataset, we are
extending it to more languages.
If you have any questions about these datasets or related projects, please
feel free to contact me.
Best,
--
Diego Sáez Trumper
Senior Research Scientist
Wikimedia Foundation.