Announcing WikiConv dataset - Wiki-research-l

22 Jan 2019


Hi all,
We’re thrilled to announce the release of WikiConv—a multilingual corpus
reconstructing the complete conversational history of multiple Wikipedia
language editions
https://github.com/conversationai/wikidetox/tree/master/wikiconv.
The corpus, a collaboration between Jigsaw, Cornell, and the Wikimedia
Foundation, includes over 100M individual conversation threads and 300M
conversational actions extracted from the English, Chinese, German, Greek,
and Russian Wikipedia talk pages.
WikiConv can be used to understand and model conversational turns in online
collaborative spaces, as we showed in an earlier study on predicting when
conversations go awry (https://arxiv.org/abs/1805.05345).
The reconstruction methodology and its possible applications are
described in a paper by Hua et al. recently presented at EMNLP 2018
https://arxiv.org/abs/1810.13181. You can also watch a video presentation
of this work from the June 2018 Wikimedia Research Showcase
https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase#June_2018.
The corpus is released under CC0 (CC BY-SA for individual comments). All
the underlying code is available in this GitHub repository
https://github.com/conversationai/wikidetox/tree/master/wikiconv.
If you have any questions about the dataset, feel free to contact us at
yiqing@cs.cornell.edu.
Best,
Yiqing