Hi all,
We’re thrilled to announce the release of WikiConv, a multilingual corpus reconstructing the complete conversational history of multiple Wikipedia language editions: https://github.com/conversationai/wikidetox/tree/master/wikiconv
The corpus, a collaboration between Jigsaw, Cornell, and the Wikimedia Foundation, includes over 100M individual conversation threads and 300M conversational actions extracted from the English, Chinese, German, Greek, and Russian Wikipedia talk pages.
WikiConv can be used to understand and model conversational turns in online collaborative spaces, as we showed in an earlier study on predicting when conversations go awry https://arxiv.org/abs/1805.05345.
The reconstruction methodology and its possible applications are described in a paper by Hua et al. recently presented at EMNLP 2018 https://arxiv.org/abs/1810.13181. You can also watch a video presentation of this work from the June 2018 Wikimedia Research Showcase https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase#June_2018.
The corpus is released under CC0 (CC BY-SA for individual comments). All the underlying code is available in this GitHub repository https://github.com/conversationai/wikidetox/tree/master/wikiconv.
If you have any questions about the dataset, feel free to contact us at yiqing@cs.cornell.edu.
Best,
Yiqing
wiki-research-l@lists.wikimedia.org