Hi all,
We’re thrilled to announce the release of WikiConv—a multilingual corpus
reconstructing the complete conversational history of multiple Wikipedia
language editions
<https://github.com/conversationai/wikidetox/tree/master/wikiconv>.
The corpus, a collaboration between Jigsaw, Cornell, and the Wikimedia
Foundation, includes over 100M individual conversation threads and 300M
conversational actions extracted from the English, Chinese, German, Greek,
and Russian Wikipedia talk pages.
WikiConv can be used to understand and model conversational turns in online
collaborative spaces, as we showed in an earlier study on predicting when
conversations go awry <https://arxiv.org/abs/1805.05345>.
The reconstruction methodology, as well as its possible applications, is
described in a paper by Hua et al. recently presented at EMNLP 2018
<https://arxiv.org/abs/1810.13181>. You can also watch a video presentation
of this work from the June 2018 Wikimedia Research Showcase
<https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase#June_2018>.
The corpus is released under CC0 (CC BY-SA for individual comments). All
the underlying code is available in this GitHub repository
<https://github.com/conversationai/wikidetox/tree/master/wikiconv>.
If you have any questions about the dataset, feel free to contact us at
yiqing(a)cs.cornell.edu.
Best,
Yiqing