[Wiki-research-l] Announcing WikiConv dataset

21 Jan 2019

Hi all,

We’re thrilled to announce the release of WikiConv—a multilingual corpus
reconstructing the complete conversational history of multiple Wikipedia
language editions
<https://github.com/conversationai/wikidetox/tree/master/wikiconv>.

The corpus, a collaboration between Jigsaw, Cornell, and the Wikimedia
Foundation, includes over 100M individual conversation threads and 300M
conversational actions extracted from the English, Chinese, German, Greek,
and Russian Wikipedia talk pages.

WikiConv can be used to understand and model conversational turns in online
collaborative spaces, as we showed in an earlier study, predicting when
conversations go awry <https://arxiv.org/abs/1805.05345>.

The reconstruction methodology and its possible applications are
described in a paper by Hua et al. recently presented at EMNLP 2018
<https://arxiv.org/abs/1810.13181>. You can also watch a video presentation
of this work from the June 2018 Wikimedia Research Showcase
<https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase#June_2018>.

The corpus is released under CC0 (CC BY-SA for individual comments). All
the underlying code is available in this GitHub repository
<https://github.com/conversationai/wikidetox/tree/master/wikiconv>.

If you have any questions about the dataset, feel free to contact us at
yiqing(a)cs.cornell.edu.

Best,

Yiqing
