I am a researcher on a project that aims to collect the controversial
scientific discussions that took place around a set of wiki pages. We
want to start from these pages and collect their history (the successive
diffs), the discussions around them (including the history of the
discussion pages), and the discussion pages of all the authors who
participated (again with the history of those pages). After data
collection, we will build a structured corpus and run analyses on these
discussions.
However, we ran into a real problem when working with the wiki dumps:
some data seem to be missing. Here are the details.
I used the following French Wikipedia dumps:
"frwiki-20140208-pages-meta-history1.xml" (509 GB, which should contain
all pages together with their revision history)
"frwiki-20140208-pages-meta-current.xml" (19 GB, which contains the
current version of each page and of its discussion page)
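For reference, here is roughly how we extract a page and count its
revisions: since the files are far too large to load at once, we stream
the XML and match on the page title. This is only a minimal Python
sketch; the sample document and its namespace URI below are illustrative
stand-ins, not taken from the real dump.

```python
import io
import xml.etree.ElementTree as ET

def count_revisions(stream, title):
    """Stream a MediaWiki XML dump and count <revision> elements
    belonging to the page with the given title."""
    count = 0
    in_target = False
    for event, elem in ET.iterparse(stream, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # strip the export namespace
        if tag == "title":
            in_target = (elem.text == title)
        elif tag == "revision" and in_target:
            count += 1
        elif tag == "page":
            in_target = False
            elem.clear()  # free memory for pages already processed
    return count

# Illustrative sample; a real dump declares its own export namespace.
sample = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">
  <page>
    <title>Chiropratique</title>
    <revision><id>1</id></revision>
    <revision><id>2</id></revision>
  </page>
  <page>
    <title>Autre</title>
    <revision><id>3</id></revision>
  </page>
</mediawiki>"""

print(count_revisions(io.StringIO(sample), "Chiropratique"))  # prints 2
```

With this kind of streaming pass we count the revisions actually present
in the dump for a given page, which is how we noticed the discrepancies
described below.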
I ran into trouble with missing revisions and missing text:
Starting with the article on the French word "Chiropratique": its
on-wiki history shows more than 500 revisions, but when I extracted this
page from "frwiki-20140208-pages-meta-history1.xml", the extracted
history contained only 6 revisions (see the attached file
"page-Chiropratique.xml"), and they are not the most recent revisions:
they are the first six.
The same problem occurs for the user page "Utilisateur:Albi:n": its
on-wiki history shows 9 revisions, but I found only 5 of them in
"frwiki-20140208-pages-meta-history1.xml" (see the attached file).
I have another problem with "frwiki-20140208-pages-meta-current.xml":
I tried to extract a discussion page from this dump, and I did get its
last revision, but the text of the page is incomplete (see the attached
file "page-Discussion:Apple.xml").
Are these data really missing from the dumps, or did we miss something?
Is there a better way to collect the data we are looking for?
Thank you in advance for your help.
Laboratoire de Recherche sur le Langage (LRL)
Université Blaise Pascal (Clermont 2)
Tel : +33 3 4 73 34 68 35
Address: Université Blaise Pascal,
Maison des Sciences de l'Homme - LRL,
4 rue Ledru
63057 Clermont-Ferrand cedex 1