Hello,
I am a researcher on a project that aims to collect controversial scientific discussions that took place around a set of wiki pages. We therefore want to start from these pages and collect their history (the successive diffs), the discussions around them (including the history of those discussions), and the discussion pages of all authors who participated (with the history of those pages). After data collection, we will build a structured corpus and run analyses on these discussions.
However, we ran into a real problem when working with the wiki dumps: some data seem to be missing. Here are the details.
I used the following French Wikipedia dumps:
"frwiki-20140208-pages-meta-history1.xml" (509 GB, which has all pages and their history)
"frwiki-20140208-pages-meta-current.xml" (19 GB, which has the current version of each page and of each discussion page)
I ran into two problems, missing revisions and missing text:
*Missing revision* I started with the article "Chiropratique" at http://fr.wikipedia.org/wiki/Chiropratique. Its online history lists 500+ revisions, but when I extracted this page from "frwiki-20140208-pages-meta-history1.xml", its history contained only 6 revisions (see attached file "page-Chiropratique.xml"). These are not the most recent revisions; they are the first six.
The same problem occurs for the user page "Utilisateur:Albin" (http://fr.wikipedia.org/wiki/Utilisateur:Albin): its online history has 9 revisions, but I found only 5 revisions in "frwiki-20140208-pages-meta-history1.xml" (see attached file "page-Utilisateur:Albin.xml").
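For reference, the revision count for a given title can be checked along the following lines. This is only a minimal sketch, not necessarily the exact script used for the extraction above. The function names `local_name` and `count_revisions` are hypothetical, and the code assumes the usual MediaWiki export layout, in which `<title>` appears before the `<revision>` elements inside each `<page>`.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def count_revisions(source, wanted_title):
    """Stream an export file and count <revision> elements for one title.

    `source` is a file name or a binary file object.
    Assumption: <title> precedes the revisions inside each <page>.
    """
    current_title = None
    count = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            current_title = elem.text
        elif name == "revision":
            if current_title == wanted_title:
                count += 1
            elem.clear()  # drop the revision text to limit memory use
        elif name == "page":
            current_title = None
            elem.clear()
    return count
```

Calling `count_revisions("frwiki-20140208-pages-meta-history1.xml", "Chiropratique")` would then give the number of revisions stored for that title in that one file, which can be compared with the count shown on the online history page.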
*Missing text* I have another problem with "frwiki-20140208-pages-meta-current.xml". I tried to extract "Discussion:Apple" (http://fr.wikipedia.org/wiki/Discussion:Apple). In this dump I got the last revision, as expected, but the page has missing text (see attached file "page-Discussion:Apple.xml").
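One way to check this kind of case is to list the revisions of a title whose `<text>` element is empty or absent. The sketch below does only that, so it would not detect text that is present but truncated. The function names `local_name` and `empty_text_revisions` are hypothetical, and the code again assumes that `<title>` precedes the `<revision>` elements inside each `<page>`.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}text' -> 'text'."""
    return tag.rsplit("}", 1)[-1]


def empty_text_revisions(source, wanted_title):
    """Return the ids of revisions of one title with empty or absent <text>.

    `source` is a file name or a binary file object.
    """
    current_title = None
    empty_ids = []
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            current_title = elem.text
        elif name == "revision":
            if current_title == wanted_title:
                rev_id, text = None, None
                # Only direct children: the contributor's <id> is nested deeper.
                for child in elem:
                    child_name = local_name(child.tag)
                    if child_name == "id":
                        rev_id = child.text
                    elif child_name == "text":
                        text = child.text
                if not (text or "").strip():
                    empty_ids.append(rev_id)
            elem.clear()  # free the revision once inspected
        elif name == "page":
            current_title = None
            elem.clear()
    return empty_ids
```

For example, `empty_text_revisions("frwiki-20140208-pages-meta-current.xml", "Discussion:Apple")` would return the ids of any revisions of that page stored without text in that file.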
Are these data really missing from the dumps, or did we miss something? Is there a better way to collect the data we are seeking?
Thank you in advance for your cooperation.
--
Kun JIN
Laboratoire de Recherche sur le Langage (LRL)
Université Blaise Pascal (Clermont 2)
kun.jin@univ-bpclermont.fr
Tel: +33 3 4 73 34 68 35
Address: Université Blaise Pascal, Maison des Sciences de l'Homme - LRL, 4 rue Ledru 63057 Clermont-Ferrand cedex 1