Hello,
I am a researcher on a project that aims to collect controversial
scientific discussions that took place around a set of wiki pages.
We want to start from these pages and collect their history (the
successive diffs), the discussions around them (including the
history of those discussions), and the discussion pages of all
authors who participated (with the history of those pages). After
data collection, we will build a structured corpus and analyze
these discussions.
However, we ran into a real problem when working with the wiki dumps:
some data seem to be missing. Here are the details.
I used the following French Wikipedia dumps:
"frwiki-20140208-pages-meta-history1.xml" (509 GB, which has all
pages and their history)
"frwiki-20140208-pages-meta-current.xml" (19 GB, which has the
current version of each page and of each discussion page)
I ran into two problems: missing revisions and missing text.
Missing revisions
Take the article "Chiropratique" at
http://fr.wikipedia.org/wiki/Chiropratique
Its online history shows more than 500 revisions, but when I
extracted this page from "frwiki-20140208-pages-meta-history1.xml",
I found only 6 revisions (see attached file
"page-Chiropratique.xml"). They are the first six revisions, not the
most recent ones.
The same problem occurs for the user page "Utilisateur:Albin"
(http://fr.wikipedia.org/wiki/Utilisateur:Albin): its history has
9 revisions, but I found only 5 revisions in
"frwiki-20140208-pages-meta-history1.xml" (see attached file
"page-Utilisateur:Albin.xml").
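A check of this kind could be sketched as below. This is only a
minimal Python sketch under stated assumptions, not the exact script
I used: the names `count_revisions`, `local` and `SAMPLE` are
hypothetical, the sample XML is made up for illustration, and I
assume the usual dump layout of <page> elements that contain a
<title> followed by <revision> elements. The dump is streamed so that
a very large file does not have to fit in memory.

```python
import io
import xml.etree.ElementTree as ET

# Made-up miniature dump, for illustration only (not real dump content).
# The namespace URI is illustrative; the code ignores it.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">
  <page>
    <title>Chiropratique</title>
    <revision><id>1</id><text>a</text></revision>
    <revision><id>2</id><text>b</text></revision>
  </page>
  <page>
    <title>Other</title>
    <revision><id>3</id><text>c</text></revision>
  </page>
</mediawiki>"""


def local(tag):
    """Return the tag name without its XML namespace prefix."""
    return tag.rsplit("}", 1)[-1]


def count_revisions(source, wanted_title):
    """Stream a dump and count the <revision> elements of one page.

    `source` is a file name or a binary file-like object.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    count = 0
    title = None
    for event, elem in context:
        if event != "end":
            continue
        name = local(elem.tag)
        if name == "title":
            title = elem.text  # title closes before the page's revisions
        elif name == "revision":
            if title == wanted_title:
                count += 1
            elem.clear()  # drop the revision text as soon as it is counted
        elif name == "page":
            root.clear()  # discard finished pages to keep memory flat
    return count


if __name__ == "__main__":
    print(count_revisions(io.BytesIO(SAMPLE.encode("utf-8")), "Chiropratique"))
```

On a real dump, the file name can be passed directly, for example
count_revisions("frwiki-20140208-pages-meta-history1.xml",
"Chiropratique"), and the result compared with the number of
revisions shown on the online history page.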
Missing text
I have another problem with
"frwiki-20140208-pages-meta-current.xml". I tried to extract
"Discussion:Apple" (http://fr.wikipedia.org/wiki/Discussion:Apple).
In this dump I do get the latest revision, as expected, but part of
the page text is missing (see attached file
"page-Discussion:Apple.xml").
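To look for this second symptom across a whole dump, one could list
the revisions whose <text> element is empty or absent. Again this is
a hedged Python sketch, not a definitive tool: the names
`revisions_without_text`, `local` and `SAMPLE` are hypothetical, the
sample XML is made up, and the sketch only detects completely empty
text, not text that is truncated.

```python
import io
import xml.etree.ElementTree as ET

# Made-up miniature dump, for illustration only (not real dump content).
SAMPLE = """<mediawiki>
  <page>
    <title>Discussion:Apple</title>
    <revision><id>11</id><text /></revision>
  </page>
  <page>
    <title>Other</title>
    <revision><id>12</id><text>some text</text></revision>
  </page>
</mediawiki>"""


def local(tag):
    """Return the tag name without its XML namespace prefix."""
    return tag.rsplit("}", 1)[-1]


def revisions_without_text(source):
    """Yield (page title, revision id) for revisions with no text."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    title = None
    for event, elem in context:
        if event != "end":
            continue
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "revision":
            rev_id = None
            text = None
            # Only direct children: the contributor's own <id> is nested deeper.
            for child in elem:
                child_name = local(child.tag)
                if child_name == "id":
                    rev_id = child.text
                elif child_name == "text":
                    text = child.text
            if not (text or "").strip():
                yield title, rev_id
        elif name == "page":
            root.clear()  # discard finished pages to keep memory flat


if __name__ == "__main__":
    for page_title, rev_id in revisions_without_text(
        io.BytesIO(SAMPLE.encode("utf-8"))
    ):
        print(page_title, rev_id)
```

Any page reported by such a check can then be compared by hand with
its online version.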
Are these data really missing from the dumps, or did we miss
something?
Is there a better way to collect the data we are seeking?
Thank you in advance for your cooperation.
--
Kun JIN
Laboratoire de Recherche sur le Langage
(LRL)
Université Blaise Pascal (Clermont 2)
kun.jin@univ-bpclermont.fr
Tel : +33 3 4 73 34 68 35
Address: Université Blaise Pascal,
Maison des Sciences de l'Homme - LRL,
4 rue Ledru
63057 Clermont-Ferrand cedex 1