[Xmldatadumps-l] inter-page links in the data dump
Greg Morrison
gmorriso at seas.harvard.edu
Fri Sep 30 15:58:43 UTC 2011
For scientific research, I am interested in looking at the links
between pages on Wikipedia. I have been to
http://en.wikipedia.org/wiki/Wikipedia:Database_download
which suggested that the latest pages-articles dump is likely the one
people want. However, I'm unclear on some things.
(1) http://dumps.wikimedia.org/enwiki/latest/ has a lot of different
files, and I can't tell whether any of them contains only link
information. Is there a description of what each file contains?
(2) The enwiki-latest-pages-articles.xml file uncompresses to
31.55 GB. Is it correct that this contains the current snapshot of all
pages and articles in Wikipedia? (I only ask because this seems
small.)
(3) If I am constrained to use latest-pages-articles.xml, I'm unclear
on the method used to denote a link. It would appear that links are
denoted by [[link]] or [[link | word]]. Such patterns would be fairly
easy to find using Perl. However, I've noticed some odd cases, such
as
"[[File:WilliamGodwin.jpg|left|thumb|[[William Godwin]], "the
first to formulate ...... in his
work".<ref name="EB1910" />]]"
If I must search through the pages-articles file, and if the [[ ]]
notation is overloaded, is there a description of the patterns that
are used in this file? I.e., a way for me to ensure that I'm only
grabbing links, not figure captions or some other content.
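To make question (3) concrete, here is a rough Python sketch of the
kind of scan I have in mind. It is only my guess at the rules, not a
description of how MediaWiki actually parses wikitext. It assumes that
[[ ]] pairs can nest, that the link target is whatever precedes the
first "|", and that targets starting with "File:" or "Image:" are
media embeds rather than page links. The names extract_links and
SKIP_PREFIXES are made up for this illustration, and the prefix list is
almost certainly incomplete.

```python
# Assumed (and probably incomplete) list of prefixes that mark media
# embeds rather than links to pages. Compared case-insensitively.
SKIP_PREFIXES = ("file:", "image:")


def extract_links(wikitext):
    """Return the targets of [[...]] links found in wikitext.

    Nested brackets are tracked with a stack, so a link inside an
    image caption is still reported, while the enclosing File: embed
    is skipped. Targets are listed in the order their "]]" appears.
    """
    links = []
    stack = []  # offsets just past each currently open "[["
    i = 0
    n = len(wikitext)
    while i < n - 1:
        pair = wikitext[i:i + 2]
        if pair == "[[":
            stack.append(i + 2)
            i += 2
        elif pair == "]]" and stack:
            start = stack.pop()
            body = wikitext[start:i]
            # Everything before the first "|" is taken as the target.
            target = body.split("|", 1)[0].strip()
            if target and not target.lower().startswith(SKIP_PREFIXES):
                links.append(target)
            i += 2
        else:
            i += 1
    return links


sample = (
    'See [[anarchism]] and [[Pierre-Joseph Proudhon | Proudhon]]. '
    '[[File:WilliamGodwin.jpg|left|thumb|[[William Godwin]], '
    '"the first to formulate"]]'
)
print(extract_links(sample))
```

On this sample the File: embed is dropped but the [[William Godwin]]
link inside its caption is kept, which is the behaviour I think I
want. Whether other prefixes (categories, interlanguage links,
templates) need similar treatment is exactly what I am unsure about.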
Thanks for your help!