[Xmldatadumps-l] inter-page links in the data dump

Greg Morrison gmorriso at seas.harvard.edu
Mon Oct 3 21:18:40 UTC 2011


Hi all,

Thanks for the help!  The pagelinks.sql file sounds like exactly what I
need.  I've downloaded it and expanded it, and am trying to sort out
the contents.  Sorry to be dense here!  It appears that it's a
sequence of INSERT INTO `pagelinks` VALUES (n1,n2,name), where none of
these values are unique.  What do these elements correspond to?

For a quick test, the randomly chosen "Bely_Iyus_River" occurs three
times in the sql file, but the wiki page for Bely_Iyus_River has at
least 5 outgoing links.  So my guess is that each pagelinks tuple
(n1,n2,name) corresponds to an incoming link to the page called name.
Where's that link coming from, though?  How do I use n1 and n2 to
find the source?
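
In case it helps anyone reproduce the quick test, here is a rough sketch
of how the tuples can be pulled out of the file.  It assumes every tuple
really has the shape (integer,integer,'quoted name') described above,
with backslash-escaped quotes inside names; the function names and the
sample line are made up for illustration and are not real dump data.

```python
import re

# Assumed tuple shape, taken from the description above: (n1,n2,'name').
# Names are assumed to be single-quoted with backslash escapes inside.
TUPLE_RE = re.compile(r"\((\d+),(-?\d+),'((?:[^'\\]|\\.)*)'\)")


def iter_tuples(line):
    """Yield (n1, n2, name) for every tuple found in one INSERT INTO line."""
    for m in TUPLE_RE.finditer(line):
        # Simplified unescaping: drop the backslash, keep the next character.
        name = re.sub(r"\\(.)", r"\1", m.group(3))
        yield int(m.group(1)), int(m.group(2)), name


def rows_for_name(lines, wanted):
    """Collect every (n1, n2) pair whose name field equals `wanted`."""
    hits = []
    for line in lines:
        if line.startswith("INSERT INTO"):
            hits.extend(
                (n1, n2) for n1, n2, name in iter_tuples(line) if name == wanted
            )
    return hits


# Hypothetical sample line; the numbers are invented.
sample = [
    "INSERT INTO `pagelinks` VALUES "
    "(12,0,'Bely_Iyus_River'),(34,0,'Ob_River'),(56,0,'Bely_Iyus_River');",
]
print(rows_for_name(sample, "Bely_Iyus_River"))
```

Reading the real file line by line (for example with open(path,
errors="replace")) and passing the file object as `lines` should work
the same way, though I have only tried this on toy input.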

Thanks again!


> I am interested in looking at the links between webpages on wikipedia
> for scientific research.  I have been to
> http://en.wikipedia.org/wiki/Wikipedia:Database_download
> which suggested that the latest pages-articles is likely the one
> people want.  However, I'm unclear on some things.
>
> (1)  http://dumps.wikimedia.org/enwiki/latest/ has a lot of different
> files, and I can't tell whether one of them would contain
> only link information.  Is there a description of what each file
> contains?
> (2)  The enwiki-latest-pages-articles.xml file uncompresses as
> 31.55GB.  Is it correct that this contains the current snapshot of all
> pages and articles in wikipedia?  (I only ask because this seems
> small)
> (3)  If I am constrained to use latest-pages-articles.xml, I'm unclear
> on the method used to denote a link.  It would appear that links are
> denoted by [[link]] or [[link | word]].  Such patterns would be fairly
> easy to find using perl.  However, I've noticed some odd cases, such
> as
>
> "[[File:WilliamGodwin.jpg|left|thumb|[[William Godwin]], "the
> first to formulate ...... in his
> work".<ref name="EB1910" />]]"
>
> If I must search through the page-articles file, and if the [[ ]]
> notation is overloaded, is there a description of the patterns that
> are used in this file?  I.e. a way for me to ensure that I'm only
> grabbing links, not figure captions or some other content.
>
> Thanks for your help!
>
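
On the [[ ]] question in the quoted message: a rough sketch (in Python
rather than perl) of one way to cope with nested brackets is to track
the open [[ positions on a stack instead of using a single regular
expression.  The function name and the list of prefixes to skip are my
own assumptions, and this ignores templates, <nowiki> sections and
other markup, so treat it as a starting point only.

```python
def extract_links(text, skip_prefixes=("File:", "Image:")):
    """Return link targets from [[...]] spans, including nested ones.

    The target is the part before the first '|'.  Spans whose target
    starts with one of `skip_prefixes` (an assumed, incomplete list)
    are dropped, but links nested inside them are still reported.
    """
    targets = []
    stack = []  # offsets just past each currently open "[["
    i = 0
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair == "[[":
            stack.append(i + 2)
            i += 2
        elif pair == "]]" and stack:
            inner = text[stack.pop():i]
            target = inner.split("|", 1)[0].strip()
            if target and not target.startswith(skip_prefixes):
                targets.append(target)
            i += 2
        else:
            i += 1
    return targets


# Shortened version of the caption example above, plus a piped link.
sample = (
    '[[File:WilliamGodwin.jpg|left|thumb|[[William Godwin]], '
    '"the first to formulate"]] See also [[Anarchism | anarchist]] thought.'
)
print(extract_links(sample))  # ['William Godwin', 'Anarchism']
```

Inner links are reported before the span that encloses them, because a
span is only handled when its closing ]] is reached.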


