Hi all,
Thanks for the help! The pagelinks.sql file sounds like exactly what I need. I've downloaded and decompressed it, and am trying to make sense of the contents. Sorry to be dense here! It appears to be a sequence of INSERT INTO `pagelinks` VALUES (n1,n2,name) statements, where none of the three values is unique on its own. What do these elements correspond to?
For a quick test, the randomly chosen "Bely_Iyus_River" occurs three times in the SQL file, but the wiki page for Bely_Iyus_River has at least five outgoing links. So my guess is that each pagelinks tuple (n1,n2,name) corresponds to an incoming link to the page with that name. Where's that link coming from, though? How do I use n1 and n2 to find the source?
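In case it helps anyone answer, here is a minimal Python sketch of how I'm grouping the tuples by name so I can look at the n1 and n2 values side by side. The function name group_by_name and the numbers in the sample line are made up for illustration. The code makes no claim about what n1 and n2 mean; it only assumes each tuple looks like (integer,integer,'quoted string').

```python
import re
from collections import defaultdict

# One (n1,n2,'name') tuple. The name is a single-quoted SQL string that
# may contain backslash-escaped characters; it is kept as written (not unescaped).
TUPLE_RE = re.compile(r"\((\d+),(-?\d+),'((?:[^'\\]|\\.)*)'\)")

def group_by_name(lines):
    """Map each name to the list of (n1, n2) pairs that appear with it."""
    groups = defaultdict(list)
    for line in lines:
        # Skip comments, CREATE TABLE statements and anything else.
        if not line.startswith("INSERT INTO"):
            continue
        for n1, n2, name in TUPLE_RE.findall(line):
            groups[name].append((int(n1), int(n2)))
    return groups

# Made-up sample line in the same shape as the dump (values are NOT real).
sample = [
    "INSERT INTO `pagelinks` VALUES (12,0,'Bely_Iyus_River'),"
    "(34,0,'Bely_Iyus_River'),(12,0,'Some_Other_Page');"
]
groups = group_by_name(sample)
print(groups["Bely_Iyus_River"])  # [(12, 0), (34, 0)]
```

On the real file I would pass an open file object instead of the sample list, since the function only iterates line by line; I'm assuming something like open(path, errors="replace") is needed to get past odd bytes in the names.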
Thanks again!
I am interested in looking at the links between pages on Wikipedia for scientific research. I have been to http://en.wikipedia.org/wiki/Wikipedia:Database_download, which suggests that the latest pages-articles dump is likely the one people want. However, I'm unclear on some things.
(1) http://dumps.wikimedia.org/enwiki/latest/ has a lot of different files, and I can't tell whether one of them contains only link information. Is there a description of what each file contains?
(2) The enwiki-latest-pages-articles.xml file uncompresses to 31.55 GB. Is it correct that this contains the current snapshot of all pages and articles in Wikipedia? (I only ask because this seems small.)
(3) If I am constrained to use latest-pages-articles.xml, I'm unclear on the method used to denote a link. It would appear that links are denoted by [[link]] or [[link | word]]. Such patterns would be fairly easy to find using Perl. However, I've noticed some odd cases, such as
"[[File:WilliamGodwin.jpg|left|thumb|[[William Godwin]], "the first to formulate ...... in his work".<ref name="EB1910" />]]"
If I must search through the pages-articles file, and if the [[ ]] notation is overloaded, is there a description of the patterns that are used in this file? That is, I need a way to ensure that I'm only grabbing links, not figure captions or some other content.
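To make the question concrete, here is a rough Python sketch of the kind of filter I have in mind. It matches only innermost [[target]] or [[target|label]] patterns and then drops targets that start with certain prefixes. The prefix list (file:, image:, category:) is my own guess and surely incomplete, the function name article_links is made up, and Example.jpg is an invented file name used only for the test string.

```python
import re

# Innermost [[target]] or [[target|label]]: neither part may contain
# square brackets, so an outer [[File:...[[inner]]...]] wrapper never matches.
WIKILINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")

# ASSUMPTION: a guessed, incomplete list of prefixes for non-article targets.
SKIP_PREFIXES = ("file:", "image:", "category:")

def article_links(wikitext):
    """Return link targets in order of appearance, skipping guessed prefixes."""
    targets = []
    for match in WIKILINK_RE.finditer(wikitext):
        target = match.group(1).strip()
        if target.lower().startswith(SKIP_PREFIXES):
            continue
        targets.append(target)
    return targets

text = (
    '[[File:WilliamGodwin.jpg|left|thumb|[[William Godwin]], '
    '"the first to formulate"]] and [[link | word]] '
    'and [[File:Example.jpg|thumb|plain caption]]'
)
print(article_links(text))  # ['William Godwin', 'link']
```

Note that this still counts [[William Godwin]] inside the figure caption as a link, which may or may not be what I want, and it does nothing about templates or other markup.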
Thanks for your help!