Hi,
(1) http://dumps.wikimedia.org/enwiki/latest/ has a lot of different files, and I can't tell whether any of them contains only link information. Is there a description of what each file contains?
A short description of each file is at the dated version of the page (the latest right now is http://dumps.wikimedia.org/enwiki/20110901/).
(2) The enwiki-latest-pages-articles.xml file uncompresses to 31.55 GB. Is it correct that this contains the current snapshot of all pages and articles in Wikipedia? (I only ask because this seems small.)
It does contain all articles in the English Wikipedia. But it doesn't contain all pages. For example, talk pages and user pages are missing from it.
(3) If I am constrained to use latest-pages-articles.xml, I'm unclear on how links are denoted. It appears that links are written as [[link]] or [[link | word]]. Such patterns would be fairly easy to find using Perl. However, I've noticed some odd cases, such as
"[[File:WilliamGodwin.jpg|left|thumb|[[William Godwin]], "the first to formulate ...... in his work".<refname="EB1910" />]]"
If I must search through the pages-articles file, and the [[ ]] notation is overloaded, is there a description of the patterns used in this file? I.e., a way for me to ensure that I'm only grabbing links, not figure captions or other content.
The file contains the same wikitext you see when you edit an article, so finding ordinary links isn't quite as simple as matching [[ ]]. You also won't find links that are generated by templates this way (which you may or may not want). If you want all page-to-page links, you can download the pagelinks.sql.gz file instead; it's not XML, though, but a dump of a MySQL table.
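To illustrate what I mean (a rough sketch of my own, not anything from the dumps documentation): a short Perl filter over a piece of wikitext could look roughly like this. The innermost-brackets trick and the namespace list are simplifications you'd need to adjust, and it still misses anything produced by templates.

#!/usr/bin/perl
use strict;
use warnings;

# Rough sketch: print link targets found in wikitext read from STDIN.
# Only "innermost" [[...]] pairs are matched, so the [[William Godwin]]
# link inside the File: caption above is still found, while the File:
# link itself is filtered out below.
my $text = do { local $/; <STDIN> };

while ($text =~ /\[\[([^\[\]]+)\]\]/g) {
    my ($target) = split /\|/, $1;                 # drop the "| word" part
    $target =~ s/^\s+|\s+$//g;                     # trim whitespace
    next if $target =~ /^(File|Image|Category|Media)\s*:/i;  # not article links
    next if $target =~ /^#/;                       # same-page section link
    print "$target\n";
}

You would feed it the contents of each <text> element (e.g. after running the dump through a streaming XML parser), and even then links coming from template expansion won't show up. That's why pagelinks.sql.gz is the more reliable source for the complete link graph; you can load it into MySQL with something like zcat pagelinks.sql.gz | mysql -u user -p wikidb and query the pagelinks table directly.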
Petr Onderka [[User:Svick]]