I am interested in looking at the links between webpages on wikipedia for scientific research. I have been to http://en.wikipedia.org/wiki/Wikipedia:Database_download which suggested that the latest pages-articles is likely the one people want. However, I'm unclear on some things.
(1) http://dumps.wikimedia.org/enwiki/latest/ has a lot of different files, and I can't actually tell if one of them would actually contain only link information. Is there a description of what each file contains?
(2) The enwiki-latest-pages-articles.xml file uncompresses as 31.55GB. Is it correct that this contains the current snapshot of all pages and articles in wikipedia? (I only ask because this seems small)
(3) If I am constrained to use latest-pages-articles.xml, I'm unclear on the method used to denote a link. It would appear that links are denoted by [[link]] or [[link | word]]. Such patterns would be fairly easy to find using perl. However, I've noticed some odd cases, such as
"[[File:WilliamGodwin.jpg|left|thumb|[[William Godwin]], "the first to formulate ...... in his work".<ref name="EB1910" />]]"
If I must search through the page-articles file, and if the [[ ]] notation is overloaded, is there a description of the patterns that are used in this file? I.e. a way for me to ensure that I'm only grabbing links, not figure captions or some other content.
Thanks for your help!
Hi,
(1) http://dumps.wikimedia.org/enwiki/latest/ has a lot of different files, and I can't actually tell if one of them would actually contain only link information. Is there a description of what each file contains?
A short description of each file is at the dated version of the page (the latest right now is http://dumps.wikimedia.org/enwiki/20110901/).
(2) The enwiki-latest-pages-articles.xml file uncompresses as 31.55GB. Is it correct that this contains the current snapshot of all pages and articles in wikipedia? (I only ask because this seems small)
It does contain all articles in the English Wikipedia. But it doesn't contain all pages. For example, talk pages and user pages are missing from it.
(3) If I am constrained to use latest-pages-articles.xml, I'm unclear on the method used to denote a link. It would appear that links are denoted by [[link]] or [[link | word]]. Such patterns would be fairly easy to find using perl. However, I've noticed some odd cases, such as
"[[File:WilliamGodwin.jpg|left|thumb|[[William Godwin]], "the first to formulate ...... in his work".<ref name="EB1910" />]]"
If I must search through the page-articles file, and if the [[ ]] notation is overloaded, is there a description of the patterns that are used in this file? I.e. a way for me to ensure that I'm only grabbing links, not figure captions or some other content.
The file format is the same wikitext you see when you edit an article. That means finding normal links is not that simple. And you won't find links contained in templates this way (which you may or may not want). If you want to get all page-to-page links, you can download the pagelinks.sql.gz file, although it's not XML but a dump of a MySQL table.
Petr Onderka [[User:Svick]]
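For anyone who still wants to pull links out of the wikitext itself, here is a minimal sketch in Python of one way to cope with the nested [[File:...]] case quoted above. It is only an illustration, not how MediaWiki parses links: the regular expression, the extract_links name, the list of skipped prefixes, and the sentence appended to the sample are all assumptions made for this example, and links produced by templates are still missed.

```python
import re

# Matches only innermost [[target]] or [[target|label]] pairs: neither part
# may contain a bracket, so an enclosing [[File:...|caption with [[link]]]]
# is never matched as a whole, while the link inside its caption still is.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")

# Prefixes treated as "not an article link" in this sketch (an assumption;
# adjust to the namespaces you care about).
SKIP_PREFIXES = ("file:", "image:", "category:")


def extract_links(wikitext):
    """Return article link targets written directly in the wikitext."""
    targets = []
    for match in LINK_RE.finditer(wikitext):
        # Drop any #section anchor and surrounding whitespace.
        target = match.group(1).split("#", 1)[0].strip()
        if not target or target.lower().startswith(SKIP_PREFIXES):
            continue
        targets.append(target)
    return targets


# The first part is the example from the question; the trailing sentence
# is invented for illustration.
sample = (
    '[[File:WilliamGodwin.jpg|left|thumb|[[William Godwin]], "the first to '
    'formulate ...... in his work".<ref name="EB1910" />]] He influenced '
    "[[anarchism|anarchist]] thought and [[Mary Shelley#Life]]."
)
print(extract_links(sample))  # ['William Godwin', 'anarchism', 'Mary Shelley']
```

The link inside the image caption is kept on purpose, since it is a real link on the page; only the File: wrapper is ignored.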
Hi all,
Thanks for the help! The pagelinks.sql file sounds like exactly what I need. I've downloaded it and expanded it, and am trying to sort out the contents. Sorry to be dense here! It appears that it's a sequence of INSERT INTO `pagelinks` VALUES (n1,n2,name), where none of these values are unique. What do these elements correspond to?
For a quick test, the randomly chosen "Bely_Iyus_River" occurs three times in the sql file, but the wiki page for Bely_Iyus_River has at least 5 outgoing links. So my guess is that the pagelinks table element (n1,n2,name) corresponds to an incoming link to the file name. Where's that link coming from, though? How do I use n1 and n2 to find the source?
Thanks again!
Greg Morrison, 03/10/2011 23:18:
Thanks for the help! The pagelinks.sql file sounds like exactly what I need. I've downloaded it and expanded it, and am trying to sort out the contents.
I don't know anything about this, but that's just a SQL dump of the table, so I hope the documentation here is relevant: http://www.mediawiki.org/wiki/Manual:Pagelinks_table
Nemo
On 03/10/11 23:18, Greg Morrison wrote:
Hi all,
Thanks for the help! The pagelinks.sql file sounds like exactly what I need. I've downloaded it and expanded it, and am trying to sort out the contents. Sorry to be dense here! It appears that it's a sequence of INSERT INTO `pagelinks` VALUES (n1,n2,name), where none of these values are unique. What do these elements correspond to?
For a quick test, the randomly chosen "Bely_Iyus_River" occurs three times in the sql file, but the wiki page for Bely_Iyus_River has at least 5 outgoing links. So my guess is that the pagelinks table element (n1,n2,name) corresponds to an incoming link to the file name. Where's that link coming from, though? How do I use n1 and n2 to find the source?
Thanks again!
You probably already found out by yourself, but just in case, and keeping for the record:
The pages are identified in the database by a tuple (namespace, title). So (0, 'Foo') is the article [[Foo]] but (1, 'Foo') is [[Talk:Foo]].
In your question above, (n2, name) is the page the link *points to* (i.e. an outgoing link to a possibly missing page). n1 is the page_id of the page containing that link. You need the page.sql file to find out which (page_namespace, page_title) n1 corresponds to.
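The lookup described above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not a robust SQL parser: it assumes each pagelinks row has exactly the three values (n1,n2,name), that each page.sql row starts with (page_id, page_namespace, page_title) followed by further columns, and that titles use backslash escaping. The function names and the two sample INSERT lines are invented for this example.

```python
import re

# One (n1, n2, name) tuple: (source page_id, target namespace, target title).
PAGELINK_RE = re.compile(r"\((\d+),(-?\d+),'((?:[^'\\]|\\.)*)'\)")
# Start of a page row; assumes the first three columns are
# (page_id, page_namespace, page_title), followed by more columns.
PAGE_RE = re.compile(r"\((\d+),(-?\d+),'((?:[^'\\]|\\.)*)',")


def unescape(title):
    """Undo backslash escaping of quotes and backslashes in a title."""
    return re.sub(r"\\(.)", r"\1", title)


def parse_pages(line):
    """Map page_id -> (namespace, title) for one INSERT line of page.sql."""
    return {int(m[1]): (int(m[2]), unescape(m[3])) for m in PAGE_RE.finditer(line)}


def parse_pagelinks(line):
    """Yield (source_id, target_namespace, target_title) from one INSERT line."""
    for m in PAGELINK_RE.finditer(line):
        yield int(m[1]), int(m[2]), unescape(m[3])


# Invented sample rows, shaped like the dump lines described in this thread.
page_line = "INSERT INTO `page` VALUES (10,0,'Bely_Iyus_River','',0,0),(11,0,'Chulym_River','',0,0);"
links_line = "INSERT INTO `pagelinks` VALUES (10,0,'Chulym_River'),(11,0,'Bely_Iyus_River'),(11,0,'Ob_River');"

pages = parse_pages(page_line)
outgoing = {}
for source_id, ns, title in parse_pagelinks(links_line):
    source = pages.get(source_id)  # None if the source id is not in page.sql
    if source is not None:
        outgoing.setdefault(source, []).append((ns, title))

print(outgoing[(0, "Bely_Iyus_River")])  # [(0, 'Chulym_River')]
```

In this sample, the row that mentions 'Bely_Iyus_River' by name is a link pointing to that page (from page 11), while its own outgoing links are the rows whose n1 is its page_id (10). For the full enwiki dump the page map is large, so loading both tables into a database may be more practical than keeping a dict in memory.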
If you use the file pagelinks.sql.gz you'll get the links without having to search through all the text of Wikipedia.
You can find the table relationships here: http://www.mediawiki.org/wiki/File:MediaWiki_database_schema_1-17_%28r82044%...
BR,
/Fluff
http://download.wikimedia.org/svwiki/20110920/svwiki-20110920-pagelinks.sql.gz
On Fri, Sep 30, 2011 at 5:58 PM, Greg Morrison gmorriso@seas.harvard.edu wrote:
I am interested in looking at the links between webpages on wikipedia for scientific research. I have been to http://en.wikipedia.org/wiki/Wikipedia:Database_download which suggested that the latest pages-articles is likely the one people want. However, I'm unclear on some things.
(1) http://dumps.wikimedia.org/enwiki/latest/ has a lot of different files, and I can't actually tell if one of them would actually contain only link information. Is there a description of what each file contains? (2) The enwiki-latest-pages-articles.xml file uncompresses as 31.55GB. Is it correct that this contains the current snapshot of all pages and articles in wikipedia? (I only ask because this seems small) (3) If I am constrained to use latest-pages-articles.xml, I'm unclear on the method used to denote a link. It would appear that links are denoted by [[link]] or [[link | word]]. Such patterns would be fairly easy to find using perl. However, I've noticed some odd cases, such as
"[[File:WilliamGodwin.jpg|left|thumb|[[William Godwin]], "the first to formulate ...... in his work".<ref name="EB1910" />]]"
If I must search through the page-articles file, and if the [[ ]] notation is overloaded, is there a description of the patterns that are used in this file? I.e. a way for me to ensure that I'm only grabbing links, not figure captions or some other content.
Thanks for your help!
Xmldatadumps-l mailing list Xmldatadumps-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l