If you use the file pagelinks.sql.gz, you'll get the links directly, without having to search through all the text of Wikipedia.
You can find the table relationships here: http://www.mediawiki.org/wiki/File:MediaWiki_database_schema_1-17_%28r82044%...
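To give you an idea of what's in that file: each row of the pagelinks table is (pl_from, pl_namespace, pl_title), i.e. the page ID of the linking page and the namespace and title of the link target. A minimal Python sketch for pulling rows out of the gzipped SQL dump (the quoting rules here are simplified; a real parser would need to handle every MySQL escape sequence):

```python
import gzip
import re

# One pagelinks row (MediaWiki 1.17-era schema):
# (pl_from, pl_namespace, 'pl_title') with backslash-escaped quotes
# allowed inside the title string.
ROW_RE = re.compile(r"\((\d+),(-?\d+),'((?:[^'\\]|\\.)*)'\)")

def parse_pagelinks_line(line):
    """Yield (pl_from, pl_namespace, pl_title) tuples from one
    INSERT INTO `pagelinks` VALUES (...),(...); line of the dump."""
    if not line.startswith("INSERT INTO"):
        return
    for pl_from, ns, title in ROW_RE.findall(line):
        # Undo the most common escapes; a full parser would cover
        # all of MySQL's escape sequences.
        title = title.replace("\\'", "'").replace('\\"', '"')
        yield int(pl_from), int(ns), title

# On a real dump (file name is just an example):
# with gzip.open("svwiki-20110920-pagelinks.sql.gz", "rt",
#                encoding="utf-8", errors="replace") as f:
#     for line in f:
#         for pl_from, ns, title in parse_pagelinks_line(line):
#             ...

sample = "INSERT INTO `pagelinks` VALUES (1,0,'Stockholm'),(2,0,'O\\'Neill');"
print(list(parse_pagelinks_line(sample)))
# → [(1, 0, 'Stockholm'), (2, 0, "O'Neill")]
```

Note that pl_from is a page ID, so you'll also want the page.sql.gz dump to map IDs back to titles.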
BR,
/Fluff

http://download.wikimedia.org/svwiki/20110920/svwiki-20110920-pagelinks.sql.gz

On Fri, Sep 30, 2011 at 5:58 PM, Greg Morrison <gmorriso@seas.harvard.edu> wrote:
I am interested in looking at the links between pages on Wikipedia for scientific research. I have been to http://en.wikipedia.org/wiki/Wikipedia:Database_download, which suggested that the latest pages-articles dump is likely the one people want. However, I'm unclear on some things.
(1) http://dumps.wikimedia.org/enwiki/latest/ has a lot of different files, and I can't tell whether one of them contains only link information. Is there a description of what each file contains?

(2) The enwiki-latest-pages-articles.xml file uncompresses to 31.55 GB. Is it correct that this contains the current snapshot of all pages and articles on Wikipedia? (I only ask because this seems small.)

(3) If I am constrained to use latest-pages-articles.xml, I'm unclear on the method used to denote a link. It would appear that links are denoted by [[link]] or [[link | word]]. Such patterns would be fairly easy to find using perl. However, I've noticed some odd cases, such as
"[[File:WilliamGodwin.jpg|left|thumb|[[William Godwin]], "the first to formulate ...... in his work".<ref name="EB1910" />]]"
If I must search through the pages-articles file, and the [[ ]] notation is overloaded, is there a description of the patterns used in this file? I.e., a way for me to ensure that I'm only grabbing links, not figure captions or some other content.
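(If you do end up regex-matching the wikitext yourself, one workable trick for cases like the example above is to match only the innermost [[...]] spans, which contain no nested brackets, and then discard targets in non-article namespaces. A minimal sketch; the namespace prefix list is partial and assumed, and real wikitext has further corner cases such as templates and interwiki links:)

```python
import re

# Innermost [[target]] or [[target|label]] spans contain no nested
# brackets, so a [[File:...|...[[Real link]]...]] caption yields only
# the inner [[Real link]].
LINK_RE = re.compile(r"\[\[([^\[\]\|]+)(?:\|[^\[\]]*)?\]\]")

# Assumed, partial list of prefixes whose [[..]] spans are media or
# category markup rather than ordinary article links.
SKIP_PREFIXES = ("File:", "Image:", "Category:", "Media:")

def extract_links(wikitext):
    """Return link targets from a wikitext fragment, skipping
    media/category markup and dropping #section fragments."""
    links = []
    for target in LINK_RE.findall(wikitext):
        target = target.strip()
        if target.startswith(SKIP_PREFIXES):
            continue
        # "Target#Section" links point at the page "Target".
        links.append(target.split("#", 1)[0])
    return links

sample = ('[[File:WilliamGodwin.jpg|left|thumb|[[William Godwin]], '
          '"the first to formulate ..."<ref name="EB1910" />]] '
          'See also [[Anarchism|anarchist]] thought.')
print(extract_links(sample))
# → ['William Godwin', 'Anarchism']
```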
Thanks for your help!
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l