Hi,
(1) http://dumps.wikimedia.org/enwiki/latest/ has a lot of different files, and I can't tell whether any of them contains only link information. Is there a description of what each file contains?
A short description of each file is at the dated version of the page (the latest right now is http://dumps.wikimedia.org/enwiki/20110901/).
(2) The enwiki-latest-pages-articles.xml file uncompresses to 31.55 GB. Is it correct that this contains the current snapshot of all pages and articles in Wikipedia? (I only ask because this seems small.)
It does contain all articles in the English Wikipedia. But it doesn't contain all pages. For example, talk pages and user pages are missing from it.
(3) If I am constrained to use latest-pages-articles.xml, I'm unclear on how links are denoted. It appears that links are written as [[link]] or [[link | word]]. Such patterns would be fairly easy to find using Perl. However, I've noticed some odd cases, such as
"[[File:WilliamGodwin.jpg|left|thumb|[[William Godwin]], "the first to formulate ...... in his work".<refname="EB1910" />]]"
If I must search through the pages-articles file, and the [[ ]] notation is overloaded, is there a description of the patterns used in this file? I.e., a way for me to ensure that I'm only grabbing links, not figure captions or other content.
The file contains the same wikitext you see when you edit an article, so finding ordinary links isn't quite as simple as matching [[ ]]. You also won't find links that are generated by templates this way (which you may or may not want). If you want all page-to-page links, you can download the pagelinks.sql.gz file instead; it's not XML, though, but a dump of a MySQL table.
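To illustrate what I mean (a rough sketch of my own, not anything from the dumps documentation): a short Perl filter over a piece of wikitext could look roughly like this. The innermost-brackets trick and the namespace list are simplifications you'd need to adjust, and it still misses anything produced by templates.

#!/usr/bin/perl
use strict;
use warnings;

# Rough sketch: print link targets found in wikitext read from STDIN.
# Only "innermost" [[...]] pairs are matched, so the [[William Godwin]]
# link inside the File: caption above is still found, while the File:
# link itself is filtered out below.
my $text = do { local $/; <STDIN> };

while ($text =~ /\[\[([^\[\]]+)\]\]/g) {
    my ($target) = split /\|/, $1;                 # drop the "| word" part
    $target =~ s/^\s+|\s+$//g;                     # trim whitespace
    next if $target =~ /^(File|Image|Category|Media)\s*:/i;  # not article links
    next if $target =~ /^#/;                       # same-page section link
    print "$target\n";
}

You would feed it the contents of each <text> element (e.g. after running the dump through a streaming XML parser), and even then links coming from template expansion won't show up. That's why pagelinks.sql.gz is the more reliable source for the complete link graph; you can load it into MySQL with something like zcat pagelinks.sql.gz | mysql -u user -p wikidb and query the pagelinks table directly.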
Petr Onderka [[User:Svick]]