[Xmldatadumps-l] inter-page links in the data dump
Fluff Wikipedia
fluff.svwp at gmail.com
Fri Oct 7 10:25:24 UTC 2011
If you use the file pagelinks.sql.gz, you get the links directly, without
having to search through all the text of Wikipedia.
You can find the table relationships here:
http://www.mediawiki.org/wiki/File:MediaWiki_database_schema_1-17_%28r82044%29.png
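A minimal Python sketch of reading that file without importing it into MySQL
first. It assumes the three-column pagelinks layout shown in the schema above
(pl_from, pl_namespace, pl_title) and the extended-INSERT format that mysqldump
writes; the function names are only illustrative, and newer dumps may carry
additional columns, in which case the pattern needs adjusting.

```python
import gzip
import re

# One row of an extended INSERT: (pl_from, pl_namespace, 'pl_title').
# Assumption: the three-column pagelinks layout of the 2011-era schema.
ROW_RE = re.compile(r"\((\d+),(-?\d+),'((?:[^'\\]|\\.)*)'\)")
# Undo backslash escaping of quotes and backslashes inside a quoted title.
# Simplification: other escapes such as \n are not expected in page titles.
UNESCAPE_RE = re.compile(r"\\(.)")


def parse_insert_line(line):
    """Yield (pl_from, pl_namespace, pl_title) tuples from one INSERT line."""
    if not line.startswith("INSERT INTO"):
        return
    for m in ROW_RE.finditer(line):
        title = UNESCAPE_RE.sub(r"\1", m.group(3))
        yield int(m.group(1)), int(m.group(2)), title


def iter_pagelinks(path):
    """Stream rows from a pagelinks.sql.gz file without loading it whole."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            yield from parse_insert_line(line)
```

For example, `for src, ns, title in iter_pagelinks("svwiki-20110920-pagelinks.sql.gz")`
walks every link. Here pl_from is the page ID of the linking page, while the
target is given as a namespace number plus a title, so mapping targets to page
IDs needs the page table (page.sql.gz) as well.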
BR,
/Fluff
<http://download.wikimedia.org/svwiki/20110920/svwiki-20110920-pagelinks.sql.gz>
On Fri, Sep 30, 2011 at 5:58 PM, Greg Morrison <gmorriso at seas.harvard.edu> wrote:
> I am interested in looking at the links between webpages on wikipedia
> for scientific research. I have been to
> http://en.wikipedia.org/wiki/Wikipedia:Database_download
> which suggested that the latest pages-articles is likely the one
> people want. However, I'm unclear on some things.
>
> (1) http://dumps.wikimedia.org/enwiki/latest/ has a lot of different
> files, and I can't actually tell if one of them would actually contain
> only link information. Is there a description of what each file
> contains?
> (2) The enwiki-latest-pages-articles.xml file uncompresses as
> 31.55GB. Is it correct that this contains the current snapshot of all
> pages and articles in wikipedia? (I only ask because this seems
> small)
> (3) If I am constrained to use latest-pages-articles.xml, I'm unclear
> on the method used to denote a link. It would appear that links are
> denoted by [[link]] or [[link | word]]. Such patterns would be fairly
> easy to find using perl. However, I've noticed some odd cases, such
> as
>
> "[[File:WilliamGodwin.jpg|left|thumb|[[William Godwin]], "the
> first to formulate ...... in his
> work".<ref name="EB1910" />]]"
>
> If I must search through the page-articles file, and if the [[ ]]
> notation is overloaded, is there a description of the patterns that
> are used in this file? I.e. a way for me to ensure that I'm only
> grabbing links, not figure captions or some other content.
>
> Thanks for your help!
>
> _______________________________________________
> Xmldatadumps-l mailing list
> Xmldatadumps-l at lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
>