Re: [Xmldatadumps-l] Matching and retrieving List pages

18 Oct 2013

In the pages-articles dump, you shouldn't look only at the content of
the pages; you should look at the title tag. That's how you reliably
find the title of the page that a given piece of content belongs to.

Alternatively, you could use page ids instead of titles (look for the
id tag in the XML dump).
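
As a rough illustration of the two options above, here is a minimal Python
sketch that streams the dump and keeps only pages whose title or id is
in a wanted set. The function name, the sample XML, and its ids are made up
for the example; the only assumptions about the dump are that each page
element has title and id children and that the text sits inside revision.
Underscores in wanted titles are turned into spaces, since titles in the
XML dump use spaces.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_matching_pages(source, wanted_titles=(), wanted_ids=()):
    """Yield (page_id, title, text) for pages whose title or id is wanted.

    Hypothetical helper for illustration, not part of any dump tooling.
    """
    # Titles in the XML dump use spaces, so normalise underscores first.
    wanted_titles = {t.replace("_", " ") for t in wanted_titles}
    wanted_ids = set(wanted_ids)
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title = page_id = text = None
        for child in elem:  # direct children only, so revision ids are skipped
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "id":
                page_id = int(child.text)
            elif name == "revision":
                for sub in child:
                    if local_name(sub.tag) == "text":
                        text = sub.text or ""
        if title in wanted_titles or page_id in wanted_ids:
            yield page_id, title, text
        elem.clear()  # drop this page's children once it has been handled


# Tiny made-up sample in the shape of a pages-articles dump.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">
  <page>
    <title>List of the longest Asian rivers</title>
    <ns>0</ns>
    <id>101</id>
    <revision><id>9001</id><text>river list wikitext</text></revision>
  </page>
  <page>
    <title>Some other page</title>
    <ns>0</ns>
    <id>102</id>
    <revision><id>9002</id><text>other wikitext</text></revision>
  </page>
</mediawiki>"""

if __name__ == "__main__":
    wanted = {"List of the longest Asian rivers"}
    stream = io.BytesIO(SAMPLE.encode("utf-8"))
    for page_id, title, text in iter_matching_pages(stream, wanted_titles=wanted):
        print(page_id, title, len(text))
```

For the real dump, the same function could be given an open file (for
example one opened with bz2.open on the compressed dump) instead of the
in-memory sample.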

Petr Onderka
[[en:User:Svick]]

On Fri, Oct 18, 2013 at 10:02 PM, Peyman Faratin <peyman(a)robustlinks.com> wrote:
...
  Hi

 I have a set of list page titles that I've extracted from the Category dump
 (where "cl_type" is "page")

 http://www.mediawiki.org/wiki/Manual:Categorylinks_table

 Now I want to extract the CONTENT of the page from the pages dump

 enwiki-latest-pages-articles.xml

 Although there are guidelines on how editors should mark these pages

 http://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Lists

 ("The titles of list articles typically begin with the type of list it is
 (List of, Index of, etc.), followed by the article's subject; like: List of
 vegetable oils.")

 Most of the time the above rule is not followed. So my concrete
 question is:

 - If I am consuming the pages-articles.xml dump (D) page by page, and I have
 a list of pages (L) that I've extracted from the category links dump (C), then
 how can I check that a page in dump file D is a member of L? The titles
 do not resolve the names.

 For instance, if I have the page title "List of the longest Asian rivers"
 (http://en.wikipedia.org/wiki/List_of_the_longest_Asian_rivers) then what in
 that page's content

(http://en.wikipedia.org/w/index.php?title=List_of_the_longest_Asian_rivers&…)
 can tell me it is the same page "List of the longest Asian rivers"?
 Non-list pages appear to place the title as the first token, with ''' markings.

 Any suggestions of a robust solution would be much appreciated.

 Best

 _______________________________________________
 Xmldatadumps-l mailing list
 Xmldatadumps-l(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
