Matching and retrieving List pages - Xmldatadumps-l

18 Oct 2013

Hi

I have a set of list page titles that I've extracted from the category links dump
(rows where "cl_from" refers to a page, i.e. "cl_type" is "page"):

http://www.mediawiki.org/wiki/Manual:Categorylinks_table

Now I want to extract the CONTENT of each of these pages from the pages dump:

enwiki-latest-pages-articles.xml

Although there are guidelines on how editors should mark these pages 

http://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Lists

("The titles of list articles typically begin with the type of list it is (List of,
Index of, etc.), followed by the article's subject; like: List of vegetable
oils.")

Most of the time, the above rule is not followed. So my concrete question is:

- If I am consuming the pages-articles.xml dump (D) page by page, and I have a list of
pages (L) that I've extracted from the category links dump (C), how can I check whether
a page in the dump file D is a member of L? The titles alone do not seem to resolve the names.
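
To make the check concrete, here is a minimal Python sketch of one possible approach. It
assumes that each <page> element in D carries its own <title> and <id> children, and that
"cl_from" in C holds the same page id, so membership in L can be tested on ids rather than
on anything inside the page text. The names iter_list_pages, wanted_ids and SAMPLE_DUMP are
made up for illustration, and the sample is a tiny stand-in for the real dump.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for pages-articles.xml (hypothetical data, same element layout).
SAMPLE_DUMP = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">
  <page><title>Foo</title><ns>0</ns><id>10</id>
    <revision><id>111</id><text>'''Foo''' is a thing.</text></revision></page>
  <page><title>List of the longest Asian rivers</title><ns>0</ns><id>20</id>
    <revision><id>222</id><text>This is a list of rivers.</text></revision></page>
</mediawiki>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_list_pages(xml_source, wanted_ids):
    """Stream the dump and yield (page_id, title, wikitext) for wanted pages.

    wanted_ids is assumed to be a set of page ids taken from cl_from.
    """
    context = ET.iterparse(xml_source, events=("start", "end"))
    _event, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event != "end" or local_name(elem.tag) != "page":
            continue
        page_id = title = None
        text = ""
        for child in elem:  # direct children only, so revision ids are skipped
            name = local_name(child.tag)
            if name == "id":
                page_id = int(child.text)
            elif name == "title":
                title = child.text
            elif name == "revision":
                for sub in child:
                    if local_name(sub.tag) == "text":
                        text = sub.text or ""
        if page_id in wanted_ids:
            yield page_id, title, text
        root.clear()  # discard finished pages so memory stays bounded


if __name__ == "__main__":
    for page_id, title, text in iter_list_pages(io.BytesIO(SAMPLE_DUMP), {20}):
        print(page_id, title, len(text))
```

Matching on the numeric id sidesteps differences in how a title is written; if only titles
are available in L, the <title> element could be compared instead.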

For instance, if I have the page title "List of the longest Asian rivers"
(http://en.wikipedia.org/wiki/List_of_the_longest_Asian_rivers) then what in that
page's content
(http://en.wikipedia.org/w/index.php?title=List_of_the_longest_Asian_rivers&…)
can tell me it is the same page "List of the longest Asian rivers"? Non-list
pages appear to place the title as the first token, wrapped in ''' markup.
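
If the comparison is done on titles, the URL form (with underscores) and the display form
(with spaces) have to be brought to one spelling first. The following Python sketch shows
one way this might be done. The helper names normalize_title and title_from_url are
hypothetical, and the first-letter capitalisation is an assumption about how English
Wikipedia treats titles, not something taken from the dump itself.

```python
from urllib.parse import unquote, urlparse


def normalize_title(raw):
    """Bring a title to the display form: spaces, single-spaced, first letter upper."""
    title = unquote(raw).replace("_", " ")
    title = " ".join(title.split())  # collapse repeated or stray whitespace
    return title[:1].upper() + title[1:]


def title_from_url(url):
    """Extract the title from a /wiki/<Title> style URL (assumed URL layout)."""
    path = urlparse(url).path
    prefix = "/wiki/"
    if not path.startswith(prefix):
        raise ValueError("not a /wiki/ URL: " + url)
    return normalize_title(path[len(prefix):])


if __name__ == "__main__":
    url = "http://en.wikipedia.org/wiki/List_of_the_longest_Asian_rivers"
    print(title_from_url(url))
```

With both sides normalised this way, a plain set lookup on the title is enough to test
whether a page from D is in L.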

Any suggestions for a robust solution would be much appreciated.

Best