Hi
I have a set of list-page titles that I've extracted from the category links dump (rows where "cl_type" is "page"; "cl_from" is the ID of the member page)
http://www.mediawiki.org/wiki/Manual:Categorylinks_table
Now I want to extract the CONTENT of each of these pages from the pages dump
enwiki-latest-pages-articles.xml
Although there are guidelines on how editors should mark these pages
http://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Lists
("The titles of list articles typically begin with the type of list it is (List of, Index of, etc.), followed by the article's subject; like: List of vegetable oils.")
most of the time the above rule is not followed. So my concrete question is:
- If I am consuming the pages-articles.xml dump (D) page by page, and I have a list of pages (L) that I've extracted from the category links dump (C), how can I check that a given page in D is a member of L? The titles alone do not seem to resolve the match.
For instance, if I have the page title "List of the longest Asian rivers" (http://en.wikipedia.org/wiki/List_of_the_longest_Asian_rivers), then what in that page's content (http://en.wikipedia.org/w/index.php?title=List_of_the_longest_Asian_rivers&a...) can tell me that it is the same page, "List of the longest Asian rivers"? Non-list pages appear to place the title as the first token, in ''' markings.
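To make the setup concrete, here is a minimal sketch of the kind of streaming check I have in mind. It assumes (I have not confirmed this against the full dump) that each <page> element in D carries its own <title> and <id> children next to the <revision>/<text> content, and that titles coming from the database side use underscores where the dump uses spaces. The function names and the tiny SAMPLE document, including its page IDs, are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Made-up miniature stand-in for pages-articles.xml (IDs are invented).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>List of the longest Asian rivers</title>
    <ns>0</ns>
    <id>101</id>
    <revision><id>1</id><text>rivers table</text></revision>
  </page>
  <page>
    <title>Yangtze</title>
    <ns>0</ns>
    <id>102</id>
    <revision><id>2</id><text>'''Yangtze''' is a river</text></revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Drop the XML namespace prefix: '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def normalize_title(title):
    """Assumption: L may use underscores where the dump uses spaces."""
    return title.replace("_", " ").strip()


def iter_matching_pages(source, wanted_titles):
    """Yield (title, page_id, text) for each page in D whose title is in L."""
    wanted = {normalize_title(t) for t in wanted_titles}
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title = page_id = text = None
        for child in elem:
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "id":
                # Direct child of <page>, not the revision's own <id>.
                page_id = child.text
            elif name == "revision":
                for sub in child:
                    if local_name(sub.tag) == "text":
                        text = sub.text or ""
        if title is not None and normalize_title(title) in wanted:
            yield title, page_id, text
        # Release this page's subtree before moving on to the next one.
        elem.clear()


matches = list(
    iter_matching_pages(io.BytesIO(SAMPLE), ["List_of_the_longest_Asian_rivers"])
)
```

For the real dump, the file name (enwiki-latest-pages-articles.xml) would be passed in place of the io.BytesIO object. If the page's <id> really does correspond to "cl_from", matching on IDs rather than titles would presumably be more robust, but that is a guess on my part.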
Any suggestions of a robust solution would be much appreciated.
Best