I have a set of list-page titles that I've extracted from the category dump (where "cl_from" is of type "page").


Now I want to extract the content of each of those pages from the pages dump.


There are guidelines on how editors should title these pages:

("The titles of list articles typically begin with the type of list it is (List of, Index of, etc.), followed by the article's subject; like: List of vegetable oils.")

Most of the time, however, the above rule is not followed. So my concrete question is:

- If I am consuming the pages-articles.xml dump (D) page by page, and I have a list of pages (L) that I've extracted from the category links dump (C), how can I check that a given page in D is a member of L? The titles alone do not seem to resolve the names.
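
To make the setup concrete, here is a minimal sketch of the kind of check I am describing. It assumes each `<page>` element in the dump has `<title>` and `<id>` children, as in the MediaWiki XML export format, and that titles taken from the SQL dumps use underscores where the XML uses spaces. The names `iter_pages`, `normalize`, and `members`, as well as the sample page ids, are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for pages-articles.xml; the ids and text are invented.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>List of the longest Asian rivers</title>
    <ns>0</ns>
    <id>1</id>
    <revision><id>10</id><text>placeholder wikitext</text></revision>
  </page>
  <page>
    <title>Some other page</title>
    <ns>0</ns>
    <id>2</id>
    <revision><id>20</id><text>more placeholder wikitext</text></revision>
  </page>
</mediawiki>"""


def local(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Stream (title, page_id, text) for every <page> in the dump."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title = page_id = text = None
        for child in elem:
            name = local(child.tag)
            if name == "title":
                title = child.text
            elif name == "id":  # direct child of <page>, so the page id
                page_id = int(child.text)
            elif name == "revision":
                for sub in child:
                    if local(sub.tag) == "text":
                        text = sub.text or ""
        yield title, page_id, text
        elem.clear()  # free the finished page so memory stays bounded


def normalize(title):
    """Assumption: SQL-dump titles use '_' where the XML dump uses ' '."""
    return title.replace("_", " ").strip()


def members(source, wanted_titles):
    """Yield only the pages whose normalized title is in the wanted set."""
    wanted = {normalize(t) for t in wanted_titles}
    for title, page_id, text in iter_pages(source):
        if title is not None and normalize(title) in wanted:
            yield title, page_id, text
```

For the real dump, `source` would be the opened pages-articles.xml file instead of the in-memory sample. If L were kept as page ids rather than titles, the same loop could compare `page_id` against a set of ids instead.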

For instance, if I have the page title "List of the longest Asian rivers" (http://en.wikipedia.org/wiki/List_of_the_longest_Asian_rivers), then what in that page's content (http://en.wikipedia.org/w/index.php?title=List_of_the_longest_Asian_rivers&action=edit) can tell me that it is the same page, "List of the longest Asian rivers"? Non-list pages appear to place the title as the first token, wrapped in ''' (bold) markers.

Any suggestions for a robust solution would be much appreciated.