I have a set of list-page titles that I've extracted from the category links dump (rows where "cl_from" refers to an entry of type "page"):

http://www.mediawiki.org/wiki/Manual:Categorylinks_table

Now I want to extract the CONTENT of each of these pages from the pages dump:

enwiki-latest-pages-articles.xml

There are guidelines on how editors should title these pages:

http://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Lists

("The titles of list articles typically begin with the type of list it is (List of, Index of, etc.), followed by the article's subject; like: List of vegetable oils.")

In the majority of cases, the above rule is not followed. So my concrete question is:

- If I am consuming the pages-articles.xml dump (D) page by page, and I have a list of pages (L) that I've extracted from the category links dump (C), how can I check that a given page in D is a member of L? The titles alone do not resolve the match.

For instance, if I have the page title "List of the longest Asian rivers" (http://en.wikipedia.org/wiki/List_of_the_longest_Asian_rivers), then what in that page's content (http://en.wikipedia.org/w/index.php?title=List_of_the_longest_Asian_rivers&action=edit) can tell me it is the same page, "List of the longest Asian rivers"? Non-list pages appear to place the title as the first token, wrapped in ''' (bold) markings.
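
To make the setup concrete, here is a minimal Python sketch of the kind of check I mean. It is only a sketch under assumptions: it assumes each <page> element in the dump carries <title> and <id> children next to the revision text, and that this <id> is the same page id that "cl_from" stores, so membership could be tested on ids rather than on the wikitext. The names iter_pages, select_pages, wanted_ids and the sample XML (including its ids) are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the XML namespace, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (page_id, title, wikitext) for each <page>, one page at a time."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        page_id = title = None
        text = ""
        for child in elem:  # direct children only, so revision ids are ignored
            name = local(child.tag)
            if name == "id":
                page_id = int(child.text)
            elif name == "title":
                title = child.text
            elif name == "revision":
                for sub in child:
                    if local(sub.tag) == "text":
                        text = sub.text or ""
        yield page_id, title, text
        elem.clear()  # drop the page's children to keep memory use low


def select_pages(stream, wanted_ids):
    """Yield (title, wikitext) for pages whose id is in wanted_ids (the list L)."""
    for page_id, title, text in iter_pages(stream):
        if page_id in wanted_ids:
            yield title, text


# Tiny stand-in for enwiki-latest-pages-articles.xml; ids are hypothetical.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>List of the longest Asian rivers</title><ns>0</ns><id>101</id>
    <revision><id>9</id><text>Some list wikitext</text></revision></page>
  <page><title>Nile</title><ns>0</ns><id>202</id>
    <revision><id>10</id><text>'''Nile''' is a river</text></revision></page>
</mediawiki>"""

wanted_ids = {101}  # hypothetical "cl_from" values taken from the category links dump
for title, text in select_pages(io.BytesIO(SAMPLE.encode("utf-8")), wanted_ids):
    print(title)
```

For the real dump, the stream would be the opened pages-articles.xml file instead of the in-memory sample. If only titles are available in L, the same loop could test the title against a set of titles instead of the id.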

Any suggestions for a robust solution would be much appreciated.

Best