"Jason Y. Lee" jylee@cs.ucr.edu wrote:
I am currently rewriting this function, but in case anyone wants to
beat me to it, I suggest the following all-encompassing regular
expression:
re.compile(r'<li><a href=".*?" title=".*?">(.*?)</a> *\(*(inclusion|redirect page)*\)*.*?</li>')
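group(1) will give you the title, and group(2) of the search will be
one of 'inclusion' or 'redirect page', or empty for an ordinary link
(in Python an optional group that did not match comes back as None
rather than '').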
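
A minimal sketch of how such a pattern could be used in Python. It assumes the parentheses around the annotation are meant literally (so they are escaped here), and the sample lines and the parse_line helper are made up for illustration, not copied from a real page or from the function being rewritten:

```python
import re

# Assumption: the parentheses around the annotation are literal,
# so they are backslash-escaped in a raw string.
LINK_RE = re.compile(
    r'<li><a href=".*?" title=".*?">(.*?)</a>'
    r' *\(*(inclusion|redirect page)*\)*.*?</li>'
)

# Hypothetical sample lines, one per list item.
sample = [
    '<li><a href="/wiki/Foo" title="Foo">Foo</a></li>',
    '<li><a href="/wiki/Bar" title="Bar">Bar</a> (redirect page)</li>',
    '<li><a href="/wiki/Template:Baz" title="Template:Baz">Template:Baz</a> (inclusion)</li>',
]


def parse_line(line):
    """Return (title, annotation) for one list item, or None if it does not match."""
    m = LINK_RE.search(line)
    if m is None:
        return None
    # An optional group that did not participate yields None; normalise to ''.
    return m.group(1), m.group(2) or ''


results = [parse_line(line) for line in sample]
```

Normalising with `or ''` keeps the three cases as plain strings, so callers can compare the annotation directly.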
This will only work in the English Wikipedia. In es.wikipedia.org, for example, group(2) would be "página redirigida" for a redirect page, although it is still "inclusion" (no accented character, interestingly) for a template inclusion.
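
One possible way around this is to build the pattern from a per-language table of labels. The sketch below is only an illustration: the LABELS table and build_link_re are hypothetical names, the table contains just the two languages mentioned above, and the test line is made up. Labels for any other language would have to be checked against that site.

```python
import re

# Hypothetical table: (inclusion label, redirect label) per language code.
# Only the values mentioned in this thread are filled in.
LABELS = {
    'en': ('inclusion', 'redirect page'),
    'es': ('inclusion', 'página redirigida'),
}


def build_link_re(lang):
    """Compile the list-item pattern with the labels for the given language."""
    # re.escape guards against labels that contain regex metacharacters.
    alternatives = '|'.join(re.escape(label) for label in LABELS[lang])
    return re.compile(
        r'<li><a href=".*?" title=".*?">(.*?)</a>'
        r' *\(*(' + alternatives + r')*\)*.*?</li>'
    )


# Made-up Spanish list item for illustration.
line = '<li><a href="/wiki/Foo" title="Foo">Foo</a> (página redirigida)</li>'
es_match = build_link_re('es').search(line)
en_match = build_link_re('en').search(line)
```

With the Spanish labels the annotation is captured in group(2); with the English labels the line still matches, but group(2) is None, which is how a redirect would be silently missed.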