Hi All,
For an applied research project, I am working on extracting links from the
Wikipedia corpus.
In the past I have used the XML streams, but now I was hoping to speed
things up and handle the task better by parsing the SQL tables.
However, I am stuck on this:
I could not find a way to filter the relevant links.
Apparently I can only filter by namespace, whereas I want to keep only the
links that appear in the main text (still namespace 0) and exclude those
that belong to the infoboxes and navboxes.
How could I do that?
Is there any information, beyond the namespace, that indicates whether a
link belongs to a menu or to the main content?
Thanks All for your help,
L.