Hi L.,

There is unfortunately no table that tracks how links are included on a page (hard-coded via wikitext or transcluded via templates/Lua). It sounds like you're already aware of the pagelinks https://www.mediawiki.org/wiki/Manual:Pagelinks_table table, which can be easily parsed with the mwsql https://pypi.org/project/mwsql/ library if you're working in Python. That leaves you with two options generally:
- Extract the links from the raw wikitext XML dumps, which would achieve what you want. In Python, the easiest way is via mwxml https://pypi.org/project/mwxml/ for the dumps and mwparserfromhell https://github.com/earwig/mwparserfromhell for extracting the links. You could use the mwconstants https://pypi.org/project/mwconstants/ library then for filtering down to just the namespace you're interested in.
- Extract the links from the HTML dumps https://dumps.wikimedia.org/other/enterprise_html/. This gives you all the links in the article and you could separate between the transcluded ones and non-transcluded ones. There's a work-in-progress Python library for this too called mwparserfromhtml https://pypi.org/project/mwparserfromhtml/ (see blogpost about it https://techblog.wikimedia.org/2023/02/24/from-hell-to-html/).
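To make the first option concrete, here is a rough, standard-library-only sketch of the idea: drop everything inside {{...}} template calls, then collect the remaining [[...]] link targets. It is not the mwparserfromhell or mwxml API; the function names and the small NON_MAIN_PREFIXES set are made up for illustration (mwconstants would give you a proper namespace list), and a real parser will handle many edge cases that this regex does not.

```python
import re

# Matches [[Target]], [[Target|label]] and [[Target#Section|label]];
# group 1 is the target page title without section or label.
LINK_RE = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")

# Hypothetical, incomplete list of English namespace prefixes to skip.
NON_MAIN_PREFIXES = {"file", "image", "category", "template", "help",
                     "wikipedia", "portal", "talk", "user"}


def strip_templates(wikitext):
    """Remove everything inside {{...}}, tracking nesting depth."""
    out = []
    depth = 0
    i = 0
    while i < len(wikitext):
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}" and depth > 0:
            depth -= 1
            i += 2
        else:
            if depth == 0:
                out.append(wikitext[i])
            i += 1
    return "".join(out)


def body_links(wikitext):
    """Return main-namespace link targets written outside of templates."""
    targets = []
    for match in LINK_RE.finditer(strip_templates(wikitext)):
        # A leading colon (e.g. [[:Category:Foo]]) only changes rendering.
        title = match.group(1).strip().lstrip(":")
        if not title:
            continue
        prefix = title.split(":", 1)[0].lower() if ":" in title else ""
        if prefix in NON_MAIN_PREFIXES:
            continue
        targets.append(title)
    return targets


sample = (
    "{{Infobox settlement|name=Rome|country=[[Italy]]}}\n"
    "'''Rome''' is the capital of [[Italy]] and lies on the [[Tiber|Tiber river]].\n"
    "[[File:Colosseum.jpg|thumb|The Colosseum]]\n"
    "{{Navbox|list1=[[Milan]] [[Naples]]}}\n"
    "[[Category:Capitals in Europe]]"
)
print(body_links(sample))
```

Note that this also discards links produced by templates used inside the running text (hatnotes and the like), and it does not normalize titles or resolve redirects, so treat it as a starting point only.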
Sorry that doesn't solve your problem, but I hope it helps. If you're curious about why this isn't supported, you can read through some of the past discussion https://phabricator.wikimedia.org/T278236 around adding this sort of functionality (essentially, the links tables are already massive, so adding more information is not desirable at the moment).
Best, Isaac
On Thu, Feb 16, 2023 at 5:13 PM Luigi Assom luigi.assom@gmail.com wrote:
Hi All,
for an applied research work, I am working on extracting links from the Wikipedia corpus.
In the past I've been using the XML streams, but now I was hoping to speed things up and handle the situation better by parsing the SQL tables.
However, I am stuck on this:
I could not find a way to filter the relevant links.
Apparently I can only filter by namespace, while I want to keep only the links that are mentioned in the main text (still namespace 0) and that do not belong to infoboxes or navbox menus.
How could I do that? Is there any information that a link belongs to a menu or to the main content, beyond the namespace?
Thanks All for your help, L.