Hi L.,
There is unfortunately no table that tracks how links are included on a
page (hard-coded in the wikitext or transcluded via templates/Lua). It sounds
like you're already aware of the pagelinks
<https://www.mediawiki.org/wiki/Manual:Pagelinks_table> table, which can be
easily parsed with the mwsql <https://pypi.org/project/mwsql/> library if
you're working in Python. That leaves you with two options generally:
- Extract the links from the raw wikitext XML dumps, which would achieve
what you want. In Python, the easiest way is via mwxml
<https://pypi.org/project/mwxml/> for the dumps and mwparserfromhell
<https://github.com/earwig/mwparserfromhell> for extracting the links.
You could use the mwconstants <https://pypi.org/project/mwconstants/>
library then for filtering down to just the namespace you're interested in.
- Extract the links from the HTML dumps
<https://dumps.wikimedia.org/other/enterprise_html/>. This gives you all
the links in the article, and you can distinguish the transcluded ones
from the non-transcluded ones. There's a work-in-progress Python library
for this too called mwparserfromhtml
<https://pypi.org/project/mwparserfromhtml/> (see the blog post about it
<https://techblog.wikimedia.org/2023/02/24/from-hell-to-html/>).
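To make the second option a bit more concrete, here is a minimal sketch that
splits the wiki links in one article's HTML into transcluded and
non-transcluded ones. It uses Python's built-in html.parser rather than
mwparserfromhtml, and it rests on two assumptions that are my reading of the
Parsoid HTML format used in these dumps, so please check them against real
pages: internal links are <a rel="mw:WikiLink" href="./Title"> elements, and
template-generated elements carry an about="#mwt..." attribute. Other
generated content (references, for example) may carry that attribute too. The
names LinkSplitter and split_links are made up for this example.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class LinkSplitter(HTMLParser):
    """Collect wiki links, split by whether an ancestor looks transcluded."""

    def __init__(self):
        super().__init__()
        # One (tag, inside_transclusion) entry per currently open element.
        self._stack = []
        self.body_links = []
        self.transcluded_links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        about = attrs.get("about") or ""
        # Assumption: about="#mwt..." marks template-generated output.
        parent_inside = bool(self._stack) and self._stack[-1][1]
        inside = about.startswith("#mwt") or parent_inside

        rel = (attrs.get("rel") or "").split()
        if tag == "a" and "mw:WikiLink" in rel:
            href = attrs.get("href") or ""
            # Assumption: internal hrefs look like "./Page_title".
            title = href[2:] if href.startswith("./") else href
            target = self.transcluded_links if inside else self.body_links
            target.append(title)

        if tag not in VOID_TAGS:
            self._stack.append((tag, inside))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break


def split_links(article_html):
    """Return (body_links, transcluded_links) for one article's HTML."""
    parser = LinkSplitter()
    parser.feed(article_html)
    parser.close()
    return parser.body_links, parser.transcluded_links
```

Calling split_links(article_html) returns a pair of lists (body links,
transcluded links). Reading each article's HTML out of the dump files and
filtering to namespace 0 are left out of this sketch.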
Sorry that doesn't solve your problem but hope that helps. If you're curious
about why this isn't supported, you can read through some of the past
discussion <https://phabricator.wikimedia.org/T278236> around adding this
sort of functionality (essentially the links tables are already massive so
adding more information is not desirable at the moment).
Best,
Isaac
On Thu, Feb 16, 2023 at 5:13 PM Luigi Assom <luigi.assom(a)gmail.com> wrote:
Hi All,
for an applied research work, I am working on extracting links from the
Wikipedia corpus.
In the past I've been using the XML streams, but now I was hoping to speed
things up and handle the situation better by parsing the SQL tables.
However, I am stuck on this:
I could not find a way to filter the relevant links.
I can only filter by namespace apparently, while I want to only keep the
links that were mentioned in the main text, still namespace 0, but not
belonging to the infoboxes and navboxes menu.
How could I do that?
Is there any information that a link belongs to a menu or to the main
content, beyond the namespace?
Thanks All for your help,
L.
_______________________________________________
Wiki-research-l mailing list -- wiki-research-l(a)lists.wikimedia.org
--
Isaac Johnson (he/him/his) -- Senior Research Scientist -- Wikimedia
Foundation