I've built and open-sourced technology that can extract these sorts of unplanned features from dumps. The system would be primed with specific dumps and trained for the feature of interest; this might take a day. A full run would then take an hour and produce a CSV file of features for downstream study.
Would this be of interest?
Best regards -- Ward
On Feb 21, 2017, at 8:48 AM, Giuseppe Profiti profgiuseppe@gmail.com wrote:
2017-02-19 20:56 GMT+01:00 Mara Sorella sorella@dis.uniroma1.it:
Hi everybody, I'm new to the list and was referred here by a comment from an SO user on my question [1], which I'm quoting next:
I have been successfully able to use the Wikipedia pagelinks SQL dump to obtain hyperlinks between Wikipedia pages for a specific revision time.
However, there are cases where multiple such links exist between the same pair of pages, e.g. between the https://en.wikipedia.org/wiki/Wikipedia page and https://en.wikipedia.org/wiki/Wikimedia_Foundation. I'm interested in finding the number of links between pairs of pages for a specific revision.
Ideal solutions would involve dump files other than pagelinks (which I'm not aware of), or using the MediaWiki API.
To elaborate, I need this information to weight (almost) every hyperlink between article pages (that is, in NS0) that was present in a specific Wikipedia revision (end of 2015). Therefore, I would prefer not to follow the solution suggested by the SO user, which would be rather impractical.
Hi Mara, the MediaWiki API does not return the multiplicity of the links [1]. As far as I can see from the database layout, you can't get the multiplicity of links from it either [2]. The only solution that occurs to me is to parse the wikitext of the page, as the SO user suggested.
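For what it's worth, here is a minimal sketch of that approach in Python (using the requests library; the regex and the title handling are rough simplifications, and links generated by templates are not counted, so treat the numbers as approximate):

import re
import requests

API = "https://en.wikipedia.org/w/api.php"

def wikitext_at(title, timestamp):
    # Last revision at or before `timestamp` (e.g. "2015-12-31T23:59:59Z").
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvlimit": 1,
        "rvdir": "older",
        "rvstart": timestamp,
        "rvprop": "content",
        "rvslots": "main",
        "format": "json",
        "formatversion": 2,
    }
    data = requests.get(API, params=params).json()
    return data["query"]["pages"][0]["revisions"][0]["slots"]["main"]["content"]

# Matches [[Target]], [[Target|label]] and [[Target#Section]]; file and
# category links are counted too, so this is only an approximation.
LINK_RE = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")

def link_multiplicities(title, timestamp):
    counts = {}
    for m in LINK_RE.finditer(wikitext_at(title, timestamp)):
        target = m.group(1).strip().replace(" ", "_")
        counts[target] = counts.get(target, 0) + 1
    return counts

# e.g. link_multiplicities("Wikipedia", "2015-12-31T23:59:59Z").get("Wikimedia_Foundation")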
In any case, some communities have established writing styles that discourage multiple links to the same article (e.g. on the Italian Wikipedia a link is attached only to the first occurrence of the word). The numbers you get may therefore vary depending on the style of the community and/or the last editor.
Indeed, my final aim is to use this weight in a thresholding fashion to sparsify the Wikipedia graph (which, due to its short diameter, is more or less one giant connected component), in a way that should reflect the "relatedness" of the linked pages (where relatedness is not intended as strictly semantic, but at a higher "concept" level, if I may say so). For this reason, other suggestions on how to determine such weights (possibly using other data sources -- ontologies?) are more than welcome.
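To make the thresholding idea concrete, something along these lines is what I have in mind (a sketch only; it assumes the per-pair multiplicities have already been collected into a weighted edge list, and uses networkx merely as one possible library):

import networkx as nx

def sparsify(weighted_edges, threshold):
    # Keep only edges whose weight (link multiplicity) reaches the threshold.
    g = nx.DiGraph()
    g.add_weighted_edges_from(weighted_edges)
    weak = [(u, v) for u, v, w in g.edges(data="weight") if w < threshold]
    g.remove_edges_from(weak)
    g.remove_nodes_from(list(nx.isolates(g)))
    return g

# Toy, made-up input: (source, target, multiplicity)
edges = [("Wikipedia", "Wikimedia_Foundation", 4),
         ("Wikipedia", "Encyclopedia", 1),
         ("Encyclopedia", "Wikipedia", 2)]
print(sparsify(edges, threshold=2).edges(data=True))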
When you get the graph of connections, instead of using the multiplicity as a weight, you could try community detection methods to isolate subclusters of strongly connected articles. Another approach may be to use centrality measures; however, the only one that can be applied to edges rather than just nodes is betweenness centrality, if I remember correctly.
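If it helps, both ideas are available out of the box in networkx (my assumption of library; any graph toolkit with edge betweenness would do), roughly like this:

import networkx as nx
from networkx.algorithms.community import girvan_newman

def edge_scores(g):
    # Betweenness centrality computed per edge rather than per node.
    # NB: networkx treats "weight" as a distance, so if the weights are
    # link multiplicities you may want to invert them (e.g. 1/w) first.
    return nx.edge_betweenness_centrality(g, weight="weight")

def first_split(g):
    # Girvan-Newman repeatedly removes the highest-betweenness edge;
    # the first item of the iterator is the coarsest split into communities.
    return next(girvan_newman(g.to_undirected()))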
In case a quick technical solution comes to mind, I'll write here again.
Best, Giuseppe
[1] https://en.wikipedia.org/w/api.php?action=query&prop=links&titles=Wi...
[2] https://upload.wikimedia.org/wikipedia/commons/9/94/MediaWiki_1.28.0_databas...