Re: [Wiki-research-l] Content similarity between two Wikipedia articles

7 May 2019

Indeed, the purpose does matter. Is the end goal the content similarity of the articles
themselves (say, to detect articles that might be merged), or is the end goal the
relatedness of the topics represented by those articles? If the latter, then the
Wikipedia category system relates articles with some commonality of topic, and the distance
between articles via the category hierarchy is an indicator of their level of relatedness.
Similarly, navboxes relate articles that have something in common, as do list articles. All
three are manually curated, and may be a much cheaper way to determine
relatedness of topics than messing about with bags of words, etc. But it all really
depends on the end goal.
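
As a rough illustration of distance via the category hierarchy, here is a minimal sketch
that counts the hops between two articles through shared categories. The PARENTS mapping
is hypothetical toy data, not real Wikipedia categories, and treating the hierarchy as an
undirected graph searched breadth-first is an assumption made for simplicity.

```python
from collections import deque

# Hypothetical toy data: each article or category maps to its parent categories.
PARENTS = {
    "Article:Lion": ["Category:Big cats"],
    "Article:Tiger": ["Category:Big cats"],
    "Article:Wolf": ["Category:Canines"],
    "Category:Big cats": ["Category:Carnivores"],
    "Category:Canines": ["Category:Carnivores"],
    "Category:Carnivores": [],
}


def build_undirected(parents):
    """Turn the child -> parents mapping into an undirected adjacency map."""
    graph = {}
    for child, parent_list in parents.items():
        graph.setdefault(child, set())
        for parent in parent_list:
            graph.setdefault(parent, set())
            graph[child].add(parent)
            graph[parent].add(child)
    return graph


def category_distance(parents, a, b):
    """Return the number of hops between a and b, or None if unconnected."""
    graph = build_undirected(parents)
    if a not in graph or b not in graph:
        return None
    queue = deque([(a, 0)])
    seen = {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None
```

In this toy graph, Lion and Tiger share a direct category and so sit closer together than
Lion and Wolf, which only meet one level higher up. Fewer hops suggests closer topics.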

Kerry

-----Original Message-----
From: Wiki-research-l [mailto:wiki-research-l-bounces@lists.wikimedia.org] On Behalf Of
Isaac Johnson
Sent: Wednesday, 8 May 2019 1:35 AM
To: Research into Wikimedia content and communities
<wiki-research-l(a)lists.wikimedia.org>
Subject: Re: [Wiki-research-l] Content similarity between two Wikipedia articles

Hey Haifeng,
On top of all the excellent answers provided, I'd also add that the answer to your
question depends on what you want to use the similarity scores for.
For some insight into what it might mean to choose one approach over another, see
this recent publication:
https://dl.acm.org/citation.cfm?id=3213769

At a high level, I'd say that there are three ways you might approach article
similarity on Wikipedia:
* Reader similarity: two articles are similar if the same people who read one also
frequently read the other. Navigation embeddings that implement this definition based on
page views were last generated in 2017, so newer articles will not be represented, but
here is the dataset [
https://figshare.com/articles/Wikipedia_Vectors/3146878 ] and meta page [
https://meta.wikimedia.org/wiki/Research:Wikipedia_Navigation_Vectors ].
The clickstream dataset [
https://dumps.wikimedia.org/other/clickstream/readme.html ], which is more recent, might
be used in a similar way.
* Content similarity: two articles are similar if they contain similar content -- i.e. in
most cases, similar text. This covers most of the suggestions provided to you in this
email chain. Some are simpler but are language-specific unless you make substantial
modifications (e.g., ESA, the LDA model described here:
https://cs.stanford.edu/people/jure/pubs/wikipedia-www17.pdf) while others are more
complicated but work across multiple languages (e.g., recent WSDM
paper: https://twitter.com/cervisiarius/status/1115510356976242688).
* Link similarity: two articles are similar if they link to similar articles. Generally,
this approach involves creating a graph of Wikipedia's link structure and then using
an approach such as node2vec to reduce the graph to article embeddings. I know less about
the current approaches in this space, but some searching should turn up a variety of
approaches -- e.g., Milne and Witten's 2008 approach [
http://www.aaai.org/Papers/Workshops/2008/WS-08-15/WS08-15-005.pdf ], which is implemented
in WikiBrain as Morten mentioned.
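
To make the link-based idea concrete, here is a minimal sketch of a relatedness score
computed from the sets of articles that link to each of two pages. It follows the
Normalized-Google-Distance-style formula associated with Milne and Witten's approach,
converted to a similarity by subtracting from 1 and clamping at 0. The function name, the
handling of edge cases, and the toy inputs are assumptions for illustration only, not
WikiBrain's implementation.

```python
import math


def link_relatedness(inlinks_a, inlinks_b, total_articles):
    """Relatedness in [0, 1] from the overlap of two articles' incoming links.

    inlinks_a, inlinks_b: iterables of IDs of the articles linking to each page.
    total_articles: total number of articles in the wiki.
    """
    a, b = set(inlinks_a), set(inlinks_b)
    common = a & b
    if not common:
        return 0.0  # no shared inlinks: treat as unrelated
    larger = max(len(a), len(b))
    smaller = min(len(a), len(b))
    denominator = math.log(total_articles) - math.log(smaller)
    if denominator <= 0:
        return 1.0  # degenerate case: every article links to both pages
    distance = (math.log(larger) - math.log(len(common))) / denominator
    return max(0.0, 1.0 - distance)
```

Two pages with identical inlink sets score 1.0, pages with no shared inlinks score 0.0, and
partial overlap falls in between.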

There are also other, more structured approaches like ORES drafttopic, which predicts
which topics (based on WikiProjects) are most likely to apply to a given English Wikipedia
article:
https://www.mediawiki.org/wiki/Talk:ORES/Draft_topic

On Tue, May 7, 2019 at 9:54 AM <fn(a)imm.dtu.dk> wrote:

...
  Dear Haifeng,

 Would you not be able to use ordinary information retrieval techniques 
 such as bag-of-words/phrases and tfidf? Explicit semantic analysis 
 (ESA) uses this approach (though its primary focus is word semantic similarity).
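
 For illustration, a minimal bag-of-words/tf-idf sketch using only the Python standard
 library might look like the following. The tokenizer, the smoothed idf formula, and the
 three toy documents in the usage note are assumptions; a real pipeline would work on full
 article text and likely use an established library.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (a dict of term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    # Smoothed idf, so that terms occurring in every document keep a small weight.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({t: count * idf[t] for t, count in term_freq.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

 For example, with the toy documents "lion big cat africa", "tiger big cat asia", and
 "volcano lava eruption", the first two share terms and get a positive cosine score, while
 the first and third share none and score zero.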

 There are a few papers for ESA:
 https://tools.wmflabs.org/scholia/topic/Q5421270

 I have also used it in "Open semantic analysis: The case of word level 
 semantics in Danish"

 http://www2.compute.dtu.dk/pubdb/views/edoc_download.php/7029/pdf/imm7029.pdf

 Finn Årup Nielsen
 http://people.compute.dtu.dk/faan/

 On 04/05/2019 13:47, Haifeng Zhang wrote:
  Dear folks,

 Is there a way to compute content similarity between two Wikipedia articles?

 For example, I can think of representing each article as a vector of likelihoods
 over possible topics.

 But I wonder whether there is other work that people have already explored in
 the past.

 Thanks,

 Haifeng
 _______________________________________________
 Wiki-research-l mailing list
 Wiki-research-l(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

--
Isaac Johnson -- Research Scientist -- Wikimedia Foundation
