Hey Haifeng,
On top of all the excellent answers provided, I'd also add that the answer
to your question depends on what you want to use the similarity scores for.
For some insight into what it might mean to choose one approach over
another, see this recent publication:
https://dl.acm.org/citation.cfm?id=3213769
At a high level, I'd say that there are three ways you might approach
article similarity on Wikipedia:
* Reader similarity: two articles are similar if the same people who read
one also frequently read the other. Navigation embeddings that implement
this definition based on page views were last generated in 2017, so newer
articles will not be represented, but here is the dataset [
https://figshare.com/articles/Wikipedia_Vectors/3146878 ] and meta page [
https://meta.wikimedia.org/wiki/Research:Wikipedia_Navigation_Vectors ].
The clickstream dataset [
https://dumps.wikimedia.org/other/clickstream/readme.html ], which is more
recent, might be used in a similar way (see the first sketch after this
list).
* Content similarity: two articles are similar if they contain similar
content -- i.e., in most cases, similar text. This covers most of the
suggestions provided to you in this email chain. Some are simpler but
language-specific unless you make substantial modifications (e.g., ESA, or
the LDA model described here:
https://cs.stanford.edu/people/jure/pubs/wikipedia-www17.pdf), while
others are more complicated but work across multiple languages (e.g., this
recent WSDM paper:
https://twitter.com/cervisiarius/status/1115510356976242688). See the
second sketch after this list for a toy example.
* Link similarity: two articles are similar if they link to similar
articles. Generally, this involves building a graph of Wikipedia's link
structure and then using a method such as node2vec to reduce the graph to
article embeddings (see the third sketch after this list). I know less
about current work in this space, but some searching should turn up a
variety of options -- e.g., Milne and Witten's 2008 approach [
http://www.aaai.org/Papers/Workshops/2008/WS-08-15/WS08-15-005.pdf ],
which is implemented in WikiBrain as Morten mentioned.
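To make the reader-similarity idea concrete, here is a minimal Python
sketch over a clickstream TSV dump (columns: prev, curr, type, n). The
filename, the article titles, and the choice to represent an article by
its referrer counts are all illustrative assumptions, not a fixed recipe:

import csv
from collections import defaultdict
from math import sqrt

def referrer_vectors(path):
    # Map each article to a {referrer: count} vector from the clickstream.
    vecs = defaultdict(dict)
    with open(path, encoding="utf-8") as f:
        for prev, curr, type_, n in csv.reader(f, delimiter="\t"):
            if type_ == "link":  # keep only internal-link transitions
                vecs[curr][prev] = int(n)
    return vecs

def cosine(u, v):
    # Cosine similarity between two sparse {key: count} vectors.
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

vecs = referrer_vectors("clickstream-enwiki-2019-04.tsv")  # hypothetical filename
print(cosine(vecs["Cat"], vecs["Dog"]))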
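For content similarity, here is a rough sketch of the topic-model flavor
using gensim's LDA. This is only in the spirit of the LDA paper linked
above, not its actual pipeline, and the two strings are placeholders for
full article plaintext:

from gensim import corpora, models, matutils

docs = [
    "cats are small domesticated carnivorous mammals",
    "dogs are domesticated descendants of the wolf",
]
tokenized = [d.lower().split() for d in docs]
dictionary = corpora.Dictionary(tokenized)
corpus = [dictionary.doc2bow(t) for t in tokenized]

# Fit a small topic model; each article becomes a distribution over topics.
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=10, random_state=0)

# Compare the two topic distributions with cosine similarity
# (Jensen-Shannon divergence would be another reasonable choice).
print(matutils.cossim(lda[corpus[0]], lda[corpus[1]]))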
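And here is a toy sketch of the link-similarity route using the networkx
and node2vec Python packages. The miniature graph below stands in for
Wikipedia's real link structure, which you would build from the pagelinks
dumps:

import networkx as nx
from node2vec import Node2Vec

# Hypothetical miniature link graph: nodes are articles, edges are links.
g = nx.Graph()
g.add_edges_from([("Cat", "Felidae"), ("Cat", "Pet"),
                  ("Dog", "Canidae"), ("Dog", "Pet")])

# Random walks over the graph, then word2vec over the walks.
n2v = Node2Vec(g, dimensions=32, walk_length=10, num_walks=50)
model = n2v.fit(window=5, min_count=1)

# Cosine similarity between the resulting article embeddings.
print(model.wv.similarity("Cat", "Dog"))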
There are also other, more structured approaches like ORES drafttopic,
which predicts which topics (based on WikiProjects) are most likely to
apply to a given English Wikipedia article:
https://www.mediawiki.org/wiki/Talk:ORES/Draft_topic
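If you wanted to experiment with that, one rough sketch is to treat the
per-topic probabilities returned by the ORES scoring API as topic vectors
and compare those. The revision IDs below are placeholders; you would
first look up each article's latest revision via the MediaWiki API:

import requests
from math import sqrt

def drafttopic_probs(revid):
    # Query ORES for the drafttopic model's per-topic probabilities.
    url = ("https://ores.wikimedia.org/v3/scores/enwiki/"
           f"?models=drafttopic&revids={revid}")
    resp = requests.get(url).json()
    return resp["enwiki"]["scores"][str(revid)]["drafttopic"]["score"]["probability"]

def cosine(u, v):
    # Cosine similarity between two {topic: probability} dicts.
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Placeholder revision IDs for two articles.
print(cosine(drafttopic_probs(123456), drafttopic_probs(654321)))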
On Tue, May 7, 2019 at 9:54 AM <fn(a)imm.dtu.dk> wrote:
Dear Haifeng,
Would you not be able to use ordinary information retrieval techniques
such as bag-of-words/phrases and tfidf? Explicit semantic analysis (ESA)
uses this approach (though its primary focus is word semantic similarity).
There are a few papers for ESA:
https://tools.wmflabs.org/scholia/topic/Q5421270
I have also used it in "Open semantic analysis: The case of word level
semantics in Danish"
http://www2.compute.dtu.dk/pubdb/views/edoc_download.php/7029/pdf/imm7029.p…
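For concreteness, a minimal sketch of the bag-of-words/tfidf route using
scikit-learn; the two strings below are placeholders for the plaintext of
two articles:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = [
    "The cat is a small domesticated carnivorous mammal.",
    "The dog is a domesticated descendant of the wolf.",
]
tfidf = TfidfVectorizer().fit_transform(texts)  # rows are tf-idf vectors
print(cosine_similarity(tfidf[0], tfidf[1])[0, 0])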
Finn Årup Nielsen
http://people.compute.dtu.dk/faan/
On 04/05/2019 13:47, Haifeng Zhang wrote:
Dear folks,
Is there a way to compute content similarity between two Wikipedia
articles?
For example, I can think of representing each article as a vector of
likelihoods
over possible topics.
But I wonder whether there is other work people have already explored in
the past.
Thanks,
Haifeng
--
Isaac Johnson -- Research Scientist -- Wikimedia Foundation