Hi Haifeng,
Yes, you might want to look into some of the work done by Hecht et al. on content similarity between languages, as well as work by Sen et al. on semantic relatedness algorithms (which are implemented in the WikiBrain framework http://wikibrainapi.org/, by the way; see reference below). Some papers to start out with could be (there is a small code sketch of the link-based idea right after the list):
- Bao, P., Hecht, B., Carton, S., Quaderi, M., Horn, M. and Gergle, D. "Omnipedia: Bridging the Wikipedia Language Gap" CHI 2012. http://www.brenthecht.com/publications/bhecht_CHI2012_omnipedia.pdf
- Hecht, B. and Gergle, D. "The Tower of Babel Meets Web 2.0: User-Generated Content and Its Applications in a Multilingual Context" CHI 2010. http://www.brenthecht.com/publications/bhecht_chi2010_towerofbabel.pdf
- Sen, S., Swoap, A. B., Li, Q., Boatman, B., Dippenaar, I., Gold, R., Ngo, M., Pujol, S., Jackson, B. and Hecht, B. "Cartograph: Unlocking Spatial Visualization Through Semantic Enhancement" IUI 2017. http://www.shilad.com/static/cartograph-iui-2017-final.pdf
- Sen, S., Johnson, I., Harper, R., Mai, H., Horlbeck Olsen, S., Mathers, B., Souza Vonessen, L., Wright, M. and Hecht, B. "Towards Domain-Specific Semantic Relatedness: A Case Study in Geography" IJCAI 2015. http://ijcai.org/papers15/Papers/IJCAI15-334.pdf
- Sen, S., Lesicko, M., Giesel, M., Gold, R., Hillmann, B., Naden, S., Russell, J., Wang, Z. and Hecht, B. "Turkers, Scholars, "Arafat" and "Peace": Cultural Communities and Algorithmic Gold Standards" CSCW 2015. http://www-users.cs.umn.edu/~bhecht/publications/goldstandards_CSCW2015.pdf
- Sen, S., Li, T. J.-J., Lesicko, M., Weiland, A., Gold, R., Li, Y., Hillmann, B. and Hecht, B. "WikiBrain: Democratizing Computation on Wikipedia" OpenSym 2014. http://www-users.cs.umn.edu/~bhecht/publications/WikiBrain-WikiSym2014.pdf
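Just to make the link-based idea from several of those papers concrete: a common family of relatedness measures compares the sets of pages that link to the two articles. This is not the WikiBrain implementation itself, only a minimal Milne-Witten-style sketch; the inlink sets and the total article count are assumed inputs you would have to fetch yourself (e.g. from the MediaWiki API or a pagelinks dump):

import math

def link_relatedness(inlinks_a, inlinks_b, total_articles):
    """Milne-Witten-style link-based relatedness between two articles.

    inlinks_a, inlinks_b: sets of page IDs (or titles) that link to
    articles a and b; total_articles: number of articles in the wiki.
    Returns a value in [0, 1]; higher means more related.
    """
    a, b = set(inlinks_a), set(inlinks_b)
    shared = a & b
    if not shared:
        return 0.0
    # Normalized Google Distance over the inlink sets, mapped to a similarity.
    distance = (math.log(max(len(a), len(b))) - math.log(len(shared))) / (
        math.log(total_articles) - math.log(min(len(a), len(b)))
    )
    return max(0.0, 1.0 - distance)

# Toy usage with made-up inlink sets and a rough English Wikipedia size:
print(link_relatedness({"A", "B", "C"}, {"B", "C", "D"}, 6_000_000))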
You can of course also utilize similarity measures from the recommender systems and information retrieval fields, e.g. use edit histories to identify articles that have been edited by the same users, or apply search engine techniques like TF-IDF and content vectors; a quick sketch of both ideas follows below.
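Here is a minimal sketch of those two baselines in Python with scikit-learn, assuming you have already fetched the article plaintext and the lists of editors (e.g. via the MediaWiki API); in practice you would fit the TF-IDF vectorizer on a larger corpus of articles so the IDF weights are meaningful, rather than on just the two texts as done here:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_similarity(text_a, text_b):
    """Cosine similarity between TF-IDF vectors of two article texts."""
    vectorizer = TfidfVectorizer(stop_words="english")
    # Fitting on only two documents gives crude IDF weights; fit on a
    # bigger sample of articles if you can.
    vectors = vectorizer.fit_transform([text_a, text_b])
    return cosine_similarity(vectors[0], vectors[1])[0, 0]

def editor_overlap(editors_a, editors_b):
    """Jaccard overlap of the sets of users who edited each article."""
    a, b = set(editors_a), set(editors_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0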
Cheers, Morten
On Sat, 4 May 2019 at 04:48, Haifeng Zhang <haifeng1@andrew.cmu.edu> wrote:
Dear folks,
Is there a way to compute content similarity between two Wikipedia articles?
For example, I can think of representing each article as a vector of likelihoods over possible topics.
But I wonder whether there is other work people have already explored in the past.
Thanks,
Haifeng