One tool that may be useful to you: as of approximately June 2013,
Wikipedia marks all disambiguation pages as such in the database. There is
also an API with equivalent functionality (see
https://www.mediawiki.org/wiki/Extension:Disambiguator#API_usage). You can
also use this data to generate a complete list of all disambiguation pages
on Wikipedia. You can view such a list on-wiki at
https://en.wikipedia.org/wiki/Special:DisambiguationPages. So, for example,
we see that 'Apple (disambiguation)' is listed, which tells us two pieces
of information:
1. 'Apple' is a term which may require disambiguation
2. 'Apple' has one clear dominant usage; otherwise the disambiguation page
would be located at 'Apple' rather than 'Apple (disambiguation)'
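As an illustrative sketch (the helper names below are my own, and the
canned JSON only mimics the shape of a real response), the 'disambiguation'
pageprop set by the extension can be checked through the standard query
API with prop=pageprops:

```python
import json
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"

def disambig_query_url(title):
    """Build an API URL asking whether `title` carries the
    'disambiguation' page property set by Extension:Disambiguator."""
    params = {
        "action": "query",
        "prop": "pageprops",
        "ppprop": "disambiguation",
        "titles": title,
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)

def is_disambiguation(api_response):
    """Given the decoded JSON response, return True if any returned
    page has the 'disambiguation' pageprop."""
    pages = api_response.get("query", {}).get("pages", {})
    return any("disambiguation" in p.get("pageprops", {})
               for p in pages.values())

# Canned response mimicking what the API returns for a disambiguation page:
sample = json.loads("""{
  "query": {
    "pages": {
      "19619306": {
        "title": "Apple (disambiguation)",
        "pageprops": {"disambiguation": ""}
      }
    }
  }
}""")
print(is_disambiguation(sample))  # True
```

Fetching disambig_query_url("Apple (disambiguation)") over HTTP and
feeding the decoded JSON to is_disambiguation() would give the live answer.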
For more information about this, look at the documentation at
https://www.mediawiki.org/wiki/Extension:Disambiguator.
Ryan Kaldari
On Tue, Dec 10, 2013 at 10:09 PM, Shilad Sen <ssen(a)macalester.edu> wrote:
Greetings!
I'm a Professor at Macalester College in Minnesota, and I have been
collaborating with Brent Hecht and many students to develop a Java
framework for extracting multilingual knowledge from Wikipedia [1]. The
framework is pre-alpha now, but we hope to offer a stable release in the
next month.
Given a phrase (e.g. "apple"), our library must identify the articles
associated with that phrase. This is a probabilistic question: how likely
is the phrase "apple" to refer to the article about the fruit vs. the company?
This simple task (often called Wikification or disambiguation) forms the
basis of many NLP algorithms.
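To make the probabilistic framing concrete, here is a minimal sketch (my
own toy data and function names, not WikAPIdia's actual API) that estimates
P(article | phrase) from counts of observed (anchor text, target article)
link pairs, the kind of evidence a links dataset provides:

```python
from collections import Counter, defaultdict

# Hypothetical sample of (anchor text, target article) link pairs:
links = [
    ("apple", "Apple Inc."), ("apple", "Apple Inc."),
    ("apple", "Apple Inc."), ("apple", "Apple"),  # "Apple" = the fruit article
    ("jaguar", "Jaguar Cars"), ("jaguar", "Jaguar"),
]

def link_probabilities(pairs):
    """Estimate P(article | phrase) as the fraction of links with that
    anchor text pointing at each article."""
    counts = defaultdict(Counter)
    for phrase, article in pairs:
        counts[phrase.lower()][article] += 1
    return {
        phrase: {art: n / sum(c.values()) for art, n in c.items()}
        for phrase, c in counts.items()
    }

probs = link_probabilities(links)
print(probs["apple"])  # {'Apple Inc.': 0.75, 'Apple': 0.25}
```

With real link data the same relative-frequency estimate (often smoothed)
is a standard baseline for Wikification.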
Google and Stanford have released an awesome dataset to support this task
[2]. It contains the *text* of all internet hyperlinks to Wikipedia
articles. This dataset makes the problem much easier, but it has two
serious deficiencies. First, it only contains links to articles in English
Wikipedia. Second, it was generated once by Google, and it is unlikely
Google will update it.
The WMF could create a similar dataset by publishing the most common
inbound search queries for all WP pages across all language editions. This
dataset would enable individuals, researchers and small companies (not just
Google and Microsoft) to harness Wikipedia data for their applications.
Does this seem remotely possible? I've thought a little about the
engineering and privacy issues related to the dataset. Neither is trivial,
but I think both are feasible, and I'd be happy to volunteer my engineering
effort.
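One simple privacy safeguard, sketched below purely as an assumption on my
part (not a WMF policy or an actual proposal detail), is a k-anonymity-style
threshold: aggregate (query, landing page) events and publish only pairs
seen at least k times, so rare, potentially identifying queries never
appear in the dataset:

```python
from collections import Counter

def aggregate_queries(log, k=10):
    """Count (search query, landing page) pairs and drop any pair seen
    fewer than k times - a simple thresholding safeguard (assumption)."""
    counts = Counter(log)
    return {pair: n for pair, n in counts.items() if n >= k}

log = [("apple fruit", "Apple")] * 12 + [("rare query", "Apple")] * 2
released = aggregate_queries(log)
print(released)  # {('apple fruit', 'Apple'): 12}
```

A real release would need more than thresholding (e.g. query normalization
and review for sensitive strings), but it shows the basic shape of the
aggregation.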
If you think the idea has legs, how do we develop a more formal proposal
about the dataset?
Thanks for your feedback!
-Shilad
[1]
https://github.com/shilad/wikAPIdia
[2]
http://googleresearch.blogspot.com/2012/05/from-words-to-concepts-and-back.…
--
Shilad W. Sen
Assistant Professor
Mathematics, Statistics, and Computer Science Dept.
Macalester College
ssen(a)macalester.edu
651-696-6273
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics