Wikipedia dataset to support NLP disambiguation - Analytics

11 Dec 2013

Greetings!

I'm a Professor at Macalester College in Minnesota, and I have been
collaborating with Brent Hecht and many students to develop a Java
framework for extracting multilingual knowledge from Wikipedia [1]. The
framework is pre-alpha now, but we hope to offer a stable release in the
next month.

Given a phrase (e.g. "apple"), our library must identifying articles
associated with a phrase. This is a probabilistic question. How likely is
the phrase "apple" to refer to the article about the fruit vs the company?
This simple task (often called Wikification or disambiguation) forms the
basis of many NLP algorithms.

Google and Stanford have released an awesome dataset to support this task
[2]. It contains the *text* of all internet hyperlinks to Wikipedia
articles. This dataset makes the problem much easier, but it has two
serious deficiencies. First, it only contains links to articles in English
Wikipedia. Second, it was generated once by Google, and it is unlikely
Google will update it.

The WMF could create a similar dataset by publishing the most common
inbound search queries for all WP pages across all language editions. This
dataset would enable individuals, researchers and small companies (not just
Google and Microsoft) to harness Wikipedia data for their applications.

Does this seem remotely possible? I've thought a little about engineering
and privacy issues related to the dataset. Neither are trivial, but I think
they are feasible, and I'd be happy to volunteer my engineering effort.

If you think the idea has legs, how do we develop a more formal proposal
 about the dataset?

Thanks for your feedback!

-Shilad

[1] https://github.com/shilad/wikAPIdia
[2]
http://googleresearch.blogspot.com/2012/05/from-words-to-concepts-and-back.…

-- 
Shilad W. Sen
Assistant Professor
Mathematics, Statistics, and Computer Science Dept.
Macalester College
ssen(a)macalester.edu
651-696-6273