Here's a big-data dataset from Google Research and UMass IESL: 40 million "links to Wikipedia pages where the anchor text of the link closely matches the title of the target Wikipedia page," drawn from 10 million web pages, for the purposes of contextualized disambiguation:
Learning from Big Data: 40 Million Entities in Context http://googleresearch.blogspot.co.uk/2013/03/learning-from-big-data-40-milli...
In hopes that the work might be interesting or useful to the folks here,
Pete
On Sat, Mar 9, 2013 at 6:32 PM, Peter Kaminski <kaminski@istori.com> wrote:
Here's a big-data dataset from Google Research and UMass IESL: 40 million "links to Wikipedia pages where the anchor text of the link closely matches the title of the target Wikipedia page," drawn from 10 million web pages, for the purposes of contextualized disambiguation:
I wonder how many of these disambiguating links to Wikipedia actually fail to disambiguate. People assume that a link to Wikipedia never lets you down, but it often points at the wrong thing. E.g. "_John McLaughlin_ formed Mahavishnu Orchestra" (links to a disambiguation page) or "Gerrit is written in _Java_" (links to the island, not the programming language). And _John Howard_ will no longer link to the Australian politician if someone more famous with that name comes along.
"how to find out if different web pages are talking about the same person or other entity": Wikidata removes all doubt, http://www.wikidata.org/wiki/Q164757 ! I assume that other knowledge projects have noticed these entities, and that Q numbers are becoming a lingua franca. I'm reserving Q42666789 for my talented, sure-to-be-famous offspring. :-)
Google clearly enjoys the fruits of Wikipedians' hard work.
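To make the Q-number point above concrete, here is a minimal sketch, assuming that two pages count as "talking about the same entity" when they link to the same Wikidata item. The helper names `qid_from_url` and `same_entity` are made up for illustration, and the code only parses URLs; it does not contact Wikidata or check that the item exists.

```python
import re
from urllib.parse import urlparse

# Assumption for this sketch: a Q number is "Q" followed by a positive integer.
_QID = re.compile(r"^Q[1-9][0-9]*$")


def qid_from_url(url):
    """Return the Q number at the end of a Wikidata URL, or None.

    Hypothetical helper: it looks only at the last path segment, so it
    accepts both /wiki/Q... and /entity/Q... style URLs.
    """
    last_segment = urlparse(url).path.rsplit("/", 1)[-1]
    return last_segment if _QID.match(last_segment) else None


def same_entity(url_a, url_b):
    """True only if both URLs carry a Q number and the numbers are equal."""
    qid_a = qid_from_url(url_a)
    qid_b = qid_from_url(url_b)
    return qid_a is not None and qid_a == qid_b
```

Unlike a link to a Wikipedia title, which can drift to a disambiguation page or to a different article, a comparison like this depends only on the identifier, which is the appeal of Q numbers as a shared key.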
-- 
=S Page
software engineer on Editor Engagement Experiments
wikitech-l@lists.wikimedia.org