Over time, Google has released a huge amount of open data from or about Wikipedia. Check
it out:
http://googleresearch.blogspot.com/2013/12/free-language-lessons-for-comput…
Some highlights:
50,000 Lessons on How to Read: a Relation Extraction Corpus
What is it: A human-judged dataset of two relations involving public figures on Wikipedia:
about 10,000 examples of “place of birth” and 40,000 examples of “attended or graduated
from an institution.”
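To give a rough sense of the shape of the data: if each example ships as one JSON object
per line carrying the relation, the two entities, and the raters' judgments, filtering
for rater agreement is a few lines of Python. The field names below are my guesses, not
the documented schema, so check the actual release notes.

    import json

    # One toy record in the shape I'd expect; field names are guesses.
    line = '''{"pred": "/people/person/place_of_birth",
               "sub": "/m/0xyz", "obj": "/m/0abc",
               "judgments": [{"rater": "r1", "judgment": "yes"},
                             {"rater": "r2", "judgment": "yes"},
                             {"rater": "r3", "judgment": "no"}]}'''

    record = json.loads(line)
    yes = sum(1 for j in record["judgments"] if j["judgment"] == "yes")
    # keep the example only if a majority of raters agreed
    if yes > len(record["judgments"]) / 2:
        print(record["pred"], record["sub"], record["obj"])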
40 Million Entities in Context
What is it: A disambiguation set consisting of pointers to 10 million web pages with 40
million entities that have links to Wikipedia. This is another entity resolution corpus,
since the links can be used to disambiguate the mentions, but unlike the ClueWeb example
in the original post, the links are inserted by the web page authors themselves and can
therefore be considered human annotation.
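A minimal sketch of why that matters: each author-chosen link target doubles as a gold
label for its anchor text, and the targets co-occurring on a page form that page's entity
context. The tuple shape here is an assumption on my part, not the shipped format.

    from collections import defaultdict

    # Toy (page_url, anchor_text, wikipedia_target) mentions.
    mentions = [
        ("http://example.com/a", "jaguar", "Jaguar_Cars"),
        ("http://example.com/a", "Coventry", "Coventry"),
        ("http://example.com/b", "jaguar", "Jaguar"),
    ]

    # Targets seen on the same page give the context for disambiguation.
    entities_on_page = defaultdict(set)
    for page, anchor, target in mentions:
        entities_on_page[page].add(target)

    print(entities_on_page["http://example.com/a"])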
Distributing the Edit History of Wikipedia Infoboxes
What is it: The edit history of 1.8 million infoboxes in Wikipedia pages in one handy
resource. Attributes on Wikipedia change over time, and some of them change more than
others. Understanding attribute change is important for extracting accurate and useful
information from Wikipedia.
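A toy sketch of the kind of analysis this enables, assuming edit records of the form
(page, attribute, new_value, timestamp), which is my guess at the schema rather than the
shipped one: counting how often each attribute changes tells you which ones need fresh
sources and which are stable.

    from collections import Counter

    # Toy edit records; the schema is an assumption.
    edits = [
        ("Berlin", "population", "3400000", "2008-01-12"),
        ("Berlin", "population", "3500000", "2011-06-03"),
        ("Berlin", "leader_name", "Klaus Wowereit", "2006-11-20"),
    ]

    # Attributes that churn often need fresher sources than stable ones.
    changes = Counter(attribute for _, attribute, _, _ in edits)
    print(changes.most_common())  # [('population', 2), ('leader_name', 1)]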
Dictionaries for linking Text, Entities, and Ideas
What is it: A large database of 175 million strings paired with 7.5 million concepts,
annotated with counts and mined from Wikipedia. The concepts
in this case are Wikipedia articles, and the strings are anchor text spans that link to
the concepts in question.
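Counts like these support the classic "commonness" prior P(concept | string): the
fraction of links with a given anchor string that point at a given article. A toy sketch
with made-up numbers:

    from collections import defaultdict

    # Toy (string, concept, count) rows in the shape the description suggests.
    rows = [
        ("jaguar", "Jaguar_Cars", 700),
        ("jaguar", "Jaguar", 280),               # the animal
        ("jaguar", "Jacksonville_Jaguars", 20),
    ]

    table = defaultdict(dict)
    totals = defaultdict(int)
    for string, concept, count in rows:
        table[string][concept] = count
        totals[string] += count

    def link_probability(string, concept):
        """Fraction of this anchor string's links pointing at this concept."""
        return table[string].get(concept, 0) / totals[string]

    print(link_probability("jaguar", "Jaguar_Cars"))  # 0.7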
Dario
(ht Nicolas Torzec)