Re: [Wikidata] NLP text corpus annotated with Wikidata entities?

7 Feb 2017


I don't know of such a resource offhand, but you might want to consider
expanding your search to text corpora annotated with Freebase or Google
Knowledge Graph IDs (the same IDs are used for both). Wikidata contains
mappings to Freebase IDs, although they are somewhat incomplete (and this
additional mapping step adds an extra layer of variability).
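A rough sketch of that mapping step, in case it helps: Wikidata stores the
Freebase ID as property P646, so a batch of Freebase MIDs can be looked up
through the public SPARQL endpoint. The function names and the User-Agent
string below are my own placeholders, and any MID that is missing from the
result simply has no mapping in Wikidata yet.

```python
import json
import urllib.parse
import urllib.request

# Public Wikidata Query Service endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_query(freebase_ids):
    """Build a SPARQL query that maps Freebase MIDs to Wikidata items via P646."""
    values = " ".join('"{}"'.format(mid) for mid in freebase_ids)
    return (
        "SELECT ?mid ?item WHERE { "
        "VALUES ?mid { " + values + " } "
        "?item wdt:P646 ?mid . }"
    )


def parse_bindings(response):
    """Turn a SPARQL JSON result into {freebase_mid: [wikidata_qid, ...]}."""
    mapping = {}
    for row in response["results"]["bindings"]:
        mid = row["mid"]["value"]
        # Item URIs look like http://www.wikidata.org/entity/Q...; keep the last part.
        qid = row["item"]["value"].rsplit("/", 1)[-1]
        mapping.setdefault(mid, []).append(qid)
    return mapping


def fetch_mapping(freebase_ids):
    """Query the endpoint (needs network access) and return the parsed mapping."""
    params = urllib.parse.urlencode(
        {"query": build_query(freebase_ids), "format": "json"}
    )
    request = urllib.request.Request(
        WDQS_ENDPOINT + "?" + params,
        # Placeholder User-Agent; replace with something identifying your tool.
        headers={"User-Agent": "freebase-to-wikidata-sketch/0.1"},
    )
    with urllib.request.urlopen(request) as resp:
        return parse_bindings(json.load(resp))
```

A MID can come back with more than one item, which is why the values are
lists; how to resolve those cases is a judgment call for your evaluation.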
The other issue is that all of the corpora that I'm aware of are
automatically annotated, so they're not "gold standard" truth sets, but you
could cherry-pick the high-confidence annotations and/or do additional
human verification.
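For the cherry-picking, something like the following would do as a starting
point. Note that the tab-separated layout here (document ID, mention, start
offset, end offset, confidence, entity ID) is a made-up one for illustration;
the real corpora each have their own column order, so check their
documentation, and the 0.9 threshold is arbitrary.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Annotation(NamedTuple):
    doc_id: str
    mention: str
    start: int
    end: int
    confidence: float
    entity_id: str


def parse_annotations(lines: Iterable[str]) -> Iterator[Annotation]:
    """Parse lines in the hypothetical six-column TSV layout described above."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        doc_id, mention, start, end, confidence, entity_id = line.split("\t")
        yield Annotation(
            doc_id, mention, int(start), int(end), float(confidence), entity_id
        )


def high_confidence(
    annotations: Iterable[Annotation], threshold: float = 0.9
) -> List[Annotation]:
    """Keep only annotations whose confidence is at or above the threshold."""
    return [a for a in annotations if a.confidence >= threshold]
```

Whatever survives the filter is still machine-annotated, so a human pass over
a sample would tell you how much to trust it.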
Two that I know of are:
ClueWeb09 & ClueWeb12 - 800M documents, 11B "clues" -
https://research.googleblog.com/2013/07/11-billion-clues-in-800-million.html
TREC KBA Stream Corpus 2014 - 394M documents, 9.4B mentions -
http://trec-kba.org/data/fakba1/
I haven't seen any recent releases of similar stuff. Not sure what
identifiers Google will use for this kind of work in the future now that
they've shut down Freebase.
Tom
On Sun, Feb 5, 2017 at 9:47 AM, Samuel Printz samuel.printz@outlook.de
wrote:
Hello everyone,
I am looking for a text corpus that is annotated with Wikidata entities.
I need this for the evaluation of an entity linking tool based on
Wikidata, which is part of my bachelor thesis.
Does such a corpus exist?
Ideally, the corpus would be annotated in the NIF format [1], as I want to
use GERBIL [2] for the evaluation, but that is not a requirement.
Thanks for any hints!
Samuel
[1] https://site.nlp2rdf.org/
[2] http://aksw.org/Projects/GERBIL.html

Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata

