I don't know of such a resource off-hand, but you might want to consider expanding your search to text corpora annotated with Freebase or Google Knowledge Graph IDs (the same IDs are used for both). Wikidata contains mappings to Freebase IDs, although they are somewhat incomplete (and this additional mapping adds an extra layer of variability).
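
In case it helps, here is a rough sketch of how the Freebase-to-Wikidata step could look in Python. It assumes the mapping is read from the Freebase ID property (P646) through the public query service at https://query.wikidata.org/sparql; the function names, the User-Agent string and the batching of MIDs into a single VALUES clause are my own choices for illustration, not anything prescribed. MIDs with no match simply won't appear in the result, which is where the incompleteness shows up.

```python
import json
import urllib.parse
import urllib.request

# Public Wikidata SPARQL endpoint (assumed reachable from your machine).
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_query(mids):
    """Build a SPARQL query mapping Freebase MIDs to Wikidata items via P646."""
    # json.dumps gives a double-quoted, escaped literal for each MID.
    values = " ".join(json.dumps(mid) for mid in mids)
    return (
        "SELECT ?mid ?item WHERE { "
        f"VALUES ?mid {{ {values} }} "
        "?item wdt:P646 ?mid . }"
    )


def parse_bindings(result):
    """Turn a SPARQL JSON result into {mid: [QID, ...]}."""
    mapping = {}
    for row in result["results"]["bindings"]:
        mid = row["mid"]["value"]
        # Item URIs end in the QID, e.g. .../entity/Q42
        qid = row["item"]["value"].rsplit("/", 1)[-1]
        mapping.setdefault(mid, []).append(qid)
    return mapping


def lookup(mids):
    """Query the endpoint for a (small) batch of MIDs. Needs network access."""
    params = urllib.parse.urlencode({"query": build_query(mids), "format": "json"})
    request = urllib.request.Request(
        WDQS_ENDPOINT + "?" + params,
        headers={
            "Accept": "application/sparql-results+json",
            # Hypothetical User-Agent; replace with something identifying you.
            "User-Agent": "mid-to-qid-sketch/0.1 (example)",
        },
    )
    with urllib.request.urlopen(request) as response:
        return parse_bindings(json.load(response))
```

Calling lookup() with a list of MIDs taken from the annotations returns a dict keyed by MID. A MID can map to more than one item, so the values are lists, and you would have to decide how to treat those cases in an evaluation.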

The other issue is that all of the corpora that I'm aware of are automatically annotated, so they're not "gold standard" truth sets, but you could cherry-pick the high-confidence annotations and/or do additional human verification.
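
Cherry-picking by confidence could look roughly like the sketch below. Note the assumptions: I'm supposing a tab-separated annotation file with eight columns (document ID, encoding, mention text, start offset, end offset, mention confidence, context confidence, Freebase MID). Check the documentation of whichever corpus you use, since the actual layout may differ. The thresholds are arbitrary placeholders, not recommended values.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class Annotation(NamedTuple):
    doc_id: str
    mention: str
    start: int
    end: int
    mention_conf: float
    context_conf: float
    mid: str


def parse_line(line: str) -> Optional[Annotation]:
    """Parse one annotation line; return None if it doesn't fit the assumed layout."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 8:
        return None
    doc_id, _encoding, mention, start, end, m_conf, c_conf, mid = fields
    try:
        return Annotation(doc_id, mention, int(start), int(end),
                          float(m_conf), float(c_conf), mid)
    except ValueError:
        return None


def high_confidence(lines: Iterable[str],
                    min_mention: float = 0.9,
                    min_context: float = 0.5) -> Iterator[Annotation]:
    """Yield only annotations whose two scores both meet the (placeholder) thresholds."""
    for line in lines:
        annotation = parse_line(line)
        if annotation is None:
            continue  # skip malformed lines rather than failing
        if (annotation.mention_conf >= min_mention
                and annotation.context_conf >= min_context):
            yield annotation
```

Since high_confidence() accepts any iterable of lines, you can pass it an open file handle and stream through a large annotation file without loading it into memory. The surviving annotations would still be worth spot-checking by hand before you treat them as ground truth.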

Two that I know of are:

ClueWeb09 & ClueWeb12 - 800M documents, 11B "clues" - https://research.googleblog.com/2013/07/11-billion-clues-in-800-million.html

TREC KBA Stream Corpus 2014 - 394M documents, 9.4B mentions - http://trec-kba.org/data/fakba1/

I haven't seen any recent releases of similar stuff. Not sure what identifiers Google will use for this kind of work in the future now that they've shut down Freebase.

Tom

On Sun, Feb 5, 2017 at 9:47 AM, Samuel Printz <samuel.printz@outlook.de> wrote:

Hello everyone,

I am looking for a text corpus that is annotated with Wikidata entities.
I need this for the evaluation of an entity linking tool based on
Wikidata, which is part of my bachelor thesis.

Does such a corpus exist?

Ideal would be a corpus annotated in the NIF format [1], as I want to
use GERBIL [2] for the evaluation. But it is not necessary.

Thanks for hints!
Samuel

[1] https://site.nlp2rdf.org/
[2] http://aksw.org/Projects/GERBIL.html

_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata