Hi Samuel,
If you haven't already seen it, take a look at the following dataset. It may come in handy in your case: http://deepdive.stanford.edu/opendata/#wiki-wikipedia-english-edition
Best, Leila
---
Leila Zia Senior Research Scientist Wikimedia Foundation
On Mon, Feb 6, 2017 at 6:25 AM, Dimitris Kontokostas jimkont@gmail.com wrote:
I am quoting a response by my colleague Martin Brummer (in cc), who answered a similar question recently:
There are the DBpedia NIF abstract datasets, which contain DBpedia abstracts, article structure annotations and the entity links contained in the abstracts, currently available in 9 languages [1]. Entity links in those datasets are only the links set by Wikipedia editors. This means each linked entity is only linked once in the article (the first time it is mentioned). Repeat mentions of the entity are not linked again. [...Martin & Milan...] tried to remedy this issue by additionally linking other surface forms of entities previously mentioned in the abstract in this older version of the corpus, available in 7 languages [2].
[1] http://wiki.dbpedia.org/nif-abstract-datasets
[2] https://datahub.io/dataset/dbpedia-abstract-corpus
DBpedia is also working on providing the whole Wikipedia pages in NIF format with annotated links. These will be available for the upcoming release.
As Markus said, switching Wikipedia/DBpedia IRIs to Wikidata should be trivial when Wikidata IRIs exist.
Best, Dimitris
On Mon, Feb 6, 2017 at 4:04 PM, Shilad Sen ssen@macalester.edu wrote:
Whoops! Apologies for shortening your name to "Sam." Looks like the coffee has not yet kicked in this morning...
On Mon, Feb 6, 2017 at 8:02 AM, Shilad Sen ssen@macalester.edu wrote:
Hi Sam,
The NLP task you are referring to is often called "wikification," and if you Google using that term you'll find some hits for datasets. Here's the first one I found: https://cogcomp.cs.illinois.edu/page/resource_view/4
I also have a full EN corpus marked up by a simple Wikification algorithm. It's not very good, but you are welcome to it!
-Shilad
On Mon, Feb 6, 2017 at 3:28 AM, Samuel Printz samuel.printz@outlook.de wrote:
Hello Markus,
Taking a Wikipedia-annotated corpus and replacing the Wikipedia URIs with the respective Wikidata URIs is a great idea. I think I'll try that out.
Thank you!
Samuel
On 05.02.2017 at 21:40, Markus Kroetzsch wrote:
On 05.02.2017 15:47, Samuel Printz wrote:
Hello everyone,
I am looking for a text corpus that is annotated with Wikidata entities.
I need this for the evaluation of an entity linking tool based on Wikidata, which is part of my bachelor thesis.
Does such a corpus exist?
Ideally, the corpus would be annotated in the NIF format [1], as I want to use GERBIL [2] for the evaluation, but this is not a requirement.
I don't know of any such corpus, but Wikidata is linked with Wikipedia in all languages. You can therefore take any Wikipedia article and find, with very little effort, the Wikidata entity for each link in the text.
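The lookup Markus describes can be sketched roughly as follows. This is a minimal illustration, not something posted in the thread: it assumes the public MediaWiki API of the English Wikipedia and its `pageprops` property `wikibase_item`, and the function names (`build_query_url`, `parse_qids`, `fetch_qids`) and the User-Agent string are made up for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: English Wikipedia; use another language edition's host as needed.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_query_url(titles, api_url=API_URL):
    """Build a MediaWiki API URL asking for the Wikidata item of each title."""
    params = {
        "action": "query",
        "prop": "pageprops",
        "ppprop": "wikibase_item",  # page property holding the Wikidata QID
        "redirects": 1,             # follow redirects to the target article
        "titles": "|".join(titles),
        "format": "json",
        "formatversion": 2,         # 'pages' is returned as a list
    }
    return api_url + "?" + urlencode(params)


def parse_qids(response):
    """Map each returned page title to its Wikidata QID, or None if absent."""
    result = {}
    for page in response.get("query", {}).get("pages", []):
        result[page["title"]] = page.get("pageprops", {}).get("wikibase_item")
    return result


def fetch_qids(titles):
    """Query the live API (needs network access) and return title -> QID."""
    request = Request(
        build_query_url(titles),
        headers={"User-Agent": "corpus-mapping-sketch/0.1"},  # hypothetical
    )
    with urlopen(request) as reply:
        return parse_qids(json.load(reply))
```

Calling `fetch_qids(["Berlin", "Douglas Adams"])` would return a dictionary keyed by the titles as the API reports them, so redirected or normalized titles may come back under a different name than the one requested. For a whole corpus, titles would have to be sent in batches, or resolved offline from a dump instead.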
The downside of this is that Wikipedia pages do not link all occurrences of all linkable entities. You can get a higher coverage when taking only the first paragraph of each page, but many things will still not be linked.
However, you could also take any existing Wikipedia-page annotated corpus and translate the links to Wikidata in the same way.
Finally, DBpedia is also linked to Wikipedia (in fact, the local names of entities are Wikipedia article names). So if you find any DBpedia-annotated corpus, you can also translate it to Wikidata easily.
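Since the local name of a DBpedia entity is the Wikipedia article name, the first step of that translation could look like the sketch below. It assumes resource IRIs of the form `http://dbpedia.org/resource/<Article_name>` (or a language-specific host such as `de.dbpedia.org`); the function name `dbpedia_iri_to_wikipedia_title` is hypothetical.

```python
from urllib.parse import unquote, urlsplit

# Assumption: entity IRIs use the '/resource/' path prefix.
RESOURCE_PREFIX = "/resource/"


def dbpedia_iri_to_wikipedia_title(iri):
    """Return the Wikipedia article title encoded in a DBpedia resource IRI."""
    path = urlsplit(iri).path
    if not path.startswith(RESOURCE_PREFIX):
        raise ValueError("not a DBpedia resource IRI: " + iri)
    local_name = path[len(RESOURCE_PREFIX):]
    # Undo percent-encoding and turn underscores back into spaces.
    return unquote(local_name).replace("_", " ")
```

The resulting title can then be resolved to a Wikidata entity in the same way as any other Wikipedia link.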
Good luck,
Markus
P.S. If you build such a corpus from another resource, it would be nice if you could publish it for others to save some effort :-)
Thanks for any hints! Samuel
[1] https://site.nlp2rdf.org/ [2] http://aksw.org/Projects/GERBIL.html
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
-- Shilad W. Sen
Associate Professor Mathematics, Statistics, and Computer Science Dept. Macalester College
Senior Research Fellow, Target Corporation
ssen@macalester.edu http://www.shilad.com https://www.linkedin.com/in/shilad 651-696-6273
-- Kontokostas Dimitris