Whoops! Apologies for shortening your name to "Sam." Looks like the coffee
has not yet kicked in this morning...
On Mon, Feb 6, 2017 at 8:02 AM, Shilad Sen <ssen(a)macalester.edu> wrote:
Hi Sam,
The NLP task you are referring to is often called "wikification," and if
you Google using that term you'll find some hits for datasets. Here's the
first one I found:
https://cogcomp.cs.illinois.edu/page/resource_view/4
I also have a full EN corpus marked up by a simple Wikification
algorithm. It's not very good, but you are welcome to it!
-Shilad
On Mon, Feb 6, 2017 at 3:28 AM, Samuel Printz <samuel.printz(a)outlook.de>
wrote:
Hello Markus,
Taking a Wikipedia-annotated corpus and replacing the Wikipedia URIs
with the respective Wikidata URIs is a great idea; I think I'll try that
out.
Thank you!
Samuel
On 05.02.2017 at 21:40, Markus Kroetzsch wrote:
On 05.02.2017 15:47, Samuel Printz wrote:
> Hello everyone,
>
> I am looking for a text corpus that is annotated with Wikidata entities.
> I need this for the evaluation of an entity linking tool based on
> Wikidata, which is part of my bachelor thesis.
>
> Does such a corpus exist?
>
> Ideally, the corpus would be annotated in the NIF format [1], as I want
> to use GERBIL [2] for the evaluation, but that is not strictly necessary.
I don't know of any such corpus, but Wikidata is linked with Wikipedia
in all languages. You can therefore take any Wikipedia article and
find, with very little effort, the Wikidata entity for each link in
the text.
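A minimal sketch of that link-to-entity lookup, assuming the standard shape of a `wbgetentities` response from the Wikidata API (the helper function name and the inlined sample response are illustrative; in practice you would fetch the response over HTTP):

```python
# Sketch: map a Wikipedia article title to its Wikidata QID using the
# sitelinks in a wbgetentities API response. The response shape below
# matches what the API returns for a query like:
#   action=wbgetentities&sites=enwiki&titles=Douglas_Adams&props=sitelinks

def qid_for_title(api_response: dict, site: str, title: str):
    """Return the QID whose sitelink for `site` matches `title`, else None."""
    for qid, entity in api_response.get("entities", {}).items():
        sitelink = entity.get("sitelinks", {}).get(site)
        if sitelink and sitelink.get("title") == title:
            return qid
    return None

# Inlined sample response (Douglas Adams really is Q42 on Wikidata).
sample = {
    "entities": {
        "Q42": {
            "sitelinks": {"enwiki": {"site": "enwiki", "title": "Douglas Adams"}}
        }
    }
}

print(qid_for_title(sample, "enwiki", "Douglas Adams"))  # Q42
```

With this in place, translating a Wikipedia-annotated corpus means resolving each link target's title once and substituting the returned QID.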
The downside of this is that Wikipedia pages do not link all
occurrences of all linkable entities. You can get a higher coverage
when taking only the first paragraph of each page, but many things
will still not be linked.
However, you could also take any existing Wikipedia-page annotated
corpus and translate the links to Wikidata in the same way.
Finally, DBpedia is also linked to Wikipedia (in fact, the local names
of its entities are Wikipedia article names). So if you find any
DBpedia-annotated corpus, you can also translate it to Wikidata easily.
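Because DBpedia local names are Wikipedia article names, that translation step is mostly string handling; a small sketch (the function name is mine):

```python
from urllib.parse import unquote

def dbpedia_to_title(uri: str) -> str:
    """Turn a DBpedia resource URI into the Wikipedia article title.

    DBpedia local names are percent-encoded Wikipedia titles with spaces
    replaced by underscores, so we undo both transformations.
    """
    local_name = uri.rsplit("/", 1)[-1]
    return unquote(local_name).replace("_", " ")

print(dbpedia_to_title("http://dbpedia.org/resource/Douglas_Adams"))
# Douglas Adams
```

The recovered title can then be resolved to a QID via the Wikipedia sitelinks, as described above.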
Good luck,
Markus
P.S. If you build such a corpus from another resource, it would be
nice if you could publish it for others to save some effort :-)
>
> Thanks for any hints!
> Samuel
>
> [1] https://site.nlp2rdf.org/
> [2] http://aksw.org/Projects/GERBIL.html
>
>
> _______________________________________________
> Wikidata mailing list
> Wikidata(a)lists.wikimedia.org
>
https://lists.wikimedia.org/mailman/listinfo/wikidata
>
--
Shilad W. Sen
Associate Professor
Mathematics, Statistics, and Computer Science Dept.
Macalester College
Senior Research Fellow, Target Corporation
ssen(a)macalester.edu
http://www.shilad.com
https://www.linkedin.com/in/shilad
651-696-6273