Hello everyone,
I am looking for a text corpus that is annotated with Wikidata entities. I need this for the evaluation of an entity linking tool based on Wikidata, which is part of my bachelor thesis.
Does such a corpus exist?
Ideally the corpus would be annotated in the NIF format [1], as I want to use GERBIL [2] for the evaluation, but that is not strictly necessary.
Thanks for any hints! Samuel
[1] https://site.nlp2rdf.org/ [2] http://aksw.org/Projects/GERBIL.html
I don't know of any such corpus, but Wikidata is linked with Wikipedia in all languages. You can therefore take any Wikipedia article and find, with very little effort, the Wikidata entity for each link in the text.
The downside of this is that Wikipedia pages do not link all occurrences of all linkable entities. You can get a higher coverage when taking only the first paragraph of each page, but many things will still not be linked.
However, you could also take any existing Wikipedia-page annotated corpus and translate the links to Wikidata in the same way.
Finally, DBpedia is also linked to Wikipedia (in fact, the local names of its entities are Wikipedia article names). So if you find any DBpedia-annotated corpus, you can also translate it to Wikidata easily.
Good luck,
Markus
P.S. If you build such a corpus from another resource, it would be nice if you could publish it for others to save some effort :-)
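A minimal sketch of the Wikipedia-link-to-Wikidata lookup Markus describes, using the MediaWiki API's pageprops module (prop=pageprops, ppprop=wikibase_item); the function name and example titles are illustrative, not part of his message:

```
import requests

def titles_to_qids(titles, lang="en"):
    """Map Wikipedia article titles (link targets) to Wikidata QIDs."""
    api = f"https://{lang}.wikipedia.org/w/api.php"
    params = {
        "action": "query",
        "prop": "pageprops",
        "ppprop": "wikibase_item",
        "titles": "|".join(titles),   # up to 50 titles per request
        "redirects": 1,               # resolve redirects to the canonical page
        "format": "json",
    }
    data = requests.get(api, params=params).json()
    mapping = {}
    for page in data["query"]["pages"].values():
        qid = page.get("pageprops", {}).get("wikibase_item")
        if qid:
            mapping[page["title"]] = qid   # keyed by the canonical title
    return mapping

print(titles_to_qids(["Douglas Adams", "Berlin"]))
# e.g. {'Douglas Adams': 'Q42', 'Berlin': 'Q64'}
```

Running this over the link targets extracted from an article (or from an existing Wikipedia-annotated corpus) gives the Wikidata entity for each annotation.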
Hello Markus,
Taking a Wikipedia-annotated corpus and replacing the Wikipedia URIs with the respective Wikidata URIs is a great idea; I think I'll try that out.
Thank you!
Samuel
Hi Sam,
The NLP task you are referring to is often called "wikification," and if you Google using that term you'll find some hits for datasets. Here's the first one I found: https://cogcomp.cs.illinois.edu/page/resource_view/4
I also have a full EN corpus marked up by a simple Wikification algorithm. It's not very good, but you are welcome to it!
-Shilad
Whoops! Apologies for shortening your name to "Sam." Looks like the coffee has not yet kicked in this morning...
-- Shilad W. Sen
Associate Professor Mathematics, Statistics, and Computer Science Dept. Macalester College
Senior Research Fellow, Target Corporation
ssen@macalester.edu http://www.shilad.com https://www.linkedin.com/in/shilad 651-696-6273
I am quoting a response by my colleague Martin Brummer (in cc) who answered a similar question recently:
```
There are the DBpedia NIF abstract datasets, which contain DBpedia abstracts, article structure annotations, and the entity links contained in the abstracts, currently available in 9 languages. [1]
Entity links in those datasets are only the links set by Wikipedia editors. This means each linked entity is only linked once in the article (the first time it is mentioned). Repeat mentions of the entity are not linked again.
[...Martin & Milan...] tried to remedy this issue by additionally linking other surface forms of entities previously mentioned in the abstract in this older version of the corpus, available in 7 languages [2].
[1] http://wiki.dbpedia.org/nif-abstract-datasets [2] https://datahub.io/dataset/dbpedia-abstract-corpus
```
DBpedia is also working on providing the whole Wikipedia pages in NIF format with annotated links. These will be available for the upcoming release.
As Markus said, switching Wikipedia/DBpedia IRIs to Wikidata IRIs should be trivial where the Wikidata IRIs exist.
Best, Dimitris
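A rough sketch of the IRI switch Dimitris and Markus mention, assuming the DBpedia SPARQL endpoint exposes owl:sameAs links into wikidata.org (the example resource and function name are only for illustration):

```
import requests

DBPEDIA_SPARQL = "https://dbpedia.org/sparql"

def dbpedia_to_wikidata(dbpedia_iri):
    """Return the Wikidata entity IRI linked to a DBpedia resource, if any."""
    query = f"""
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    SELECT ?wd WHERE {{
      <{dbpedia_iri}> owl:sameAs ?wd .
      FILTER(STRSTARTS(STR(?wd), "http://www.wikidata.org/entity/"))
    }}
    """
    resp = requests.get(
        DBPEDIA_SPARQL,
        params={"query": query, "format": "application/sparql-results+json"},
    )
    bindings = resp.json()["results"]["bindings"]
    return bindings[0]["wd"]["value"] if bindings else None

print(dbpedia_to_wikidata("http://dbpedia.org/resource/Douglas_Adams"))
# -> 'http://www.wikidata.org/entity/Q42' (if the sameAs link exists)
```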
Hi Samuel,
If you haven't already seen it, take a look at the following dataset. It may come in handy in your case: http://deepdive.stanford.edu/opendata/#wiki-wikipedia-english-edition
Best, Leila
---
Leila Zia Senior Research Scientist Wikimedia Foundation
To get the works that a person has written, I would use SPARQL with something like "SELECT * WHERE { ?work wdt:P50 ?author }".
I could also get the authors of a work via the Wikidata MediaWiki API.
My question is whether it is possible to get the works of an author, given the author, via the API. With my knowledge of the API, I would say it is not possible, except if you do something like Special:WhatLinksHere (list=backlinks) and process/filter all the results.
Finn Årup Nielsen http://people.compute.dtu.dk/faan/
Just say "wd:Q12345" (the author) instead of "?author" ?
The backlinks thing works, but is tedious. You'll need to load the items via action=wbgetentities to check if that link actually means "author", or some other property.
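A minimal sketch of Magnus's suggestion run against the Wikidata Query Service; wd:Q42 stands in for the author item, and the User-Agent string is an arbitrary example:

```
import requests

WDQS = "https://query.wikidata.org/sparql"

query = """
SELECT ?work ?workLabel WHERE {
  ?work wdt:P50 wd:Q42 .                     # works whose author (P50) is Q42
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

resp = requests.get(WDQS, params={"query": query, "format": "json"},
                    headers={"User-Agent": "works-by-author-example/0.1"})
for row in resp.json()["results"]["bindings"]:
    print(row["work"]["value"], row.get("workLabel", {}).get("value", ""))
```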
On 04/12/2017 05:57 PM, Magnus Manske wrote:
Just say "wd:Q12345" (the author) instead of "?author" ?
Yes, that is what we do all over in Scholia, e.g., https://tools.wmflabs.org/scholia/author/Q13520818
The backlinks thing works, but is tedious. You'll need to load the items via action=wbgetentities to check if that link actually means "author", or some other property.
We got a question from a reviewer asking why we used SPARQL in Scholia and not just the MediaWiki API. My initial thought was that it was not possible with the MediaWiki API, but then I thought of list=backlinks followed by (as Magnus points out) action=wbgetentities.
I was afraid that somewhere hidden in the MediaWiki API there would be query functionality that lets you get Wikidata property-filtered backlinks, but since Magnus doesn't point to it, I am pretty sure now that no such functionality exists. :)
/Finn
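A rough sketch of the API-only route described above (list=backlinks followed by action=wbgetentities), with pagination and error handling omitted; Q42 is again only an example author item:

```
import requests

API = "https://www.wikidata.org/w/api.php"
AUTHOR = "Q42"

# 1) Items linking to the author item (first page of results only; wbgetentities
#    accepts at most 50 ids per request, so keep the batch small).
backlinks = requests.get(API, params={
    "action": "query", "list": "backlinks", "bltitle": AUTHOR,
    "blnamespace": 0, "bllimit": 50, "format": "json",
}).json()["query"]["backlinks"]
candidate_ids = [b["title"] for b in backlinks]

# 2) Fetch the candidates and keep only those with an author (P50) claim for Q42.
entities = requests.get(API, params={
    "action": "wbgetentities", "ids": "|".join(candidate_ids),
    "props": "claims", "format": "json",
}).json()["entities"]

works = [
    qid for qid, ent in entities.items()
    if any(c["mainsnak"].get("datavalue", {}).get("value", {}).get("id") == AUTHOR
           for c in ent.get("claims", {}).get("P50", []))
]
print(works)
```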
One further alternative is the "Linked Data Fragments" (LDF) interface, which is supposed to be a bit lighter on the server than SPARQL -- but only returns a set of triples, so further actions would be needed if you wanted to get labels for them as well.
For example:
https://query.wikidata.org/bigdata/ldf?subject=&predicate=wdt:P50&ob...
-- James.
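An illustrative request against the LDF endpoint James links to, assuming it accepts full IRIs for the subject/predicate/object parameters (his URL suggests prefixed names like wdt:P50 also work) and returns Turtle; this is a sketch, not a verified client:

```
import requests

LDF = "https://query.wikidata.org/bigdata/ldf"

resp = requests.get(LDF, params={
    "predicate": "http://www.wikidata.org/prop/direct/P50",   # author
    "object": "http://www.wikidata.org/entity/Q42",           # the author item
}, headers={"Accept": "text/turtle"})

print(resp.text[:2000])  # first page of matching triples; follow hydra:next for more
```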
Hi all,
So at my university, undergraduate students must do three months of work towards writing a final short thesis. Generally this work doesn't need to involve research, but it should result in a demonstrable outcome: a tool, an application, something like that.
The students are in Computer Science and have taken various relevant courses including Semantic Web, Big Data, Data Mining, and so forth.
I was wondering if there was, for example, a list of possible topics internally within Wikidata ... topics that students could work on with some guidance here and there from a professor. (Not necessarily research-level topics, but also implementation or prototyping tasks, perhaps even regarding something more speculative, or "wouldn't it be nice if we could ..." style topics?)
If there is no such list, perhaps it might be a good idea to start thinking about one?
The students I talk with are very interested in doing tasks that could have real-world impact and I think in this setting, working on something relevant to the deployment of Wikidata would be a really great experience for them and hopefully also of benefit to Wikidata.
(And probably there are other professors in a similar context looking for interesting topics to assign students.)
Best, Aidan
Hi Aidan,
Thanks for reaching out. Such a list exists: https://phabricator.wikimedia.org/T90870 However, it doesn't make a lot of sense without some guidance, and some of these topics already have someone interested in working on them. It is best to have a quick call with me to discuss it.
Cheers Lydia
This looks great, Lydia, thanks!
The descriptions are enough for me to get the idea and explain it to a student.
If such a student is interested, we will let you know. :)
Best! Aidan
Hoi, Would you say that annotating wikilinks as well as red links has value not only for Wikipedia and Wikidata quality reasons, but that this is an additional reason why it makes sense?
Would it make sense for words in the text that do not have a link to still have an item, with statements referencing the item for the article? Thanks, GerardM
I don't know of such a resource off-hand, but you might want to consider expanding your search to text corpora annotated with Freebase or Google Knowledge Graph IDs (the same IDs are used for both). Wikidata contains mappings to Freebase IDs, although the mapping is somewhat incomplete (and this additional mapping adds an extra layer of variability).
The other issue is that all of the corpora I'm aware of are automatically annotated, so they're not "gold standard" truth sets, but you could cherry-pick the high-confidence annotations and/or do additional human verification.
Two that I know of are:
ClueWeb09 & ClueWeb12 - 800M documents, 11B "clues" - https://research.googleblog.com/2013/07/11-billion-clues-in-800-million.html
TREC KBA Stream Corpus 2014 - 394M documents, 9.4B mentions - http://trec-kba.org/data/fakba1/
I haven't seen any recent releases of similar datasets. I'm not sure what identifiers Google will use for this kind of work in the future now that they've shut down Freebase.
Tom
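A small sketch of the Freebase-to-Wikidata hop Tom describes, using the "Freebase ID" property (P646) on the Wikidata Query Service; the mid below is a placeholder to be replaced with the mids found in the corpus annotations:

```
import requests

WDQS = "https://query.wikidata.org/sparql"

def freebase_mid_to_qid(mid):
    """Look up the Wikidata item whose Freebase ID (P646) equals the given mid."""
    query = f'SELECT ?item WHERE {{ ?item wdt:P646 "{mid}" }}'
    resp = requests.get(WDQS, params={"query": query, "format": "json"},
                        headers={"User-Agent": "freebase-mapping-example/0.1"})
    bindings = resp.json()["results"]["bindings"]
    return bindings[0]["item"]["value"].rsplit("/", 1)[-1] if bindings else None

print(freebase_mid_to_qid("/m/0jcx"))  # placeholder mid from a ClueWeb-style annotation
```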
Hi all,
as many of you advised, I decided not to search for a Wikidata-annotated corpus, but to use a mapping between the entities of a "usual" corpus and the Wikidata entities instead.
Thanks for all the recommendations!
Samuel