Hi all!
As part of the Connected Open Heritage project, Wikimedia Sverige has been migrating Wiki Loves Monuments datasets from Wikipedias to Wikidata.
In the course of doing this we keep a record of the data which we fail to migrate. For each of these left-over bits we know which item and which property it belongs to, as well as the source field and language from the Wikipedia list. An example would be a "type of building" field where we could not match the text to an item on Wikidata but know that the target property is P31.
We have created dumps of these (such as https://tools.wmflabs.org/coh/_total_se-ship_new.json; don't worry, this one is tiny) but are now looking for an easy way for users to consume them.
Does anyone know of a tool which could do this today? The Wikidata Game only allows (AFAIK) for yes/no/skip, whereas here you would want something like <enter_value>/invalid/skip. And if not, are there any tools which, with a bit of forking, could be made to do it?
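If you want to poke at one of the dumps programmatically, here is a minimal Python sketch. I am only assuming the file is plain JSON; the exact schema is whatever you find in the dump itself:

    import json
    import urllib.request

    # Fetch one of the published left-over dumps and inspect it.
    # No schema assumptions: load the JSON and pretty-print a slice.
    URL = "https://tools.wmflabs.org/coh/_total_se-ship_new.json"

    with urllib.request.urlopen(URL) as response:
        data = json.load(response)

    # Each left-over datum identifies the target item and property,
    # plus the unmatched source text and its language (see above).
    print(json.dumps(data, indent=2, ensure_ascii=False)[:2000])

Each entry could then be fed to whatever input tool we settle on.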
We have only published a few dumps, but there are more to come. I would also imagine that this format, or a similar one, could be useful for other imports/template harvests where some fields are more easily handled by humans.
Any thoughts and suggestions are welcome.

Cheers,
André

André Costa | Senior Developer, Wikimedia Sverige | Andre.Costa@wikimedia.se | +46 (0)733-964574
Support free knowledge, become a member of Wikimedia Sverige. Read more at blimedlem.wikimedia.se
Dear André,
Great work!
I am wondering whether you are aware of the issues around the Danish dataset and the clean-up apparently required.
As far as I can determine, the German Wikipedia has a number of articles on Danish dolmens, and these are also available on Wikidata. As far as I can see, those items have not been linked with the new Swedish additions.
For instance, "Dolmen von Tornby" (https://www.wikidata.org/wiki/Q1269335) has no Danish ID but is probably one of these items added by the Alicia bot: https://www.wikidata.org/wiki/Q30240926, https://www.wikidata.org/wiki/Q30240928, https://www.wikidata.org/wiki/Q30114892 or https://www.wikidata.org/wiki/Q30114893.
There are quite a lot of Danish dolmens on the German Wikipedia: https://de.wikipedia.org/wiki/Kategorie:Gro%C3%9Fsteingrab_in_D%C3%A4nemark
I am sorry to present you with yet another problem. Perhaps the items can be matched by their geo-coordinates, along the lines of the sketch below.
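As an illustration, a rough Python sketch of a radius search against the Wikidata Query Service; the centre point and radius are placeholder examples, not the verified location of any particular dolmen:

    import requests

    # Find items with a coordinate (P625) within 1 km of an example
    # point, using the wikibase:around service of the query service.
    QUERY = """
    SELECT ?item ?itemLabel ?location WHERE {
      SERVICE wikibase:around {
        ?item wdt:P625 ?location .
        bd:serviceParam wikibase:center "Point(9.95 57.53)"^^geo:wktLiteral .
        bd:serviceParam wikibase:radius "1" .
      }
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en,de,da" . }
    }
    """

    r = requests.get("https://query.wikidata.org/sparql",
                     params={"query": QUERY, "format": "json"})
    for row in r.json()["results"]["bindings"]:
        print(row["item"]["value"], row["itemLabel"]["value"])

The distance filtering happens server-side, so matching a whole list of dolmens would only need one such query per monument.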
Best regards,
Finn
On Fri, Aug 4, 2017 at 10:57 AM, André Costa andre.costa@wikimedia.se wrote:
Does anyone know of a tool which could do this today? The Wikidata Game only allows (AFAIK) for yes/no/skip, whereas here you would want something like <enter_value>/invalid/skip. And if not, are there any tools which, with a bit of forking, could be made to do it?
(IANADeveloper, but) I believe the Wikidata Game might handle this? E.g. the "Date" game has fields for entering dates:
http://storage8.static.itmages.com/i/17/0807/h_1502126752_6195952_63d5e0e3da...
http://storage5.static.itmages.com/i/17/0807/h_1502126720_7252323_c4174b3da6...
https://tools.wmflabs.org/wikidata-game/#mode=no_date
I forget whether any of the Distributed Games have similar functionality (and I have no time to check now). Hope that helps!
Hi!
That reminds me of the crowdsourcing extension that LODRefine has: it lets you crowdsource the manual part of the reconciliation process. But it uses CrowdFlower for that (which is quite pricey). It'd be great if the Wikidata Game could evolve into a decent Wikimedia-focused alternative to this sort of service, but that would be a lot of work.
Does anybody know an alternative to CrowdFlower that can be used for free with volunteer workers?
Antonin
Hi Antonin,
On 8/7/17 20:36, Antonin Delpeuch (lists) wrote:
Does anybody know an alternative to CrowdFlower that can be used for free with volunteer workers?
There you go: https://crowdcrafting.org/
Hope this helps you keep up with your great work on OpenRefine.
I believe entity reconciliation is one of the most challenging tasks that keep third-party data providers away from imports to Wikidata. Cheers,
Marco
On 08/08/2017 10:13, Marco Fossati wrote:
On 8/7/17 20:36, Antonin Delpeuch (lists) wrote:
Does anybody know an alternative to CrowdFlower that can be used for free with volunteer workers?
There you go: https://crowdcrafting.org/
Hope this helps you keep up with your great work on OpenRefine.
I believe entity reconciliation is one of the most challenging tasks that keep third-party data providers away from imports to Wikidata.
Thanks a lot! This looks great indeed, and the backend (PyBossa) is open source and very generic, which is awesome! That means we could quite easily run it on WMF Cloud.
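For instance, loading reconciliation candidates as tasks should look roughly like this; an untested sketch against PyBossa's documented REST API, with the server URL, API key, project id and task payload all placeholders:

    import requests

    SERVER = "https://crowdcrafting.org"   # or a WMF Cloud instance
    API_KEY = "your-api-key"                # placeholder
    PROJECT_ID = 42                         # placeholder

    # Each task carries an arbitrary "info" payload that the task
    # presenter renders; here, one unmatched candidate from a dump.
    task = {"project_id": PROJECT_ID,
            "info": {"source_text": "hamnmagasin",
                     "target_property": "P31"}}

    r = requests.post(SERVER + "/api/task",
                      params={"api_key": API_KEY}, json=task)
    print(r.status_code, r.json())

A task presenter (HTML/JS) would then show each candidate and collect <enter_value>/invalid/skip answers, which is exactly the interaction André asked for.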
I've created an issue here: https://github.com/sparkica/LODRefine/issues/25
Cheers, Antonin
Hoi,
Given that Wikidata has identifiers to many external sources, reconciliation is often less of a challenge for crowds than it is made out to be. A few examples: OCLC maintains two distinct identifiers, VIAF and ISNI, and both are actively maintained. When we include VIAF numbers in Wikidata, there will be instances where the identifiers become redirects. The same is true for ISNI. When we have the latest VIAF numbers, the ISNI numbers are highly likely to be correct (better than 95%, the minimum requirement for imports at ISNI).
When we share our identifiers regularly, we will learn about redirects and gain the direct links. We shared our identifiers and VIAF identifiers with the Open Library. They now include them, and in return we received a file that helped us deduplicate our Open Library identifiers and replace the redirects. What is infuriating is that there are Open Library identifiers hidden in the Freebase data. They cannot be exported, so we cannot send them to OL for processing and import them into Wikidata. We do a subpar job as a consequence.
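To make that concrete, checking whether a stored VIAF number has become a redirect can be as simple as following the permalink and comparing the final URL. A sketch, assuming viaf.org answers merged identifiers with an HTTP redirect (the identifier below is a placeholder):

    import requests

    def resolve_viaf(viaf_id):
        # viaf.org permalinks redirect merged clusters to the surviving
        # record; follow the redirect and read the final identifier.
        r = requests.get("https://viaf.org/viaf/" + viaf_id,
                         allow_redirects=True, timeout=10)
        return r.url.rstrip("/").rsplit("/", 1)[-1]

    stored = "12345678"  # placeholder VIAF number
    current = resolve_viaf(stored)
    if current != stored:
        print("VIAF", stored, "now redirects to", current)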
Another project where we will gain information from multiple sources is the Biodiversity Heritage Library. We may gain links through their collaboration with the Internet Archive and OCLC. This will reduce the chance of us introducing duplicates, thanks to shared identifiers. It will also reduce the number of people we have to process before they are included in Wikidata. And it will allow OCLC, BHL and IA to learn of identifiers as we have them, allowing subsequent improvements in quality for all of us.
So in my opinion we should aggressively share identifiers, collaborate, seek out the redirects and replace them, and become more and more a focal point for links between resources.
Thanks,
GerardM
Gerard,
Sure, working with linked data is great. But sometimes data is not linked at all and has no identifiers...
That's where the work Antonin is doing with OpenRefine helps: reconciling even when there are no identifiers other than a name. Many datasets only have Strings as Things. In fact, I'd say it's quite useful not only to *add additional statements about existing Things* we already have, but *also to add more Things* in the world that have yet to be included in a database like Wikidata, where no identifiers have been created yet for that Thing.
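For anyone who hasn't tried it: the reconciliation protocol takes a batch of name queries and returns scored candidate items. A sketch against Antonin's Wikidata endpoint (the URL is the current tools.wmflabs.org one and may move; the field names follow the reconciliation API spec):

    import json
    import requests

    ENDPOINT = "https://tools.wmflabs.org/openrefine-wikidata/en/api"

    # One query, keyed "q0", asking for candidates matching a bare name.
    queries = {"q0": {"query": "Dolmen von Tornby"}}

    r = requests.get(ENDPOINT, params={"queries": json.dumps(queries)})
    for candidate in r.json()["q0"]["result"]:
        print(candidate["id"], candidate["name"], candidate["score"])

The scores make it easy to auto-accept confident matches and route the ambiguous rest to humans.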
And I'm just as upset as you are about the goldmine of data still locked up in Freebase. But don't worry, baby steps, and eventually that data will make its way into Wikidata. Getting the Primary Sources tool up to par is a big step towards that, but certainly not the end of the line.
-Thad +ThadGuidry https://www.google.com/+ThadGuidry