Status and ETA External ID conversion

Markus Krötzsch

5 Mar 2016

5:14 p.m.

Hi,

I noticed that many id properties still use the string datatype (including extremely frequent ids like https://www.wikidata.org/wiki/Property:P213 and https://www.wikidata.org/wiki/Property:P227).

Why is the conversion so slow, and when is it supposed to be completed?

Cheers,

Markus

Katie Filbert

5 Mar

5:20 p.m.

The community is checking each property to verify it should be converted:

https://www.wikidata.org/wiki/User:Addshore/Identifiers/0

https://www.wikidata.org/wiki/User:Addshore/Identifiers/1

https://www.wikidata.org/wiki/User:Addshore/Identifiers/2

I'm sure help is welcome in checking properties.

Once properties have been checked, we convert them in batches.

Cheers, Katie

Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata

-- Katie Filbert Wikidata Developer Wikimedia Germany e.V. | Tempelhofer Ufer 23-24, 10963 Berlin Phone (030) 219 158 26-0 http://wikimedia.de Wikimedia Germany - Society for the Promotion of free knowledge eV Entered in the register of Amtsgericht Berlin-Charlottenburg under the number 23 855 as recognized as charitable by the Inland Revenue for corporations I Berlin, tax number 27/681/51985.

Markus Krötzsch

7:26 p.m.

Thanks, Katie. I see that the external ID datatype does not work as planned. At least I thought the original idea was to clean up the UI by moving hard-to-understand string IDs to a separate section. From the discussions on these pages, I see that the community uses criteria that are completely unrelated to UI aspects, but have something to do with the degree to which the property encodes a one-to-one mapping. I guess this is also valid, but won't be useful for UI purposes. I will need to use another solution for my case then.

Markus

Luca Martinelli

8:30 p.m.

My2c, sorry if I'm going offtopic.

My impression of some properties is that we're probably underestimating some problems that are beyond our control, such as:

* the possibility that the original catalogue might have some duplicates, and we can actually help the original catalogue to correct this issue;
* the possibility that the Wikimedia approach and the catalogue's approach might lead one side to define something as two different things, while the other side treats it as a whole (for example, "palace+gardens");
* the possibility that some identifiers *are* standardised, but the authority did not publish a single catalogue, leaving the individual institutes to maintain their own catalogues (for example, the International Standard Identifier for Libraries and Related Organizations, aka P791);
* and so on.

Particularly the ISIL one is an important example to me, since I work for the Italian institution that is actually entitled to conduct the census of Italian libraries and assign the ISIL code to each and every library in Italy. There is no single world catalogue of that identifier? I really don't see it as a problem, as long as there is at least one national authority that does that job. We're probably underestimating the fact that not everything has been standardised at a world level - and that we can live with that just fine.

Probably the threshold we set up for the conversion is too high, and this might be one of the causes why the whole process has slowed down to a dying pace.

Maarten Dammers

10:09 p.m.

Hi Luca,

On 5-3-2016 at 14:30, Luca Martinelli wrote:

Probably the threshold we set up for the conversion is too high, and this might be one of the causes why the whole process has slowed down to a dying pace.

You call https://www.wikidata.org/wiki/Special:Contributions/Maintenance_script a dying pace?

Instead of complaining here, people should participate in https://www.wikidata.org/wiki/User:Addshore/Identifiers/0 . There are still plenty of easy properties that are clearly distinct, unique and have an external URL. It doesn't make sense to discuss the more complicated cases if we haven't gotten the easy cases out of the way yet.

Maarten

Luca Martinelli

10:45 p.m.

Point taken, I apologise for using too dramatic tones.

Nonetheless, I stick to the point that a ">99% unique identifier" threshold is probably too high. To give another example (disclaimer: I asked for this property since it is yet another catalogue that my institution runs), P1949 has not been converted to identifier because it has "only 98.82% unique out of 507 uses", which translates to only *six* cases out of 505 items which have two P1949 identifiers.
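
As a quick check of the quoted figure, the arithmetic can be sketched in Python. This is only an illustration: the thread does not say how the tracking page computes "unique", so the sketch assumes that the six exceptions are the only non-unique uses out of the 507 reported, and the function names are hypothetical.

```python
def percent_unique(unique_uses: int, total_uses: int) -> float:
    """Share of uses counted as unique, as a percentage rounded to two decimals."""
    return round(100 * unique_uses / total_uses, 2)


def min_unique_for_threshold(total_uses: int, threshold_percent: int) -> int:
    """Smallest number of unique uses that reaches the threshold (integer ceiling)."""
    return (threshold_percent * total_uses + 99) // 100


total = 507          # uses reported for P1949
unique = total - 6   # assumption: the six exceptions are the only non-unique uses

print(percent_unique(unique, total))        # percentage implied by 501 of 507
print(min_unique_for_threshold(total, 99))  # unique uses needed to reach 99%
```

Under that assumption, 501 of 507 uses gives the quoted 98.82%, while a 99% cut-off needs 502, so the property would miss the threshold by a single statement.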

Moreover, I did not intervene because of my blatant conflict of interest AND because I do not know with whom to discuss this or where, not even for the general "what is an identifier" discussion. Probably there is a place where this discussion is going on, and I apologise again for not knowing (though I have some pretty good excuses), and I'm serious when I say that I'd be thankful if you could point me in the general direction of where this is happening. :) (https://www.wikidata.org/wiki/User:Addshore/Identifiers maybe? Though that discussion seems to be pretty blocked)

Maarten Dammers

6 Mar

4:11 a.m.

Hi Luca,

On 5-3-2016 at 16:45, Luca Martinelli wrote:

Point taken, I apologise for using too dramatic tones.

Looks like more people are eager to get this over with and can't wait to get everything converted

Nonetheless, I stick to the point that a ">99% unique identifier" threshold is probably too high. To give another example (disclaimer: I asked for this property since it is yet another catalogue that my institution runs), P1949 has not been converted to identifier because it has "only 98.82% unique out of 507 uses", which translates to only *six* cases out of 505 items which have two P1949 identifiers.

That's correct. As I said in my previous email: We're first doing the easy properties. You can see the easy properties at https://www.wikidata.org/wiki/User:ArthurPSmith/Identifiers/1 . The easy ones are the ones that have 99%+ single value and 99%+ unique. Compare that with https://www.wikidata.org/wiki/User:Addshore/Identifiers/1 and you'll notice we still have loads of easy ones we have to process (the unchecked list is still quite long).
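
The two "easy" criteria can be sketched in Python. This is only an illustration: the thread does not define how the percentages on the tracking pages are computed, so the definitions below (share of items with exactly one value, and share of values used on exactly one item) are assumptions, as are the names `identifier_stats` and `is_easy` and the sample data.

```python
from collections import defaultdict


def identifier_stats(statements):
    """Return (single_value_pct, unique_pct) for one property.

    statements: iterable of (item_id, value) pairs.
    single_value_pct: share of items that carry exactly one value (assumed definition).
    unique_pct: share of values that appear on exactly one item (assumed definition).
    """
    values_by_item = defaultdict(set)
    items_by_value = defaultdict(set)
    for item, value in statements:
        values_by_item[item].add(value)
        items_by_value[value].add(item)
    if not values_by_item:
        raise ValueError("no statements, so no score can be computed")
    single = sum(1 for vals in values_by_item.values() if len(vals) == 1)
    unique = sum(1 for items in items_by_value.values() if len(items) == 1)
    return (100 * single / len(values_by_item),
            100 * unique / len(items_by_value))


def is_easy(statements, threshold=99.0):
    """True when both percentages reach the threshold (99%+ in the thread)."""
    single_pct, unique_pct = identifier_stats(statements)
    return single_pct >= threshold and unique_pct >= threshold


# Hypothetical data: Q2 has two values, and value "c" is shared by Q2 and Q3.
sample = [("Q1", "a"), ("Q2", "b"), ("Q2", "c"), ("Q3", "c")]
print(identifier_stats(sample))
print(is_easy(sample))
```

With these definitions a property fails the "easy" test as soon as either percentage drops below the threshold, which is the situation described for P1949.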

Once we get those out of the way, we'll get to the more difficult ones. I prefer quality over speed here. I don't expect any problems with converting P1949.

Maarten

Andy Mabbett

9 Mar

4:29 a.m.

On 5 March 2016 at 15:09, Maarten Dammers maarten@mdammers.nl wrote:

...

You call https://www.wikidata.org/wiki/Special:Contributions/Maintenance_script a dying pace?

Only twelve items converted, in the first 8 days of March; and none since the 2nd...

-- Andy Mabbett @pigsonthewing http://pigsonthewing.org.uk

Lydia Pintscher

4:43 a.m.

Yes, because until 2 days ago there wasn't more to convert. Marius will be back from studying for his exams in the next few days and then we'll continue.

Cheers Lydia

-- Lydia Pintscher - http://about.me/lydia.pintscher Product Manager for Wikidata Wikimedia Deutschland e.V. Tempelhofer Ufer 23-24 10963 Berlin www.wikimedia.de Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V. Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter der Nummer 23855 Nz. Als gemeinnützig anerkannt durch das Finanzamt für Körperschaften I Berlin, Steuernummer 27/029/42207.

Andy Mabbett

5:15 a.m.

This was in response to the comment "the whole process has slowed down to a dying pace"; not a criticism of Marius. That "there wasn't more to convert" suggests that the original comment was well-founded.

-- Andy Mabbett @pigsonthewing http://pigsonthewing.org.uk

Lydia Pintscher

5 Mar

8:45 p.m.

On Sat, Mar 5, 2016 at 1:28 PM Markus Krötzsch < markus@semantic-mediawiki.org> wrote:

Thanks, Katie. I see that the external ID datatype does not work as planned. At least I thought the original idea was to clean up the UI by moving hard-to-understand string IDs to a separate section. From the discussions on these pages, I see that the community uses criteria that are completely unrelated to UI aspects, but have something to do with the degree to which the property encodes a one-to-one mapping. I guess this is also valid, but won't be useful for UI purposes. I will need to use another solution for my case then.

Give it another 2 to 3 weeks and it'll get there. More and more editors are exposed to the separation in the UI now and start noticing the ones that intuitively should be moved into the identifier section.

Cheers Lydia

Markus Krötzsch

8:54 p.m.

Ok, let's see what happens. I am not saying that the other criteria applied now in the discussions are bad. It's just another use of the datatype than I would have expected.

Markus

David Cuenca Tudela

9 p.m.

Markus, you are not the only one; I am also skeptical about the criteria used. For me the main problem is perhaps the misunderstanding that the "external identifier" label creates. What I was actually expecting was something more like "external references": one place to put all the sources external to Wikidata. But we'll see how it goes.

Cheers, Micru

-- Etiamsi omnes, ego non

Egon Willighagen

9:16 p.m.

Hi Lydia, all,

On Sat, Mar 5, 2016 at 2:54 PM, Markus Krötzsch markus@semantic-mediawiki.org wrote:

...

On 05.03.2016 14:45, Lydia Pintscher wrote:

...
Give it another 2 to 3 weeks and it'll get there. More and more editors are exposed to the separation in the UI now and start noticing the ones that intuitively should be moved into the identifier section.

Ok, let's see what happens. I am not saying that the other criteria applied now in the discussions are bad. It's just another use of the datatype than I would have expected.

I'm one of the people who noticed the separation and indeed wondered why some of the chemistry-related identifiers I tagged and added in the long lists of identifiers were not included yet...

What is the exact process? Do you just plan to wait longer to see if anyone supports/contradicts my tagging? Should I get other Wikidata users and contributors to back up my suggestion?

Originally, I thought the idea was just to remove/leave/add them in/to the list, but people have started making comments now. I will do this more explicitly now, also for the IDs I added.

Egon

-- E.L. Willighagen Department of Bioinformatics - BiGCaT Maastricht University (http://www.bigcat.unimaas.nl/) Homepage: http://egonw.github.com/ LinkedIn: http://se.linkedin.com/in/egonw Blog: http://chem-bla-ics.blogspot.com/ PubList: http://www.citeulike.org/user/egonw/tag/papers ORCID: 0000-0001-7542-0286 ImpactStory: https://impactstory.org/EgonWillighagen

Lydia Pintscher

9:25 p.m.

Add them to the list Katie linked if you think they should be converted. We wait a bit to see if anyone disagrees and I also do a quick sanity check for each property myself before conversion.

Cheers Lydia

Egon Willighagen

9:35 p.m.

I am adding comments for now. I am also looking at the comments for what it takes to be "identifier":

https://www.wikidata.org/wiki/User:Addshore/Identifiers#Characteristics_of_e...

What is the resolution on these? There are some strong, often contradictory, opinions...

For example, the uniqueness requirement is interesting... if an identifier must be unique for a single Wikidata entry, this effectively disqualifies most identifiers used in the life sciences... simply because Wikidata rarely has exactly the same concept as the remote database.

I'm sure we can give examples from any life science field, but consider a gene: the concept of a gene in Wikidata is not like a gene sequence in a DNA sequence database. Hence, an identifier from that database could not be linked as "identifier" to that Wikidata entry.

The same goes for most identifiers for small organic compounds (like drugs, metabolites, etc). I already commented on CAS (P231) and InChI (P234): both are used as identifiers, but neither is unique to concepts used as "types" in Wikidata. The CAS number for formaldehyde and formalin is identical. The InChI may be unique, but only if you strongly type the definition of a chemical graph instead of a substance (as is now)... etc.

So, deciding which chemical identifiers should be marked as "identifier" type depends on the resolution of those required characteristics...

Can you please inform me about the state of those characteristics (accepted or declined)?

Egon

Markus Krötzsch

11:15 p.m.

Hi,

I agree with Egon that the uniqueness requirement is rather weird. What it means is that a thing is only considered an "identifier" if it points to a database that uses a similar granularity for modelling the world as Wikidata. If the external database is more fine-grained than Wikidata (several ids for one item), then it is not a valid "identifier", according to the uniqueness idea. I wonder what good this may do. In particular, anybody who cares about uniqueness can easily determine it from the data without any property type that says this.

Markus

James Heald

6 Mar

12:20 a.m.

Just do them all, as fast as the bot can go.

Revert them /if/ somebody complains (which is unlikely).

Make this a process where one has to opt out for an identifier /not/ to be converted, rather than opt in for it to be converted.

Personally, I am rather more interested in what happens next, after the datatype-renaming stage is done.

How does the external-ID datatype then evolve?

How does it cope with an external ID possibly having a short-form representation, a URL for humans (currently specified by P1630 for the group as a whole), a URL for RDF (currently specified by P1921 for the group as a whole), and sometimes also a locally preferred name or a locally disambiguated name in the external source?

What becomes its wdt: value for SPARQL?

What other object-values will get hung off its detailed statement form?

What will be specified using qualifiers?

Some more clarifications of current forward thinking on this might also help with people's concerns about how to respond to departures from strict 1-to-1-ness in the mappings (whether many-to-one or one-to-many).
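
To illustrate the point about an ID having both a human-facing URL and an RDF URI, here is a minimal Python sketch of expanding a formatter string. It assumes the `$1` placeholder convention for P1630 and P1921 values; the formatter strings, the sample ID and the function name `expand_formatter` are hypothetical and do not come from any real property.

```python
def expand_formatter(formatter: str, external_id: str) -> str:
    """Substitute the identifier for the "$1" placeholder in a formatter string."""
    if "$1" not in formatter:
        raise ValueError("formatter has no $1 placeholder")
    return formatter.replace("$1", external_id)


# Hypothetical formatter values, standing in for what P1630 and P1921 would hold.
human_formatter = "https://example.org/record/$1"
rdf_formatter = "https://example.org/rdf/$1"

short_form = "12345"  # hypothetical short-form ID as stored in the statement
print(expand_formatter(human_formatter, short_form))
print(expand_formatter(rdf_formatter, short_form))
```

The sketch shows that one stored short-form value can yield several derived representations, which is why it matters which of them ends up as the value exposed to queries. It does no URL-encoding of the ID, which a real implementation would probably need.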

-- James.

Andy Mabbett

1:56 a.m.

On 5 March 2016 at 16:15, Markus Krötzsch markus@semantic-mediawiki.org wrote:

...

I agree with Egon that the uniqueness requirement is rather weird. What it means is that a thing is only considered an "identifier" if it points to a database that uses a similar granularity for modelling the world as Wikidata. If the external database is more fine-grained than Wikidata (several ids for one item), then it is not a valid "identifier", according to the uniqueness idea.

Then we should create a Wikidata item for each concept on that external database.

-- Andy Mabbett @pigsonthewing http://pigsonthewing.org.uk

Gerard Meijssen

2:28 a.m.

Hi,

Let's take things slowly. It is vital that we get Wikipedia well connected first; there are plenty of challenges there. If we concentrate on what Wikipedia needs in all its languages, we will get a perspective on what is notable for us. Other sources have their own criteria.

Thanks, GerardM

Markus Krötzsch

4:17 p.m.

Another reason why "uniqueness" is not such a good criterion: it cannot be applied to decide the type of a newly created property (no statements, no uniqueness score). In general, the fewer statements there are for a property, the more likely they are to be unique. The criterion rewards data incompleteness (example: if Luca deletes the six multiple ids he mentioned, then the property could be converted -- and he could later add the statements again). If you think about it, it does not seem like a very good idea to make the datatype of a property depend on its current usage in Wikidata.
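
The dependence on current usage can be made concrete with a small Python sketch. The scoring rule (share of items that have exactly one value), the 99% cut-off applied to it, the function name `single_value_share` and the data are all assumptions for illustration; the actual computation behind the tracking pages is not described in this thread.

```python
from collections import Counter


def single_value_share(statements):
    """Share of items with exactly one value, or None when there are no statements.

    statements: list of (item_id, value) pairs for one property.
    """
    per_item = Counter(item for item, _ in statements)
    if not per_item:
        return None  # a newly created property has no score at all
    return sum(1 for n in per_item.values() if n == 1) / len(per_item)


# Hypothetical property: 100 items, two of which carry a second identifier.
base = [(f"Q{i}", f"id-{i}") for i in range(100)]
extra = [("Q0", "id-0b"), ("Q1", "id-1b")]

print(single_value_share(base + extra))  # below a 0.99 cut-off
print(single_value_share(base))          # same property after deleting the extras
print(single_value_share([]))            # new property: nothing to measure
```

In this sketch, removing two correct statements lifts the score from 0.98 to 1.0, and a property with no statements cannot be scored, which matches both objections above.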

Markus

On 05.03.2016 17:15, Markus Krötzsch wrote:

...

Hi,

I agree with Egon that the uniqueness requirement is rather weird. What it means is that a thing is only considered an "identifier" if it points to a database that uses a similar granularity for modelling the world as Wikidata. If the external database is more fine-grained than Wikidata (several ids for one item), then it is not a valid "identifier", according to the uniqueness idea. I wonder what good this may do. In particular, anybody who cares about uniqueness can easily determine it from the data without any property type that says this.

Markus

On 05.03.2016 15:35, Egon Willighagen wrote:

...
On Sat, Mar 5, 2016 at 3:25 PM, Lydia Pintscher Lydia.Pintscher@wikimedia.de wrote:

...
On Sat, Mar 5, 2016 at 3:17 PM Egon Willighagen egon.willighagen@gmail.com

...
What is the exact process? Do you just plan to wait longer to see if anyone supports/contradicts my tagging? Should I get other Wikidata users and contributors to back up my suggestion?

Add them to the list Katie linked if you think they should be converted. We wait a bit to see if anyone disagrees and I also do a quick sanity check for each property myself before conversion.

I am adding comments for now. I am also looking at the comments for what it takes to be "identifier":

https://www.wikidata.org/wiki/User:Addshore/Identifiers#Characteristics_of_e...

What is the resolution on these? There are some strong, often contradictory, opinions...

For example, the uniqueness requirement is interesting... if an identifier must be unique for a single Wikidata entry, this is effectively disqualifying most identifiers used in the life sciences... simply because Wikidata rarely has the exact same concept in Wikidata as it has in the remote database.

I'm sure we can give examples from any life science field, but consider a gene: the concept of a gene in Wikidata is not like a gene sequence in a DNA sequence database. Hence, an identifier from that database could not be linked as "identifier" to that Wikidata entry.

Same for most identifiers for small organic compounds (like drugs, metabolites, etc). I already commented on CAS (P231) and InChI (P234): both are used as identifiers, but neither is unique to the concepts used as "types" in Wikidata. The CAS for formaldehyde and formalin is identical. The InChI may be unique, but only if you strongly type the definition of a chemical graph instead of a substance (as is now)... etc.

So, the decision on which chemical identifiers should be marked as "identifier" type depends on the resolution of those required characteristics...

Can you please inform me about the state of those characteristics (accepted or declined)?

Egon

...
Cheers Lydia -- Lydia Pintscher - http://about.me/lydia.pintscher Product Manager for Wikidata

Wikimedia Deutschland e.V. Tempelhofer Ufer 23-24 10963 Berlin www.wikimedia.de

Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.

Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter der Nummer 23855 Nz. Als gemeinnützig anerkannt durch das Finanzamt für Körperschaften I Berlin, Steuernummer 27/029/42207.

Magnus Manske

6:27 p.m.

Agreed. In Mix'n'match, 145 out of 177 catalogs have at least one instance of two or more external IDs matched to a single Wikidata item. External datasets, even curated ones, are messy.

Maybe the criterion should be "intended to be unique", or somesuch.

Tom Morris

11:37 p.m.

If an identifier system provides for merging of entities along with the retention of both their previous IDs (as all good identifier systems which guarantee stable identifiers should), duplicate IDs are inevitable. Well known examples include Freebase, MusicBrainz, OpenLibrary, and yes, even Wikipedia & Wikidata. Duplicates may be silently resolved as is the case with Freebase, redirects like OpenLibrary and Wiki*, or a hybrid like MusicBrainz (some page types redirect, others don't). Merged identities may be relatively rare (Freebase) or more common (OpenLibrary, MusicBrainz), but they'll always happen. Mandating uniqueness would force the "losing" IDs to be deleted from Wikidata, losing the benefit that they bring for enhancing and strengthening the mesh of identifiers.
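
A minimal sketch of why retaining merged IDs is useful, assuming a hypothetical redirect table for an external database (the IDs and the `resolve` helper are invented for illustration): several stored IDs can legitimately point at the same external record, and a lookup by the old ID still succeeds.

```python
def resolve(external_id, redirects):
    """Follow redirects until a canonical id is reached (guards against cycles)."""
    seen = set()
    while external_id in redirects and external_id not in seen:
        seen.add(external_id)
        external_id = redirects[external_id]
    return external_id

# Hypothetical external database in which record "x-17" was merged into "x-42".
redirects = {"x-17": "x-42"}

# A Wikidata-like item that keeps both the winning and the losing id.
item_ids = {"Q100": ["x-42", "x-17"]}

# Both stored ids resolve to the same canonical record ...
canonical = {resolve(i, redirects) for i in item_ids["Q100"]}

# ... and a third-party dataset that still uses the old id can be matched.
index = {ext: item for item, ids in item_ids.items() for ext in ids}
match_via_old_id = index.get("x-17")
```

If the "losing" ID were deleted to enforce uniqueness, the last lookup would return nothing, even though the external database still honours that ID.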

I've looked at the identifier list a couple of times with an eye towards helping with the curation, but I could never make heads nor tails of what the criteria were, whether there was consensus about the criteria, why some perfectly acceptable identifiers were being vehemently argued against and on what grounds, etc. The "community" driving this process on those wiki pages seems to be just a handful of vocal and opinionated people. Is that going to generate good results?

Tom

Andy Mabbett

9 Mar

4:48 a.m.

On 6 March 2016 at 16:37, Tom Morris tfmorris@gmail.com wrote:

...

I've looked at the identifier list a couple of times with an eye towards helping with the curation, but I could never make heads nor tails of what the criteria were, whether there was consensus about the criteria, why some perfectly acceptable identifiers were being vehemently argued against and on what grounds, etc. The "community" driving this process on those wiki pages seems to be just a handful of vocal and opinionated people. Is that going to generate good results?

No - and you're correct in your assessment.

We haven't even managed to convert ISBN-10, ISBN-13, ISSN, or various ISO International Standard identifiers.

-- Andy Mabbett @pigsonthewing http://pigsonthewing.org.uk

Andy Mabbett

6 Mar

1:53 a.m.

On 5 March 2016 at 14:25, Lydia Pintscher Lydia.Pintscher@wikimedia.de wrote:

...

I also do a quick sanity check for each property myself before conversion.

You might also like to do a sanity check on those marked as not suitable for conversion.

-- Andy Mabbett @pigsonthewing http://pigsonthewing.org.uk

Egon Willighagen

5 Mar

9:42 p.m.

Mmm... I previously added a few chemical identifiers, like KEGG, ChEBI, DrugBank, but I cannot find them anymore... :/

Egon

On Sat, Mar 5, 2016 at 3:16 PM, Egon Willighagen egon.willighagen@gmail.com wrote:

...

Hi Lydia, all,

On Sat, Mar 5, 2016 at 2:54 PM, Markus Krötzsch markus@semantic-mediawiki.org wrote:

...
On 05.03.2016 14:45, Lydia Pintscher wrote:

...
Give it another 2 to 3 weeks and it'll get there. More and more editors are exposed to the separation in the UI now and start noticing the ones that intuitively should be moved into the identifier section.

Ok, let's see what happens. I am not saying that the other criteria applied now in the discussions are bad. It's just another use of the datatype than I would have expected.

I'm one of the people who noticed the separation and indeed wondered why some of the chemistry-related identifiers I tagged and added in the long lists of identifiers were not included yet...

What is the exact process? Do you just plan to wait longer to see if anyone supports/contradicts my tagging? Should I get other Wikidata users and contributors to back up my suggestion?

Originally, I thought the idea was just to remove/leave/add them in/to the list, but people started making comments now. I will do this more explicitly now. Also for the IDs I added.

Egon

-- E.L. Willighagen Department of Bioinformatics - BiGCaT Maastricht University (http://www.bigcat.unimaas.nl/) Homepage: http://egonw.github.com/ LinkedIn: http://se.linkedin.com/in/egonw Blog: http://chem-bla-ics.blogspot.com/ PubList: http://www.citeulike.org/user/egonw/tag/papers ORCID: 0000-0001-7542-0286 ImpactStory: https://impactstory.org/EgonWillighagen

Egon Willighagen

9:45 p.m.

Never mind. I found them in the "already done" list.

Egon

Stas Malyshev

7 Mar

4:56 a.m.

Hi!

...

The community is checking each property to verify it should be converted:

https://www.wikidata.org/wiki/User:Addshore/Identifiers/0

https://www.wikidata.org/wiki/User:Addshore/Identifiers/1

https://www.wikidata.org/wiki/User:Addshore/Identifiers/2

Is there a description somewhere of how the checking is done, what the criteria are, etc.? I've read https://www.wikidata.org/wiki/User:Addshore/Identifiers, but there's a lot of discussion and it's not clear whether it ever came to a conclusion. It's also not clear what the process is: should I just move a property I like to "good to convert"? Should I run it through some checklist first? Should I ask somebody? What are the rules for "disputed"? Is some process for review planned?

I think some more definite statement would help, especially to people willing to contribute.

-- Stas Malyshev smalyshev@wikimedia.org

Markus Kroetzsch

5 a.m.

+1 I have had the same questions.

In your case, however, the answer probably is: you cannot contribute there at all, since you are a Wikimedia employee and this is a content-related community discussion. ;-)

Best,

Markus

...

-- Markus Kroetzsch Faculty of Computer Science Technische Universität Dresden +49 351 463 38486 http://korrekt.org/

Stas Malyshev

5:31 a.m.

Hi!

...

In your case, however, the answer probably is: you cannot contribute there at all, since you are a Wikimedia employee and this is a content-related community discussion. ;-)

Many WMF employees contribute to wikis in their non-work time, as far as I know. I don't even seek to participate in the discussion (though I don't think WMF employment would disqualify me from contributing in volunteer capacity, given my affiliations - as they are - are clearly stated) - but only to know the results so I could contribute in editor capacity, following whatever rules are there.

-- Stas Malyshev smalyshev@wikimedia.org

Markus Kroetzsch

3:18 p.m.

On 06.03.2016 23:31, Stas Malyshev wrote:

...

Hi!

...
In your case, however, the answer probably is: you cannot contribute there at all, since you are a Wikimedia employee and this is a content-related community discussion. ;-)

Many WMF employees contribute to wikis in their non-work time, as far as I know. I don't even seek to participate in the discussion (though I don't think WMF employment would disqualify me from contributing in volunteer capacity, given my affiliations - as they are - are clearly stated) - but only to know the results so I could contribute in editor capacity, following whatever rules are there.

Yes, sure, your free time is a different matter. I just thought you are speaking as a WMF employee here, since you were using this email. I am probably over-sensitive there since I am used to the very strict policies of WMDE. They are very careful to keep paid and private activities separate by using different accounts.

Markus

-- Markus Kroetzsch Faculty of Computer Science Technische Universität Dresden +49 351 463 38486 http://korrekt.org/

Stas Malyshev

3:38 p.m.

Hi!

...

Yes, sure, your free time is a different matter. I just thought you are speaking as a WMF employee here, since you were using this email. I am

It's Sunday here, so no :) I do use two separate logins for WMF official and volunteer work on Wiki, but using two emails is too cumbersome for me.

...

probably over-sensitive there since I am used to the very strict policies of WMDE. They are very careful to keep paid and private activities separate by using different accounts.

Surely, it is common in WMF too. But again two email accounts seems excessive to me. Usually it's pretty clear from the context but if needed, I will clarify.

-- Stas Malyshev smalyshev@wikimedia.org

Pine W

3:58 p.m.

This use of a WMF email account raises some legal and wikipolitical ambiguities that are best avoided. I strongly recommend using a non-WMF email account for anyone who is speaking outside of a WMF role. Pinging James to ask for clarification on the policy. And let's fork this portion of the discussion. (:

Pine

Lydia Pintscher

5:31 a.m.

On Sun, Mar 6, 2016 at 10:56 PM Stas Malyshev smalyshev@wikimedia.org wrote:

...

Is there a description somewhere of how the checking is done, what the criteria are, etc.? I've read https://www.wikidata.org/wiki/User:Addshore/Identifiers, but there's a lot of discussion and it's not clear whether it ever came to a conclusion. It's also not clear what the process is: should I just move a property I like to "good to convert"? Should I run it through some checklist first? Should I ask somebody?

Yes. Good ones should be moved to good to convert. If no-one disagrees we'll convert them.

...

What are the rules for "disputed" - is some process for review planned?

Let's concentrate on the ones people can agree on for now. We'll tackle the ones that are disputed in the next step. If editors can't sort it out I will make an executive decision at some point but I don't think this will be needed.

Cheers Lydia

Tom Morris

8:56 a.m.

On Sun, Mar 6, 2016 at 5:31 PM, Lydia Pintscher < Lydia.Pintscher@wikimedia.de> wrote:

...

On Sun, Mar 6, 2016 at 10:56 PM Stas Malyshev smalyshev@wikimedia.org wrote:

...
Is there a description somewhere of how the checking is done, what the criteria are, etc.? I've read https://www.wikidata.org/wiki/User:Addshore/Identifiers, but there's a lot of discussion and it's not clear whether it ever came to a conclusion. It's also not clear what the process is: should I just move a property I like to "good to convert"? Should I run it through some checklist first? Should I ask somebody?

Yes. Good ones should be moved to good to convert. If no-one disagrees we'll convert them.

So, no decision criteria? Just whatever we individually like?

What are the rules for "disputed" - is some process for review planned?

...

...
Let's concentrate on the ones people can agree on for now. We'll tackle the ones that are disputed in the next step. If editors can't sort it out I will make an executive decision at some point but I don't think this will be needed.

I think the fact that some obviously good identifiers like IMDb have been blocked has made potential contributors unsure how to evaluate other candidates which would also, on the surface, seem obviously good.

Perhaps since the criteria aren't being used, someone could just delete all the proposed criteria from the page and replace the old text with something like "Whatever you, personally, think is best" so that people know what's expected of them? That might help break the logjam. I know it would make me more comfortable in contributing.

Tom

Lydia Pintscher

3:13 p.m.

Ok. I think we're making this much more complicated than necessary. The question you should ask yourself is: Does this identify a concept in another database/website/...? Nice to have: a website to link to. Once we have that we can look at corner cases and exceptions.

Cheers Lydia

Markus Kroetzsch

5:54 p.m.

The community actually already has a class for such properties:

"Wikidata property representing a unique identifier" http://www.wikidata.org/entity/Q19847637

In general, the community uses several classes for properties that could have been used for UI organisation, rather than introducing new datatypes. The current discussion is caused mainly by the fact that there is just *one* new datatype, but many types of identifiers based on different criteria -- so people argue about which one the new datatype should represent. The classes used on properties are much less controversial, because one can simply have one class for each criterion that people consider relevant. For example, there also is

"multi-source external identifier" http://www.wikidata.org/entity/Q21264328

There are many other classes that could be used in the interface, e.g., "Wikidata property for human relationships" http://www.wikidata.org/entity/Q22964231 that one could use very well to group properties. One would not need to use all classes to group properties: there would be a (short) list that the community would decide on. I think this is the best approach to get reasonable property groups Reasonator-style into Wikidata at some point. It works much better than creating new datatypes for each case, it can build on existing data (rather than starting new discussions on datatype conversion), and it has the advantage that it can also group properties of different types.
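
The grouping idea can be sketched as follows. The class Q-ids are the ones cited above; the property ids, the property-to-class assignments, and the `group_properties` helper are hypothetical, chosen only to show that a property falls into a group regardless of its datatype.

```python
# Short, community-chosen list of classes used for grouping, in display order.
# Q-ids are those cited in the message above; labels are abbreviated.
GROUP_CLASSES = [
    ("Q19847637", "unique identifiers"),
    ("Q21264328", "multi-source external identifiers"),
    ("Q22964231", "human relationships"),
]

def group_properties(property_classes, group_classes=GROUP_CLASSES):
    """Assign each property to the first listed class it is an instance of.

    `property_classes` maps a property id to the set of class ids it
    belongs to. Properties matching no listed class go to "other".
    """
    groups = {label: [] for _, label in group_classes}
    groups["other"] = []
    for prop in sorted(property_classes):
        classes = property_classes[prop]
        for class_id, label in group_classes:
            if class_id in classes:
                groups[label].append(prop)
                break
        else:
            groups["other"].append(prop)
    return groups

# Hypothetical class memberships; datatypes play no role in the grouping.
example = {
    "P-id-a": {"Q19847637"},
    "P-id-b": {"Q21264328", "Q19847637"},
    "P-rel": {"Q22964231"},
    "P-misc": set(),
}
groups = group_properties(example)
```

Putting a property into the first matching class is only one possible design choice; an interface could just as well list a property under every group it belongs to.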

Markus

-- Markus Kroetzsch Faculty of Computer Science Technische Universität Dresden +49 351 463 38486 http://korrekt.org/

Daniel Kinzler

10 Mar 10 Mar

3:57 a.m.

On 07.03.2016 at 11:54, Markus Kroetzsch wrote:

...

In general, the community uses several classes for properties that could have been used for UI organisation, rather than introducing new datatypes.

Technically, the main purpose of having a separate datatype was to explicitly model values that identify a resource (in the RDF sense, where resource means "anything that can be identified unambiguously"), so we can apply mappings (e.g. to URIs and URLs) when exporting and displaying them.

Using the datatype for the UI structure is an attempt to kill two birds with one stone. I think it's a pretty good start, but I agree that we should revisit this once we have gathered some feedback. It would not be too hard to base the structure on different criteria (well, depends on the criteria).

-- Daniel Kinzler Senior Software Developer Wikimedia Deutschland Gesellschaft zur Förderung Freien Wissens e.V.

Stas Malyshev

6:09 a.m.

Hi!

...

Technically, the main purpose of having a separate datatype was to explicitly model values that identify a resource (in the RDF sense, where resource means "anything that can be identified unambiguously"), so we can apply mappings (e.g. to URIs and URLs) when exporting and displaying them.

We also need to be careful here, as we have some external ID-like types that do not translate to URIs. So we should either not convert them to that type, or we'd have complex logic for deciding which type the corresponding properties should be (since we should tell whether this property uses string or URI, and we should be able to do it automatically when generating RDF).

Right now I'd suggest not converting such properties, unless there's a good reason to.

-- Stas Malyshev smalyshev@wikimedia.org

James Heald

6:49 a.m.

On 09/03/2016 23:09, Stas Malyshev wrote:

...

Hi!

...
Technically, the main purpose of having a separate datatype was to explicitly model values that identify a resource (in the RDF sense, where resource means "anything that can be identified unambiguously"), so we can apply mappings (e.g. to URIs and URLs) when exporting and displaying them.

We also need to be careful here, as we have some external ID-like types that do not translate to URIs. So we should either not convert them to that type, or we'd have complex logic for deciding which type the corresponding properties should be (since we should tell whether this property uses string or URI, and we should be able to do it automatically when generating RDF).

Right now I'd suggest not converting such properties, unless there's a good reason to.

Somewhat related to what Stas writes, can I remind again that we have many properties that have single identifiers, that map to different URLs for different purposes (eg a URL for human readers, a slightly different URL for RDF).

Both of those URLs should be available /somewhere/ in the triplestore or the RDF dump -- but probably neither of them is what one would want the simple wdt: form of the property on SPARQL to return.

-- James.

Bene*

7:02 a.m.

Hey

On 10.03.2016 at 00:49, James Heald wrote:

...

On 09/03/2016 23:09, Stas Malyshev wrote:

...
Hi!

...
Technically, the main purpose of having a separate datatype was to explicitly model values that identify a resource (in the RDF sense, where resource means "anything that can be identified unambiguously"), so we can apply mappings (e.g. to URIs and URLs) when exporting and displaying them.

We also need to be careful here, as we have some external ID-like types that do not translate to URIs. So we should either not convert them to that type, or we'd have complex logic for deciding which type the corresponding properties should be (since we should tell whether this property uses string or URI, and we should be able to do it automatically when generating RDF).

Right now I'd suggest not converting such properties, unless there's a good reason to.

In theory, having an identifier datatype and rendering strings as urls are two separate things. We could dispatch the rendering based on property_info and support the "formatter url" property for more values (eg. coordinates) without even having an identifier datatype. It is just a good idea to conceptually separate external identifiers from other string values.

I don't see why it is an issue that some external identifiers don't translate to URIs. What complex logic is involved here? In RDF we should just add the plain identifier like we have it now as the default value, and the expanded urls as derived values if available.

...

Somewhat related to what Stas writes, can I remind again that we have many properties that have single identifiers, that map to different URLs for different purposes (eg a URL for human readers, a slightly different URL for RDF).

Both of those URLs should be available /somewhere/ in the triplestore or the RDF dump -- but probably neither of them is what one would want the simple wdt: form of the property on SPARQL to return.

I agree that for the simple wdt: form we should still have the plain id without any expanded urls. For the derived values (full urls but also relevant for other data types), we still need to find a proper way to represent those derived data values in our data model. As soon as we tackle that issue, it will be possible to provide those urls in the api output as well as the serialized RDF.

Best regards Bene

Stas Malyshev

7:37 a.m.

Hi!

...

In theory, having an identifier datatype and rendering strings as urls are two separate things. We could dispatch the rendering based on property_info and support the "formatter url" property for more values (eg. coordinates) without even having an identifier datatype. It is just a good idea to conceptually separate external identifiers from other string values.

Correct in theory. In practice, however, if we create an implication between the two, we need to be careful not to create cases where it would be hard for automatic tools to produce a correct result.

...

I don't see why it is an issue that some external identifiers don't translate to URIs. What complex logic is involved here? In RDF we should just add the plain identifier like we have it now as the default value,

Suppose we say: since external IDs are in fact URIs (they refer to external databases), let's mark them as URI properties and render them as full URIs, i.e. instead of:

wd:Q1000336 wdt:P646 "/m/03pvzn"

say this:

wd:Q1000336 wdt:P646 https://www.freebase.com/m/03pvzn

This may make a lot of sense, since the interesting URL that people would like to see may be the latter, and the former is a kind of chopped-off form of it that we use for our internal purposes. OTOH, what if it wasn't easy or possible to generate the latter from the former automatically? Then we need some logic to figure that out.
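A rough sketch of that logic, assuming a formatter pattern with a `$1` placeholder (the convention used by formatter URL statements) and a fallback to the plain string when no pattern is available. The function names and the Freebase pattern are assumptions for illustration only:

```python
# Hypothetical pattern; the real formatter statement may look different.
FREEBASE_FORMATTER = "https://www.freebase.com$1"


def expand_identifier(value, formatter=None):
    """Return the full URL for an ID, or None if it cannot be generated."""
    if formatter is None or "$1" not in formatter:
        return None
    # No URL escaping is applied in this sketch.
    return formatter.replace("$1", value)


def rdf_object(value, formatter=None):
    """Render the object of a triple: a URI if possible, else a literal."""
    url = expand_identifier(value, formatter)
    if url is None:
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return '"%s"' % escaped
    return "<%s>" % url


if __name__ == "__main__":
    print("wd:Q1000336 wdt:P646", rdf_object("/m/03pvzn"))
    print("wd:Q1000336 wdt:P646", rdf_object("/m/03pvzn", FREEBASE_FORMATTER))
```

The awkward part is visible in `rdf_object`: the kind of value a consumer gets (literal or URI) depends on whether a pattern happens to exist for the property.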

...

and the expanded urls as derived values if available.

What do you mean by "derived values"?

-- Stas Malyshev smalyshev@wikimedia.org

Markus Kroetzsch

4:26 p.m.

Dear all,

I am surprised by the amount of confusion in this discussion. There is absolutely no relationship between mapping of Wikidata values to URIs and the external id datatype. There is no reason in RDF or elsewhere why the two should be related.

(1) The mapping of Wikidata strings to URLs is controlled by the formatter URL (P1630) property and its qualifiers.

(2) The mapping of Wikidata strings to URIs is controlled by the URI pattern for RDF resource (P1921) property and its qualifiers.

(3) The external id datatype does not provide any mapping and the criteria used for it by the community do not imply that such mappings should exist for these cases, or that they should not exist for other cases.

We can add any amount of URLs (1) or URIs (2) to the RDF store without problems. We only need to pick which ones to use (this can be done using qualifiers, similar to the "used by" (P1535) qualifier that is already used for formatter URL if there are many). RDF imposes no requirements on how these URLs/URIs should look, whether they are unique or not, or whether they are issued by a particular authority or not. There is no danger of confusion since URIs are by their very nature global IDs, so you can use many of them on one resource without any problems. None of the issues discussed in this thread seems to play much of a role in RDF or in RDF consumers.
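To illustrate, here is a small sketch that emits the plain literal together with any number of derived URIs as N-Triples lines. The `wd:` and `wdt:` namespaces are the usual Wikidata ones; the `DERIVED` predicate namespace, the example patterns and the `purpose` tag (standing in for a selecting qualifier) are all assumptions made up for this example:

```python
WD = "http://www.wikidata.org/entity/"
WDT = "http://www.wikidata.org/prop/direct/"
DERIVED = "http://example.org/derived/"  # hypothetical predicate namespace


def triples_for(item, prop, value, patterns, purpose=None):
    """Return N-Triples lines: the plain literal plus one URI per pattern.

    `patterns` is a list of (pattern, purpose) pairs; `purpose` plays the
    role of a selecting qualifier. With purpose=None every pattern is used.
    Escaping of special characters in `value` is left out of this sketch.
    """
    subject = "<%s%s>" % (WD, item)
    lines = ['%s <%s%s> "%s" .' % (subject, WDT, prop, value)]
    for pattern, pattern_purpose in patterns:
        if purpose is None or pattern_purpose == purpose:
            uri = pattern.replace("$1", value)
            lines.append("%s <%s%s> <%s> ." % (subject, DERIVED, prop, uri))
    return lines


# Hypothetical patterns; not real Wikidata statements.
EXAMPLE_PATTERNS = [
    ("https://example.org/page$1", "human"),
    ("https://example.org/rdf$1", "rdf"),
]

if __name__ == "__main__":
    for line in triples_for("Q1000336", "P646", "/m/03pvzn", EXAMPLE_PATTERNS):
        print(line)
```

Several URIs on the same subject simply become several triples; nothing in the output format forces a single choice.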

I am most worried about Daniel's remark. He says that he wants to use external ids to identify properties with "values that identify a resource", but does not mention the existing, community-supported mechanism for doing just that (2), and instead proposes another mechanism (3), which the community is clearly not using for this purpose at all. In fact, there is no need for a technical change in Wikibase here: if we want external URIs in RDF, we just have to add them based on the data we find in P1921. If the developers don't like to follow the community in this case, they should explain their technical (!) concerns to the community and gather feedback.

Regards,

Markus

-- Markus Kroetzsch Faculty of Computer Science Technische Universität Dresden +49 351 463 38486 http://korrekt.org/

Daniel Kinzler

11 Mar 11 Mar

12:43 a.m.

On 10.03.2016 at 10:26, Markus Kroetzsch wrote:

...

I am surprised by the amount of confusion in this discussion. There is absolutely no relationship between mapping of Wikidata values to URIs and the external id datatype.

You are correct that such a relationship does not necessarily follow from first principles. You are however incorrect in saying that there is no relationship in Wikibase: The way the data model is currently defined and the way mappings are implemented, we made a conscious decision to support such mappings only for ExternalId values.

I think it would help the discussion if we could keep apart:

* what follows from formal principles

* what you (or I) consider best

* what the software currently does

...

(3) The external id datatype does not provide any mapping and the criteria used for it by the community do not imply that such mappings should exist for these cases, or that they should not exist for other cases.

That is incorrect from the way Wikibase defines and uses the ExternalId datatype: the intent is indeed to say that something is an identifier that can be mapped, and that such a (direct) mapping is not supported for other data types. (That doesn't mean we will not offer different mappings for other data types, perhaps URLs for looking up coordinates, etc).

Modeling this explicitly is indeed the reason to have this datatype.

...

I am most worried about Daniel's remark. He says that he wants to use external ids to identify properties with "values that identify a resource", but does not mention the existing, community-supported mechanism for doing just that (2), and instead proposes another mechanism (3), which the community is clearly not using for this purpose at all.

That's a misunderstanding. The plan is to support P1921 for URI mappings, and we already do support P1630 for URL mappings. But we intentionally do this only for ExternalId values, not for plain strings or other types.

So, the technical implementation does follow the community convention, with the restriction that properties that should use this kind of mapping need to explicitly be declared to be identifiers. We are also considering implementing validation and normalization for ExternalId values, but it's not clear yet how we can safely apply community supplied validation and normalization patterns.

-- Daniel Kinzler Senior Software Developer Wikimedia Deutschland Gesellschaft zur Förderung Freien Wissens e.V.

Luiz Augusto

12:59 a.m.

tl;dr

As far as I can see, the developers expect properties to be listed by the community, but the listing is kept in a form that the community normally regards as a draft: something that waits for complaints for a while and is implemented later... communication issues, eh?

BTW I've listed some properties on [1], all of them examples of ID uniqueness being a non-issue.

The remaining user subpages may well fall within the very same scope, but I will wait until the 'community process expectation' versus 'this is a thing that will get done even if no one says a word' question gets clarified.

[1] - https://www.wikidata.org/w/index.php?diff=311688067

Markus Kroetzsch

5:20 p.m.

On 10.03.2016 18:43, Daniel Kinzler wrote:

...

On 10.03.2016 at 10:26, Markus Kroetzsch wrote:

...
I am surprised by the amount of confusion in this discussion. There is absolutely no relationship between mapping of Wikidata values to URIs and the external id datatype.

You are correct that such a relationship does not necessarily follow from first principles. You are however incorrect in saying that there is no relationship in Wikibase: The way the data model is currently defined and the way mappings are implemented, we made a conscious decision to support such mappings only for ExternalId values.

Maybe the community needs a bit more explanation as to why you "consciously" decide to override their judgement. The use of property P1921 clearly tells you what the community wants. If we want to have URIs only for some subset of properties, then we will use P1921 only on a subset. It is very easy and gives us complete control. The use of ExternalId as an additional restricting mechanism is neither helpful nor desired. We can decide for ourselves which properties should have URIs exported for them, without needing conscious but unprincipled development decisions to constrain us.

It would be helpful if you could share some pointers (1) to the original announcement and documentation for this restricting behaviour for URI exports (clearly, this information is vital for the ongoing discussion on property conversion), and (2) to the discussions that have led to this design (surely you must have consulted with some RDF/SPARQL users and developers to conclude that some P1921 should be ignored). I am really curious to learn what "we" refers to in "we made a conscious decision".

Markus

-- Markus Kroetzsch Faculty of Computer Science Technische Universität Dresden +49 351 463 38486 http://korrekt.org/

Daniel Kinzler

7:03 p.m.

On 11.03.2016 at 11:20, Markus Kroetzsch wrote:

...

Maybe the community needs a bit more explanation as to why you "consciously" decide to override their judgement.

The idea is to give the community a tool to explicitly model their judgement that something is an identifier, and introduce that idea of external identifiers into the software exactly because that need was expressed by the community. Relevant use cases: linking, mapping, and UI structure.

...

The use of property P1921 clearly tells you what the community wants. If we want to have URIs only for some subset of properties, then we will use P1921 only on a subset. It is very easy and gives us complete control. The use of ExternalId as an additional restricting mechanism is neither helpful nor desired.

Can you give an example of something you want to map to a URI, but that is not an external identifier? There are probably edge cases, and thinking about them and deciding on the desired semantics is a good thing, I believe.

...

We can decide for ourselves which properties should have URIs exported for them, without needing conscious but unprincipled development decisions to constrain us.

"unprincipled", wow. The decision followed the principle that we want to have software that is extensible and maintainable, and we want a data model that makes explicit the semantics of values. Following these principles, the declaration of what a value is dictates what you can do with it. That's the basic idea of object oriented design.

Of course, it would be possible to ditch these principles, and use the "duck typing" approach: anything that has a formatter URL could be linked, etc. But that introduces several problems:

* modeling: values can suddenly stop "being" identifiers, or become other things, based on the statements on the property definition. This can lead to inconsistencies in the way values are represented in dumps etc.

* implementation: we would either need to hard code a special case, or a mechanism to apply all kinds of behaviors (formatting, mapping, parsing, etc) based on all kinds of statements on properties. We can hard code for a few things, but a general mechanism would hardly be scalable or maintainable. We do have a solid and simple mechanism based on data types that works fine to cover the use cases for external identifiers.

* stability: if we base more and more behavior of the software on properties and statements defined by the community, the community would no longer be free to modify such properties and statements. That would break the software. We do compromise about this sometimes: Wikibase can be configured to know about a few properties and items (such as P1630). But we should be careful about it, because it takes away control from the community.

* consistency: You can't link just any kind of value based on a formatter uri. That only works for string values, and probably shouldn't be done for string values that have the "url" data type. So linking would only work for properties declared to be plain strings per their data type. Again, behavior is bound to the data type.

These principles are actually why we have data types at all. You were there when we decided to have them. If we didn't care about the points above, we wouldn't need data types at all; value types would be sufficient. Everything else would be covered by "if it quacks like a duck...". That would mean a less expressive data model, and more complicated software. A lot more complicated, if you want to apply this for everything.
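The difference between the two approaches can be sketched as follows. This is only an illustration of the argument, not Wikibase code: both functions are hypothetical, and `"external-id"` and `"string"` stand for the declared data type of a property.

```python
def link_by_datatype(datatype, value, formatter):
    """Datatype-based: only values declared as external-id are ever linked."""
    if datatype == "external-id" and formatter:
        return formatter.replace("$1", value)
    return None


def link_by_duck_typing(datatype, value, formatter):
    """Duck typing: any string value is linked as soon as a formatter exists.

    The declared datatype is ignored, so behavior changes whenever the
    formatter statement on the property is added or removed.
    """
    if formatter and isinstance(value, str):
        return formatter.replace("$1", value)
    return None


if __name__ == "__main__":
    pattern = "https://example.org/$1"  # hypothetical formatter pattern
    for datatype in ("external-id", "string"):
        print(datatype,
              link_by_datatype(datatype, "abc", pattern),
              link_by_duck_typing(datatype, "abc", pattern))
```

With the datatype-based variant, a plain string property is never linked, whatever statements are later added to the property page; with duck typing, the same value starts or stops being a link depending on those statements.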

...

It would be helpful if you could share some pointers (1) to the original announcement and documentation for this restricting behaviour for URI exports (clearly, this information is vital for the ongoing discussion on property conversion),

It's a modeling tool, not a restriction. If there are things that should be mapped to URIs but for some reason shouldn't have the ExternalId type, we should look at these edge cases closely to find out what is wrong. Since clearly, if it's not an identifier of some sort, it can't sensibly have a URI, and if it is an identifier of some sort, there should be no reason not to mark it as such to the software, by making it an ExternalId.

...

and (2) to the discussions that have led to this design (surely you must have consulted with some RDF/SPARQL users and developers to conclude that some P1921 should be ignored).

I do not think any should be ignored. I think that properties that use P1921 should be ExternalIds. Please explain why you would not want that.

...

I am really curious to learn what "we" refers to in "we made a conscious decision".

Decisions about the design and implementation of the software are made by the development team ("us"), based on requirements and considerations on the technical as well as the product level, which in turn are informed by community interaction, among other things.

As is often the case, solutions that have to be maintainable and scalable are not quite as nice as one-off solutions for a special case. MediaWiki is conservative about adding special case features for good reasons: it's quite complex as it is; if it had tried to cater to every special case, it would have collapsed under its own weight a long time ago.

The idea is to generalize from special cases, and implement something that will work for many more cases, even though it perhaps covers only 90% of what you could do by catering to the special case directly.

Of course, overly generic multi-option multi-purpose mechanisms should also be avoided, because they are hard to understand and hard to maintain. So a balance needs to be found.

Trying to strike that balance, in 2012 we (in this case including you, iirc) designed data types to be a simple yet sufficiently generic mechanism for associating behavior with values. So now we use it to associate behavior with values (like mapping to URLs and URIs), and I am very reluctant to introduce another mechanism for associating behavior with values.

-- Daniel Kinzler Senior Software Developer Wikimedia Deutschland Gesellschaft zur Förderung Freien Wissens e.V.

Tom Morris

9 May 9 May

2:52 a.m.

Has the identifier migration stalled? I was just looking at this page:

https://www.wikidata.org/wiki/Q622828

and the first 9 claims on the page are all identifiers. There are only two (Freebase & Disease Ontology) in the identifier section at the bottom of the page.

Tom

Lydia Pintscher

4:42 p.m.

On Sun, May 8, 2016 at 9:54 PM Tom Morris tfmorris@gmail.com wrote:

...

Has the identifier migration stalled? I was just looking at this page:
https://www.wikidata.org/wiki/Q622828
and the first 9 claims on the page are all identifiers. There are only two (Freebase & Disease Ontology) in the identifier section at the bottom of the page.

I just posted an update at https://www.wikidata.org/wiki/User:Addshore/Identifiers#Let.27s_get_this_don...

Cheers Lydia

Tom Morris

11 Mar 11 Mar

12:12 a.m.

On Wed, Mar 9, 2016 at 7:37 PM, Stas Malyshev smalyshev@wikimedia.org wrote:

...

...
I don't see why it is an issue that some external identifiers don't translate to URIs. What complex logic is involved here? In RDF we should just add the plain identifier like we have it now as the default value,

Suppose we say: since external IDs are in fact URIs (they refer to external databases), let's mark them as URI properties and render them as full URIs, i.e. instead of:

wd:Q1000336 wdt:P646 "/m/03pvzn"

say this:

wd:Q1000336 wdt:P646 https://www.freebase.com/m/03pvzn

This may make a lot of sense, since the interesting URL that people would like to see may be the latter, and the former is a kind of chopped-off form of it that we use for our internal purposes. OTOH, what if it wasn't easy or possible to generate the latter from the former automatically? Then we need some logic to figure that out.

...

From a machine processing point of view, a more interesting statement is probably:

wd:Q1000336 owl:sameAs <https://rdf.freebase.com/ns/m.03pvzn>

This is supposed to redirect to either RDF https://rdf.freebase.com/rdf/m.03pvzn or HTML https://www.freebase.com/m/03pvzn based on content negotiation, but that seems to be broken right now and it always returns RDF.
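Note that the forms quoted in this thread differ by more than a prefix: the stored value is "/m/03pvzn", while the RDF URI uses "m.03pvzn". A simple placeholder substitution would therefore not be enough. A small sketch of the conversion, with hypothetical helper names, based only on the URL shapes shown above:

```python
def mid_to_rdf_uri(mid):
    """Turn a MID like '/m/03pvzn' into the rdf.freebase.com/ns/ form."""
    if not mid.startswith("/"):
        raise ValueError("expected a MID starting with '/': %r" % mid)
    # Drop the leading slash and use dots instead of slashes.
    return "https://rdf.freebase.com/ns/" + mid[1:].replace("/", ".")


def mid_to_html_url(mid):
    """Turn a MID like '/m/03pvzn' into the www.freebase.com page URL."""
    if not mid.startswith("/"):
        raise ValueError("expected a MID starting with '/': %r" % mid)
    return "https://www.freebase.com" + mid


if __name__ == "__main__":
    print(mid_to_rdf_uri("/m/03pvzn"))
    print(mid_to_html_url("/m/03pvzn"))
```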

Tom

Egon Willighagen

12:58 a.m.

On Thu, Mar 10, 2016 at 6:12 PM, Tom Morris tfmorris@gmail.com wrote:

...

On Wed, Mar 9, 2016 at 7:37 PM, Stas Malyshev smalyshev@wikimedia.org wrote: From a machine processing point of view, a more interesting statement is probably:
wd:Q1000336 owl:sameAs <https://rdf.freebase.com/ns/m.03pvzn>

Yes, but this proposal matches part of the discussion... owl:sameAs is in many cases not appropriate and likely should not be the goal in the first place: in many cases there is not such a clear 1-to-1 relation, and even if there is a 1-to-1 relation, the above may still be inappropriate.

Egon

Young,Jeff (OR)

1:01 a.m.

Couldn't you use P460 when there is doubt?

https://www.wikidata.org/wiki/Property:P460

Jeff

Stas Malyshev

1:52 a.m.

Hi!

...

Couldn't you use P460 when there is doubt?

https://www.wikidata.org/wiki/Property:P460

P460's type is Item, which means it is a relation between two Wikidata items. An external ID is a relation between a Wikidata item and something outside Wikidata.

-- Stas Malyshev smalyshev@wikimedia.org

Young,Jeff (OR)

2:08 a.m.

Then perhaps umbel:isLike instead of owl:sameAs?

http://wiki.opensemanticframework.org/index.php/UMBEL_Vocabulary#isLike_Prop...

It conveys sameAs but with a hint of uncertainty.

Daniel Kinzler

4:54 a.m.

On 10.03.2016 at 20:08, Young,Jeff (OR) wrote:

...

Then perhaps umbel:isLike instead of owl:sameAs?

http://wiki.opensemanticframework.org/index.php/UMBEL_Vocabulary#isLike_Prop...

In some cases owl:equivalentProperty may be appropriate https://www.w3.org/TR/owl-ref/#equivalentProperty-def

-- Daniel Kinzler Senior Software Developer Wikimedia Deutschland Gesellschaft zur Förderung Freien Wissens e.V.

Egon Willighagen

1:16 p.m.

I think the predicate may need to depend on the type of thing we're linking, and on what is being linked. A strong predicate, like owl:sameAs, requires a (very) strong similarity between concepts... this is often not the case... mind you, this applies also to the things being linked... there are enough alternatives, like rdfs:seeAlso as probably one of the least informative predicates, via skos:closeMatch and skos:exactMatch ... for chemicals, I have been involved in work by the Open PHACTS project (now foundation), led by Alasdair Gray, on "scientific lenses"... I'm biased, but the (conference) papers are a good read anyway... [eg Q23034460]
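As a sketch of what choosing the predicate per link could look like: the strength labels below are names invented for this example, while the IRIs are the standard ones for owl:sameAs, skos:exactMatch, skos:closeMatch and seeAlso (which lives in the RDFS namespace).

```python
# Predicates from strongest to weakest claim. The labels on the left are
# hypothetical names for this sketch.
PREDICATES = {
    "identical": "http://www.w3.org/2002/07/owl#sameAs",
    "exact": "http://www.w3.org/2004/02/skos/core#exactMatch",
    "close": "http://www.w3.org/2004/02/skos/core#closeMatch",
    "related": "http://www.w3.org/2000/01/rdf-schema#seeAlso",
}


def link_triple(subject, target, strength="related"):
    """Return an N-Triples line linking subject to target.

    Unknown strengths fall back to the weakest predicate, so that a typo
    never results in a stronger claim than intended.
    """
    predicate = PREDICATES.get(strength, PREDICATES["related"])
    return "<%s> <%s> <%s> ." % (subject, predicate, target)


if __name__ == "__main__":
    print(link_triple("http://www.wikidata.org/entity/Q1000336",
                      "https://rdf.freebase.com/ns/m.03pvzn",
                      "close"))
```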

Egon

Tom Morris

1:35 a.m.

On Thu, Mar 10, 2016 at 12:58 PM, Egon Willighagen < egon.willighagen@gmail.com> wrote:

...

On Thu, Mar 10, 2016 at 6:12 PM, Tom Morris tfmorris@gmail.com wrote:

...
On Wed, Mar 9, 2016 at 7:37 PM, Stas Malyshev smalyshev@wikimedia.org wrote:

...
From a machine processing point of view, a more interesting statement is probably:

...
wd:Q1000336 owl:sameAs <https://rdf.freebase.com/ns/m.03pvzn>

Yes, but this proposal matches part of the discussion...

Actually, it doesn't, but for some reason you chose not to quote the original URL which showed the difference. The URL https://www.freebase.com/m/03pvzn is not the same as the URI above.

...

owl:sameAs is in many cases not appropriate and likely should not be the goal in the first place: in many cases there is not such a clear 1-to-1 relation, and even if there is a 1-to-1 relation, the above may still be inappropriate.

So choose a predicate that you think is more appropriate. The important thing is that the URIs match so that computers can tell that they're the same thing.
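
To illustrate why matching URIs matters: a sketch, assuming the two Freebase forms quoted in this thread (the page URL ending in /m/03pvzn and the RDF-style URI ending in /ns/m.03pvzn) differ only in that layout. The function name and the namespace constant are hypothetical, and the prefix is copied as written in this thread:

```python
from urllib.parse import urlparse

# Assumption: namespace prefix copied verbatim from the thread.
FREEBASE_RDF_NS = "https://rdf.freebase.com/ns/"


def freebase_page_url_to_rdf_uri(page_url: str) -> str:
    """Rewrite a page URL like .../m/03pvzn into the RDF-style URI .../ns/m.03pvzn."""
    # Path of the page URL, e.g. "/m/03pvzn" -> ["m", "03pvzn"]
    parts = urlparse(page_url).path.strip("/").split("/")
    if len(parts) != 2 or parts[0] != "m" or not parts[1]:
        raise ValueError(f"not a Freebase machine-ID URL: {page_url!r}")
    return FREEBASE_RDF_NS + "m." + parts[1]


page = "https://www.freebase.com/m/03pvzn"
rdf = "https://rdf.freebase.com/ns/m.03pvzn"

# A plain string comparison treats the two forms as different resources...
print(page == rdf)
# ...while the normalized form matches the URI used in the triple.
print(freebase_page_url_to_rdf_uri(page) == rdf)
```

Without such a normalization step, a consumer comparing strings has no way to tell that both forms point at the same entry.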

Tom

Stas Malyshev

1:50 a.m.

Hi!

...

From a machine processing point of view, a more interesting statement is probably:
wd:Q1000336 owl:sameAs <https://rdf.freebase.com/ns/m.03pvzn>

That is a much bolder claim, since it essentially says this is identity: they both refer to the same thing. And that may not always be true, as different databases may have different rules and different approaches to what entries mean - e.g. one database may talk about "book" meaning the content of the book regardless of how it is materialized, and another may mean by "book" a specific printed edition or even a specific physical object. It _might_ be appropriate for some cases, but I'm not sure we can say that for all cases where we have links between Wikidata and other databases.

-- Stas Malyshev smalyshev@wikimedia.org

Egon Willighagen

7 Mar

6:26 p.m.

On Mon, Mar 7, 2016 at 9:13 AM, Lydia Pintscher Lydia.Pintscher@wikimedia.de wrote:

...

Ok. I think we're making this much more complicated than necessary. The question you should ask yourself is: Does this identify a concept in another database/website/...? Nice to have: a website to link to. Once we have that we can look at corner cases and exceptions.

OK, thanks for the clarification. Then I will oppose arguments about uniqueness with my opinions, experiences, and arguments, and focus on this instead.

This helps a lot!

Egon

...

Wikimedia Deutschland e.V. Tempelhofer Ufer 23-24 10963 Berlin www.wikimedia.de

Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.

Registered in the register of associations of the Amtsgericht Berlin-Charlottenburg under the number 23855 Nz. Recognized as charitable by the Finanzamt für Körperschaften I Berlin, tax number 27/029/42207.

wikidata@lists.wikimedia.org

58 comments

19 participants

participants (19)

Andy Mabbett
Bene*
Daniel Kinzler
David Cuenca Tudela
Egon Willighagen
Gerard Meijssen
James Heald
Katie Filbert
Luca Martinelli
Luiz Augusto
Lydia Pintscher
Maarten Dammers
Magnus Manske
Markus Kroetzsch
Markus Krötzsch
Pine W
Stas Malyshev
Tom Morris
Young,Jeff (OR)