At PIDapalooza 2018, I discussed with a few people (not Wikidatans) how to document Persistent (or Unique) Identifiers.
As an example, I created 'KoreaMed Unique Identifier':
https://www.wikidata.org/wiki/Q47489994
How could we improve that? What additional properties might we need? What properties already exist, that we could make use of?
Also, this query:
returns over 5000 instances/subclasses of "unique identifier" (Q6545185), but it includes both /types/ of identifiers (like the example above) and individual identifier values, like ".ar" as an internet TLD ("domain name" itself, Q32635, is a subclass, not an instance, of UID). How should we distinguish between the two classes?
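A rough sketch of how the two kinds of result could be requested separately. This is not the query referred to above; it only assumes the standard Wikidata Query Service prefixes (wd:, wdt:) and the properties "instance of" (P31) and "subclass of" (P279). The helper name build_query is made up for illustration.

```python
UNIQUE_IDENTIFIER = "Q6545185"  # "unique identifier", as cited in the message


def build_query(root_qid: str, kind: str) -> str:
    """Return a SPARQL query string for the Wikidata Query Service.

    kind="subclasses": items reachable from the root via subclass of (P279).
    kind="instances":  items that are an instance of (P31) the root itself
                       or of any of its subclasses.
    """
    patterns = {
        "subclasses": f"?item wdt:P279+ wd:{root_qid} .",
        "instances": f"?item wdt:P31/wdt:P279* wd:{root_qid} .",
    }
    if kind not in patterns:
        raise ValueError(f"unknown kind: {kind!r}")
    return "SELECT DISTINCT ?item WHERE { " + patterns[kind] + " }"


if __name__ == "__main__":
    for kind in ("subclasses", "instances"):
        print(build_query(UNIQUE_IDENTIFIER, kind))
```

As long as "domain name" is modelled as a subclass of UID, the "instances" pattern would presumably match individual values such as ".ar" as well as identifier types, which is the ambiguity raised here.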
Hi Andy,
Thanks, there seems to be quite a lot of work to do in this area indeed!
On 20/02/2018 19:49, Andy Mabbett wrote:
As an example, I created 'KoreaMed Unique Identifier':
https://www.wikidata.org/wiki/Q47489994
How could we improve that? What additional properties might we need? What properties already exist, that we could make use of?
I have recently proposed to create a "number of records" property to store the number of identifiers in a given scheme:
https://www.wikidata.org/wiki/Wikidata:Property_proposal/number_of_records
This property could typically apply here. The idea behind this property is that we could compare its values to the number of uses of the corresponding property in Wikidata.
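To make the comparison concrete, here is a small sketch of the ratio one might compute. The function name and the numbers are invented for illustration; real values would come from the proposed "number of records" statement and from a count of the property's uses in Wikidata.

```python
def coverage(uses_in_wikidata: int, number_of_records: int) -> float:
    """Fraction of an identifier scheme's records that are used in Wikidata."""
    if number_of_records <= 0:
        raise ValueError("number_of_records must be positive")
    if uses_in_wikidata < 0:
        raise ValueError("uses_in_wikidata must not be negative")
    return uses_in_wikidata / number_of_records


if __name__ == "__main__":
    # Made-up figures: 250 uses of the property, 1000 records in the scheme.
    print(f"{coverage(250, 1000):.0%}")
```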
One other thing I would love to see happening on Wikidata is keeping track of the links between identifier schemes. If identifier X and identifier Y are both used by the same database Z, then we can probably use Z to "match" X to Y and conversely.
If we had many "uses (P2283)" and "used by (P1535)" statements to link identifiers to databases, we could then draw a graph of identifiers and databases using them. Given two identifiers, we could analyze the paths between these two identifiers…
For now the graph is a bit sparse: http://tinyurl.com/y89u3enx
(And you can already see one issue: even if we have a path from ORCID to ISNI, that does not mean that we can convert an ORCID id to an ISNI for the same person via this path, as GRID contains ISNIs for organizations only…)
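The path analysis could be sketched as a breadth-first search over "uses"/"used by" links treated as undirected edges. The edge list below is only the ORCID/GRID/ISNI example from this message, not data pulled from Wikidata, and shortest_path is a hypothetical helper name.

```python
from collections import deque


def shortest_path(edges, start, goal):
    """Return the shortest chain of identifiers/databases linking start to goal.

    edges is an iterable of (a, b) pairs, read as "a uses b" or "a is used
    by b"; direction is ignored. Returns None when no path exists.
    """
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    if start not in graph or goal not in graph:
        return None
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in sorted(graph[node]):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


# Illustrative edges taken from the example in the text.
EDGES = [("ORCID", "GRID"), ("GRID", "ISNI")]

if __name__ == "__main__":
    print(shortest_path(EDGES, "ORCID", "ISNI"))
```

A path found this way only says that the schemes are connected; as noted above, it does not by itself mean an identifier can be converted along that path.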
Thanks a lot Andy for adding such statements on https://www.wikidata.org/wiki/Q43649390 by the way!
Also, this query:
returns over 5000 instances/ subclasses of "unique identifier" (Q6545185) but includes both /types/ of identifiers (like the example above) and individual identifier values, like ".ar" as an internet TLD (domain name itself - Q32635 - is a subclass, not an instance, of UID)
- how should we distinguish between the two classes?
Urgh, that's messy. I think I would just change the ontology: "domain name" (Q32635) should not be a subclass of "unique identifier" (Q6545185), but rather an instance of it. (Actually the uniqueness is debatable, I don't think DNS is meant to enforce any uniqueness at all, as it is very common for a website to have multiple domain names. So maybe just "domain name" "instance of" "identifier (Q853614)" would do).
Antonin
So, we should be able to formally specify the "domain" of identifiers. Perhaps that could be derived from the type constraints in linked properties, but I think it would make sense as an explicit property on the identifier.
Some identifiers, e.g., GND, VIAF, require special attention because they span multiple domains. A person identifier from GND offers different opportunities for cross-linking than an organization identifier. No idea so far how to handle that ...
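One way to picture an explicit "domain" on each identifier is a set of entity types per scheme, with cross-linking considered only where the sets overlap. The table below is illustrative and hand-written, the domain labels are assumptions, and shared_domains is a hypothetical helper; nothing here reflects an existing Wikidata property.

```python
# Illustrative, hand-written domains per identifier scheme (assumed labels).
DOMAINS = {
    "ORCID": {"person"},
    "GRID": {"organization"},
    "GND": {"person", "organization"},  # spans multiple domains
}


def shared_domains(scheme_a, scheme_b, domains=DOMAINS):
    """Return the entity types for which both schemes issue identifiers."""
    return domains.get(scheme_a, set()) & domains.get(scheme_b, set())


if __name__ == "__main__":
    print(shared_domains("GND", "ORCID"))
    print(shared_domains("ORCID", "GRID"))
```

For a multi-domain scheme such as GND, the overlap would have to be evaluated per identifier rather than per scheme, which is the open question here.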
Cheers, Joachim
-----Original Message-----
From: Wikidata [mailto:wikidata-bounces@lists.wikimedia.org] On behalf of Antonin Delpeuch (lists)
Sent: Wednesday, 21 February 2018 10:57
To: wikidata@lists.wikimedia.org
Subject: Re: [Wikidata] Metadata about Persistent Identifiers
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
On 21 February 2018 at 12:03, Neubert, Joachim <J.Neubert@zbw.eu> wrote:
So, we should be able to formally specify the "domain" of identifiers. Perhaps that could be derived from the type constraints in linked properties, but I think it would make sense as an explicit property on the identifier.
Main subject (P921)?
I certainly don't think users should have to query properties to find metadata about concepts.
Well - P921 is described as "primary topic of a work", and is an instance of "Wikidata property for items about works". Two possible issues:
- "domain" is a much clearer restriction than "main subject/primary topic", which implies that there may be other, secondary subjects. For identifiers, the more formal restriction normally applies.
- defining an identifier as a work seems a bit of a stretch - yet, no domain is given here :-)
So perhaps a new property is needed?
Cheers, Joachim
-----Original Message-----
From: Wikidata [mailto:wikidata-bounces@lists.wikimedia.org] On behalf of Andy Mabbett
Sent: Wednesday, 21 February 2018 13:16
To: Discussion list for the Wikidata project
Subject: Re: [Wikidata] Metadata about Persistent Identifiers
-- Andy Mabbett @pigsonthewing http://pigsonthewing.org.uk
Overall, how do we deal with this duplication of information (on the item about the identifier and on the corresponding Wikidata property)?
We do need to have items about unique identifiers (because they can have sitelinks) so would it make sense to make sure every Wikidata property for an ID is matched to a corresponding item about that id?
The problem is that many of our ids are not even advertised as UIDs by the people who mint them (many are just a primary key of some database leaking out from a URL pattern), so it would feel quite artificial to have items about these (but that's just my gut feeling).
Antonin
On 21 February 2018 at 13:18, Antonin Delpeuch (lists) <lists@antonin.delpeuch.eu> wrote:
We do need to have items about unique identifiers (because they can have sitelinks) so would it make sense to make sure every Wikidata property for an ID is matched to a corresponding item about that id?
The problem is that many of our ids are not even advertised as UIDs by the people who mint them (many are just a primary key of some database leaking out from a URL pattern), so it would feel quite artificial to have items about these (but that's just my gut feeling).
My preference would be to have an item for all of them, but even if others disagree, we definitely need items for those documented as PIDs/UIDs in external sources, such as identifiers.org.
On 21 February 2018 at 09:57, Antonin Delpeuch (lists) <lists@antonin.delpeuch.eu> wrote:
On 20/02/2018 19:49, Andy Mabbett wrote:
this query:
returns over 5000 instances/ subclasses of "unique identifier" (Q6545185) but includes both /types/ of identifiers (like the example above) and individual identifier values, like ".ar" as an internet TLD (domain name itself - Q32635 - is a subclass, not an instance, of UID)
- how should we distinguish between the two classes?
I think I would just change the ontology: "domain name" (Q32635) should not be a subclass of "unique identifier" (Q6545185), but rather an instance of it.
I've done that; but it has only reduced the number of results to 3,456 - what else needs changing?
Consider, for instance:
https://www.wikidata.org/wiki/Q3596440
an "instance of" a "telephone numbering plan" (Q103903)
(Actually the uniqueness is debatable, I don't think DNS is meant to enforce any uniqueness at all, as it is very common for a website to have multiple domain names.
Surely the uniqueness works in the other direction? A UID identifies one unique subject; the subject can have multiple UIDs.
On 21/02/2018 13:55, Andy Mabbett wrote:
Consider, for instance:
https://www.wikidata.org/wiki/Q3596440
an "instance of" a "telephone numbering plan" (Q103903)
Hmm, this becomes a bit tricky: maybe it's actually a legitimate identifier? I don't see any error in the chain…
(Actually the uniqueness is debatable, I don't think DNS is meant to enforce any uniqueness at all, as it is very common for a website to have multiple domain names.
Surely the uniqueness works in the other direction? A UID identifies one unique subject; the subject can have multiple UIDs.
Oh that's interesting! I would intuitively say that "identifier" already includes that uniqueness (otherwise we cannot really say that it identifies anything). Of course a given subject can have UIDs in different identifier schemes, issued by different bodies.
So I would say the "unique" bit in "unique identifier" refers to the other direction (an entity has at most one such identifier in the particular scheme). That is the uniqueness that is hard to obtain: it is hard to ensure that a large database does not contain duplicates (and it is much easier to ensure that a given string points to at most one record in that database). So, when we say "unique identifier", I would intuitively say that the uniqueness we stress is the valuable one.
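The two directions of uniqueness discussed here can be checked separately on a list of (identifier, entity) pairs. The sketch below uses invented sample data and a hypothetical helper name; it reports identifiers that point to more than one entity, and entities that carry more than one identifier within the same scheme (i.e. duplicates).

```python
from collections import defaultdict


def check_uniqueness(pairs):
    """Return (ambiguous_ids, duplicated_entities) for one identifier scheme.

    ambiguous_ids:       identifiers that point to more than one entity.
    duplicated_entities: entities that have more than one identifier.
    """
    entities_by_id = defaultdict(set)
    ids_by_entity = defaultdict(set)
    for identifier, entity in pairs:
        entities_by_id[identifier].add(entity)
        ids_by_entity[entity].add(identifier)
    ambiguous_ids = {i for i, ents in entities_by_id.items() if len(ents) > 1}
    duplicated_entities = {e for e, ids in ids_by_entity.items() if len(ids) > 1}
    return ambiguous_ids, duplicated_entities


if __name__ == "__main__":
    # Invented sample: entity "A" has two records in the same scheme.
    sample = [("id1", "A"), ("id2", "A"), ("id3", "B")]
    print(check_uniqueness(sample))
```

In this sample every identifier still points to exactly one entity, so only the second kind of uniqueness is violated.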
But I might be wrong! What do others think?
Antonin