Hi everyone,
Lately I have been facing the following problem: many of the (biomedical) resources we import data from consist of several parts. For each of these parts, they use either a different identifier structure, or the same identifier structure but with different accession URLs. This is the case for very essential resources like ChEMBL (e.g. compounds, targets, assays), the miRNA database, IUPHAR and others.
How should we represent and link to these resources properly in Wikidata? The "easy" way is to propose a separate property for each part of a resource, which also allows specifying the proper formatter URL. But this would certainly create several properties for the same resource.
The other way would be to specify a set of formatter URLs, but this currently fails anyway, as it has not been implemented (yet). Maybe we could specify formatter URLs on a per-value basis, which would override the formatter URL specified in the property? But I guess this would require substantial dev time in Wikibase.
What are your thoughts/ideas?
Thanks!
Sebastian
On 27.04.2016 21:13, Sebastian Burgstaller wrote:
> What are your thoughts/ideas?
Doing this in Wikidata is tricky and takes time. I don't even see how to do it well (note that external tools like Reasonator or SQID would also need to implement the same smart resolution mechanism). Having several properties for the same thing just because of different ID types used does not seem very compelling either.
How about building a little external referrer service that redirects IDs to the correct resource based on their structure? This could be a simple PHP-based web service hosted on Labs. In the end, the formatter URL is just for users to click on, so as long as you end up at the right place, this little indirection is maybe no problem.
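A minimal sketch of the core lookup of such a referrer service, assuming the ID structure is enough to tell the parts of a resource apart. The patterns and the example.org URL templates are illustrative assumptions, not real formatter URLs; a deployed service would answer with an HTTP redirect to the returned URL.

```python
import re

# Hypothetical routing table: (ID pattern, URL template).
# Both the patterns and the example.org templates are placeholders.
ROUTES = [
    (re.compile(r"^CHEMBL\d+$"), "https://example.org/chembl/{id}"),
    (re.compile(r"^MIMAT\d+$"), "https://example.org/mirna/mature/{id}"),
    (re.compile(r"^MI\d+$"), "https://example.org/mirna/hairpin/{id}"),
]


def resolve(identifier):
    """Return the target URL for an identifier, or None if no pattern matches."""
    for pattern, template in ROUTES:
        if pattern.match(identifier):
            return template.format(id=identifier)
    return None


if __name__ == "__main__":
    print(resolve("MIMAT0000062"))
    print(resolve("unknown-id"))
```

This only works when the identifier itself reveals which part of the resource it belongs to; identifiers that look the same across parts cannot be routed this way.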
Cheers,
Markus
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
On Wed, Apr 27, 2016 at 9:54 PM Markus Kroetzsch <markus.kroetzsch@tu-dresden.de> wrote:
> How about building a little external referrer service that redirects IDs to the correct resource based on their structure?
Yeah I also don't have a good solution for this case so far. IMDB is also causing issues. I think creating a service on labs is at least a good solution until we have something better built into Wikibase.
Cheers Lydia
Thanks for your replies!
I like the idea of a resolver service on Labs, but I think it cannot solve the issue. E.g. for ChEMBL and IUPHAR (International Union of Pharmacology), there is no way to infer from the identifier which sub-domain it belongs to. For ChEMBL, the pattern is a 'CHEMBL' prefix followed by several digits; for IUPHAR IDs, digits only. For a resolver service to work in this case, it would either need to hold all ChEMBL IDs or need to try all ~5 URLs in order to find the correct one. Neither approach seems sustainable at this point.
So a temporary workaround could be to add the correct url as a qualifier, which I think would be doable, as most external ids will not have additional qualifiers anyway.
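A rough sketch of how a consumer could apply this workaround: use the URL stored as a qualifier when there is one, and fall back to the property's formatter URL otherwise. The Statement class, the field names and the example.org URLs are assumptions made for illustration; only the "$1" placeholder follows the formatter URL convention.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical property-level formatter URL; "$1" stands for the ID value.
PROPERTY_FORMATTER = "https://example.org/compound/$1"


@dataclass
class Statement:
    value: str                           # the external ID string
    url_qualifier: Optional[str] = None  # explicit URL stored as a qualifier


def link_for(statement, formatter=PROPERTY_FORMATTER):
    """Prefer the per-statement URL qualifier; fall back to the formatter URL."""
    if statement.url_qualifier:
        return statement.url_qualifier
    return formatter.replace("$1", statement.value)


if __name__ == "__main__":
    print(link_for(Statement("CHEMBL1")))
    print(link_for(Statement("CHEMBL2", "https://example.org/target/CHEMBL2")))
```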
Cheers, Sebastian
On Thu, Apr 28, 2016 at 3:59 AM, Lydia Pintscher <Lydia.Pintscher@wikimedia.de> wrote:
> I think creating a service on labs is at least a good solution until we have something better built into Wikibase.
> Cheers Lydia
> --
> Lydia Pintscher - http://about.me/lydia.pintscher
> Product Manager for Wikidata
> Wikimedia Deutschland e.V., Tempelhofer Ufer 23-24, 10963 Berlin, www.wikimedia.de
> Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V. (Society for the Promotion of Free Knowledge). Registered in the register of associations of the Amtsgericht Berlin-Charlottenburg under number 23855 Nz. Recognized as charitable by the Finanzamt für Körperschaften I Berlin, tax number 27/029/42207.
On 28.04.2016 22:36, Sebastian Burgstaller wrote:
> So a temporary workaround could be to add the correct url as a qualifier, which I think would be doable, as most external ids will not have additional qualifiers anyway.
One should also point out to the authorities maintaining these IDs that they should spend some effort on producing a workable solution for this. It seems they should be the first to provide a resolver service (or maybe it would be an "ID search engine" if it is so complicated).
With the qualifiers in place, Wikidata can also be used to achieve this, of course, but it seems we are just manually reverse engineering something that should be done at the site of whoever is controlling the ID registration.
Cheers,
Markus
"One should also point out to the authorities maintaining these IDs that they should spend some effort on producing a workable solution for this. It seems they should be the first to provide a resolver service (or maybe it would be an "ID search engine" if it is so complicated).
With the qualifiers in place, Wikidata can also be used to achieve this, of course, but it seems we are just manually reverse engineering something that should be done at the site of whoever is controlling the ID registration."
Well said, Markus. I heartily agree; this is something colleagues and I have been trying to raise awareness of for a long time now (http://bit.ly/id-guidance). One of the challenges is that databases are already being asked to do more with less. They can see the utility of such a service to others, but when I've asked DBs before (not naming names), traction has been limp (I've yet to ask ChEMBL). Sometimes it works out though. For instance, KEGG used to have 12 different type-specific URLs, corresponding to:
kegg.compound kegg.disease kegg.drug kegg.environ kegg.genes kegg.genome kegg.glycan kegg.metagenome kegg.module kegg.orthology kegg.pathway kegg.reaction
Thankfully, they've collapsed those to a single URL pattern.
The databases that find it the toughest are not those who simply don't embed typing, but rather those that don't embed typing AND ALSO have local identifiers that would otherwise collide. For instance, a prominent bio database is in this boat (not naming names) and would like to make things better but it is hard and messy due to the collisions.
FYI 345 of the 560+ records in the identifiers.org corpus are type-specific at the level of identifiers.org's namespace; these roll up to ~300 providers.
The question though is what Wikidata is trying to accomplish. Say you encounter the ChEMBL ID CHEMBL308052 (http://linkedchemistry.info/chembl/chemblid/CHEMBL308052): do you need to retrieve the type of the entity for reasons other than determining what URL to use?
How are you representing entity labels / IDs to users?
Best, Julie
Could I be so bold as to suggest that in Wikidata we should strive to use external URIs for identifiers, not strings?
For example, there are a lot of UniProt accessions in Wikidata, e.g. behind the property https://www.wikidata.org/wiki/P352, and there is a formatter for a URL.
I think this is the wrong way round: there should be a URL/URI there and a formatter to generate a local string for display purposes.
And of course for ChEMBL the URL/URI to use would be
<http://rdf.ebi.ac.uk/resource/chembl/molecule/CHEMBL101690>
There are two advantages to this. First, it allows easier federated queries from the source databases into Wikidata (no URI conversions etc.). Second, these URIs are clearly not ambiguous.
Regards, Jerven
On 28/04/16 23:49, Julie McMurry wrote:
> How are you representing entity labels / IDs to users?
I tend to agree with Jerven. He is right to say that URIs work best as identifiers. However, some things should still be kept in mind:
* The strings we are talking about are in fact IDs and not ambiguous: no string id identifies multiple objects.
* The problem is in finding the right web page to refer a user to for each ID. URIs are often distinct from the URLs that users would like to read. It is even possible that there are already official URIs for some of the datasets we were talking about, and that these URIs do not help us in finding the right URL either.
In some datasets, the problem might be solved by switching to URIs, but this requires a working content negotiation to redirect users when they open the URI in their browser. I have some doubts that we can find this for the problematic cases, given that they don't even have a simple redirection service for finding their URLs.
Moreover, there is the technical problem that the design that has been selected for distinguishing external IDs in Wikidata is such that these IDs must be of type string.
In a perfect world, Jerven's approach would still be the cleanest, I believe, but it might be impractical at the moment.
Cheers,
Markus
On 29.04.2016 15:29, Jerven Tjalling Bolleman wrote:
> Could I be so bold to suggest that in Wikidata we should strive to use external URI's for identifiers not Strings.
"I think this is the wrong way round, there should be an URL/URI there and a formatter to generate a local string for display purposes." -JB
Where this URL/URI-to-display is possible, I agree with Jerven; however I'm (so far) completely naive about Wikidata's ingest process (apologies). Sometimes that accession is all you've got to work with at the start. It is in some third party's database cross-reference tables and that's pretty much it.
Even worse, sometimes the type of the encountered accession is implied by context but never stated outright (e.g. is it a gene? genotype? allele?). Many providers of such data have type-specific URLs, and integration fails catastrophically (either collides undetected or 404s) if the wrong type is assumed. However, if Wikidata is ingesting solely data that is *already well documented with HTTP URIs*, then I agree; start with the HTTP URI.
However, let's first unpack what we mean by 'display' as it is rather loaded.
*Scenario one:* whatever users 'see' baked into the URLs of the Wikidata interface (or other 3rd party #researchparasites like us at Monarch).
*Scenario two:* whatever is actually displayed most prominently in the 3rd party application web page itself.
*Scenario three:* whatever is represented in the land of 3rd party API and graph stores (RDF).
*For scenario one*, the main objective is stability and identifiability, rather than immediate human recognition. You could bake in an HTTP URI, for example: wikidata.org/record/uri=[http uri]. This works OK if:
A) you're more-or-less in control of the data you're "consuming",
B) there's one primary URI, and
C) that URI is not volatile.
eagle-i federated search uses this approach, for instance:
[image: Inline image 1]
However, *control*, *uncontested URI* and *permanence* occur infrequently and they almost never co-occur. If 3rd party consumers/integrators don't have a well-planned strategy a priori for collapsing distributed identifier equivalents, they may find it easy to *aggregate*, but difficult to *integrate*. This is especially true for highly-accessed resources (like ChEMBL) with lots of relevant resources that are distributed and that reference the same ChEMBL entity in endlessly different ways.
*For scenario two*, the main objective is human recognition. On this I'm for whatever works (labels, names, pictures, whatever), provided that the anchor text is also linked to HTTP URIs and is, at some point in the link chain, transparent about its origins.
*For scenario three*, the main consumer of that format is the tech elite. Here, the HTTP URI is a first-class citizen; however, even here, the self-documenting nature of formats like RDF/JSON means that CURIE identifiers can still be made easier on the human eye. Manu Sporny has a candid and funny take on this: http://bit.ly/case-for-curies
In PrefixCommons (http://bit.ly/biocontext), we adopt a JSON-LD context approach (a cornerstone of schema.org that is already in use in over 10 million websites in the web-at-large). It is efficient and scalable for handling these distributed identifiers, especially for scenarios 1 and 3 above.
This is the approach we use in Monarch. Wherever possible we display human-readable labels in the UI; however what we embed in our URLs are always the location-independent curie version of the ID. I've added to our identifier documentation marked up screenshots of our identifiers in action http://bit.ly/monarch-ids in our site using NCBIGene:6622 as an example. [image: Inline image 2]
I've omitted the painful parts about the state of the identifiers when we find them in ingested data, but just as a quick illustration, see this (38 distinct shortform representations, and 14 distinct HTTP URIs): http://bit.ly/ncbi-identifier-permutations. Note that all these permutations correspond to *a single identifier, NCBIGene:6622*, not to a class of identifiers, or a class of providers, or to similar entities spread across disparate resources. The state of affairs in database lookups is much the same as what you see in the web-at-large, but to document that would require a lot more effort.
The bottom line is that it doesn't matter how messy the identifiers are when you find them. Or how many different ways you find them. You can always create a curie form that links to the right place and you can use JSON-LD to provide context-specific direction.
For instance: in the following contexts expand the CURIE to the chembl webpage, in these other contexts, expand to the (sometimes unresolvable) HTTP URI, in these other contexts, expand to the wikidata page for the entity. This approach is simple and it works.
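A small sketch of the context-specific expansion described here, using plain dictionaries as stand-ins for JSON-LD contexts. The context names, the prefix and the example.org base URLs are hypothetical and only illustrate the mechanism.

```python
# Hypothetical prefix maps, one per context, in the spirit of a JSON-LD @context.
CONTEXTS = {
    "webpage": {"CHEMBL": "https://example.org/chembl/page/"},
    "rdf": {"CHEMBL": "http://example.org/chembl/rdf/"},
}


def expand_curie(curie, context):
    """Expand 'PREFIX:local_id' using the prefix map of the given context."""
    prefix, sep, local_id = curie.partition(":")
    if not sep:
        raise ValueError(f"not a CURIE: {curie!r}")
    base = CONTEXTS[context].get(prefix)
    if base is None:
        raise KeyError(f"prefix {prefix!r} is not defined in context {context!r}")
    return base + local_id


if __name__ == "__main__":
    print(expand_curie("CHEMBL:CHEMBL308052", "webpage"))
    print(expand_curie("CHEMBL:CHEMBL308052", "rdf"))
```

The same CURIE stays stable in URLs and data, while the choice of context decides where it points.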
Hope this was helpful?
Best wishes, Julie
ps. The heterogeneity of identifiers also has major implications for data citation in journals; however, I've not touched that subject here as it is quite different (there's an archived PDF that can not be dynamically updated to change the identifiers or links).
"For instance: in the following contexts expand the CURIE to the chembl webpage, in these other contexts, expand to the (sometimes unresolvable) HTTP URI, in these other contexts, expand to the wikidata page for the entity."
This context-specific expansion is of course not to preclude the use of content negotiation, when that is actually provided. However, it is often useful to point, for instance, to HTML from provider A versus from provider B.
Was this helpful? We welcome any feedback or questions about the overall approach.
Also, we are collecting evidence of identifier pain http://bit.ly/evidence-of-identifier-pain; we welcome anyone to add their scenarios to it.
Best wishes, Julie
Hi Jerven, all
On Fri, Apr 29, 2016 at 3:29 PM, Jerven Tjalling Bolleman <jerven.bolleman@sib.swiss> wrote:
> And of course for chembl the URL/URI to use would be
> <http://rdf.ebi.ac.uk/resource/chembl/molecule/CHEMBL101690>
What would you suggest for identifiers that do not have an official RDF serialization?
Egon
Hi Egon, All,
If something is identifiable by a string but there is no official RDF serialization, then I would make the identifier property in question a subproperty of
https://www.wikidata.org/wiki/Property:P973
The value should then be the "most" official URL of a webpage having the thing identified by the string as its topic.
Let's take the example of Property:P351 (Entrez Gene ID). There is no official RDF for Entrez Gene, so I would have made it a
p:P351 instanceof p:P973 .
Then formatter URL would need a capturing group because it is the inverse of the current logic.
p:P351 p:P1630 "https://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=retrieve&...)"
Then the example
wd:Q14911732 p:P351 "1017"
Would instead be recorded in the backend as
wd:Q14911732 p:P351 https://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=retrieve&dopt=default&list_uids=1017
I do believe the current UI view is correct; I just think URLs/URIs should be the preferred solution for identifiers in Wikidata.
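To illustrate the inverse logic, here is a hedged sketch of a display formatter with a capturing group. The regular expression is an assumption built from the full Entrez Gene URL in the example above; it is not the value of any existing property.

```python
import re

# Assumed pattern derived from the full example URL; the capturing group
# holds the local ID that would be shown to users.
URL_PATTERN = re.compile(
    r"^https://www\.ncbi\.nlm\.nih\.gov/entrez/query\.fcgi"
    r"\?db=gene&cmd=retrieve&dopt=default&list_uids=(\d+)$"
)


def display_id(url):
    """Return the local ID captured from a stored URL, or None if it does not match."""
    match = URL_PATTERN.match(url)
    return match.group(1) if match else None


if __name__ == "__main__":
    stored = (
        "https://www.ncbi.nlm.nih.gov/entrez/query.fcgi"
        "?db=gene&cmd=retrieve&dopt=default&list_uids=1017"
    )
    print(display_id(stored))
```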
Regards, Jerven
On 02/05/16 20:00, Egon Willighagen wrote:
> What would you suggest for identifiers that do not have an official RDF serialization?
--
Jerven Bolleman Jerven.Bolleman@sib.swiss SIB Swiss Institute of Bioinformatics Tel: +41 (0)22 379 58 85 CMU, rue Michel Servet 1 Fax: +41 (0)22 379 58 58 1211 Geneve 4, Switzerland www.sib.swiss - www.uniprot.org Follow us at https://twitter.com/#!/uniprot
Or in other words, I would like a formatter for Property:P1019, and most instances of wiki:Q19847637 should be instances of Property:P1019 instead.
Regards, Jerven
On 03/05/16 11:21, Jerven Tjalling Bolleman wrote:
Hi Egon, All,
If something is identifiable by a string but there is no official RDF serialization, then I would make the identifier property in question a subproperty of
https://www.wikidata.org/wiki/Property:P973
The value should then be the "most" official url of a webpage having the thing identified by the string as its topic.
Let's take the example of Property:P351 (Entrez Gene ID). There is no official RDF for Entrez Gene, so I would have made it a
p:P351 instanceof p:P973 .
Then the formatter URL would need a capturing group, because it is the inverse of the current logic.
p:P351 p:P1630 "https://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=retrieve&...)"
Then the example
wd:Q14911732 p:P351 "1017"
Would instead be recorded in the backend as
wd:Q14911732 p:P351 https://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=retrieve&dopt=default&list_uids=1017
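To make the inverse logic concrete, here is a minimal Python sketch; it is not something Wikibase implements. The regular expression is an assumption built from the full Entrez Gene URL in the example above (the formatter URL as quoted is truncated), and the names `ENTREZ_GENE_PATTERN` and `display_id` are made up for illustration.

```python
import re
from typing import Optional

# Hypothetical inverse formatter: the capturing group pulls the local
# identifier back out of the stored URL. The pattern is assumed from the
# example URL, not taken from any real property definition.
ENTREZ_GENE_PATTERN = re.compile(
    r"https://www\.ncbi\.nlm\.nih\.gov/entrez/query\.fcgi"
    r"\?db=gene&cmd=retrieve&dopt=default&list_uids=(\d+)"
)

def display_id(url: str) -> Optional[str]:
    """Return the identifier to show in the UI, or None if the URL does not match."""
    match = ENTREZ_GENE_PATTERN.fullmatch(url)
    return match.group(1) if match else None

stored = ("https://www.ncbi.nlm.nih.gov/entrez/query.fcgi"
          "?db=gene&cmd=retrieve&dopt=default&list_uids=1017")
print(display_id(stored))  # prints 1017
```

The backend would hold the full URL, and the UI would still show "1017" as it does now.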
I do believe the current UI view is correct; I just think URLs/URIs should be the preferred solution for identifiers in Wikidata.
Regards, Jerven
On 02/05/16 20:00, Egon Willighagen wrote:
Hi Jerven, all
On Fri, Apr 29, 2016 at 3:29 PM, Jerven Tjalling Bolleman jerven.bolleman@sib.swiss wrote:
Could I be so bold as to suggest that in Wikidata we should strive to use external URIs for identifiers, not strings?
For example, Wikidata holds a lot of UniProt accessions, e.g. behind the property https://www.wikidata.org/wiki/P352, and there is a formatter for a URL.
I think this is the wrong way round: there should be a URL/URI there and a formatter to generate a local string for display purposes.
And of course for chembl the URL/URI to use would be
<http://rdf.ebi.ac.uk/resource/chembl/molecule/CHEMBL101690>
There are two advantages to this. The first is that it allows easier federated queries from the source databases into Wikidata (no URI conversions etc.). The second is that these URIs are clearly not ambiguous.
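A formatter that goes from URI to display string could be very small. The Python sketch below assumes the local identifier is always the last path segment of the URI, which holds for the ChEMBL example but would need checking per resource; `local_label` is a hypothetical name.

```python
from urllib.parse import urlsplit

def local_label(uri: str) -> str:
    """Use the last path segment of the URI as the display string (an assumption)."""
    path = urlsplit(uri).path.rstrip("/")
    return path.rsplit("/", 1)[-1]

uri = "http://rdf.ebi.ac.uk/resource/chembl/molecule/CHEMBL101690"
print(local_label(uri))  # prints CHEMBL101690
```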
What would you suggest for identifiers that do not have an official RDF serialization?
Egon
Regards, Jerven
On 28/04/16 23:49, Julie McMurry wrote: