Multiple properties/identifiers for the same resource


Sebastian Burgstaller

27 Apr 2016

9:13 p.m.

Hi everyone,

I have lately been facing the following problem: many of the (biomedical) resources we import data from consist of several parts. For each of these parts, they either use a different identifier structure, or they use the same identifier structure with different accession URLs. This is true of very essential resources such as ChEMBL (e.g. compounds, targets, assays), the miRNA database, IUPHAR and others.

How should we represent and link to these resources properly in Wikidata?

The "easy" way is to propose a property for each part of a resource, which also allows the proper formatter URL to be specified. But this would certainly create several properties for the same resource.

The other way would be to specify a set of formatter URLs, but this currently fails anyway, as it has not been implemented (yet).

Maybe we could specify formatter URLs on a per-value basis, which could override the formatter URL specified in the property? But I guess this requires substantial dev time in Wikibase.

What are your thoughts/ideas?

Thanks!
Sebastian


Markus Kroetzsch

27 Apr

9:53 p.m.

On 27.04.2016 21:13, Sebastian Burgstaller wrote:

...

Doing this in Wikidata is tricky and takes time. I don't even see how to do it well (note that external tools like Reasonator or SQID would also need to implement the same smart resolution mechanism). Having several properties for the same thing just because of different ID types used does not seem very compelling either.

How about building a little external referrer service that redirects IDs to the correct resource based on their structure? This could be a simple PHP-based web service hosted on Labs. In the end, the formatter URL is just for users to click on, so as long as you end up at the right place, this little indirection is maybe no problem.

Cheers,
Markus
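To make the referrer-service idea above concrete, here is a minimal sketch of structure-based routing, written in Python rather than PHP purely for illustration. The ID prefixes, the `ROUTES` table, the `resolve` helper and the example.org URLs are all assumptions and are not taken from any of the databases discussed; a real service would wrap this lookup in an HTTP redirect.

```python
import re
from typing import Optional

# Hypothetical routing table: (ID pattern, URL template).
# Both the prefixes and the example.org targets are placeholders.
ROUTES = [
    (re.compile(r"MIMAT\d+"), "https://example.org/mirna/mature/{id}"),
    (re.compile(r"MI\d+"), "https://example.org/mirna/hairpin/{id}"),
]


def resolve(identifier: str) -> Optional[str]:
    """Return the target URL for an identifier, or None if no rule matches."""
    for pattern, template in ROUTES:
        # fullmatch: the whole identifier must fit the pattern
        if pattern.fullmatch(identifier):
            return template.format(id=identifier)
    return None
```

This only works when the structure of the identifier reveals which part of the resource it belongs to; identifiers that look alike across parts cannot be routed this way.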


-- Markus Kroetzsch Faculty of Computer Science Technische Universität Dresden +49 351 463 38486 http://korrekt.org/

Lydia Pintscher

28 Apr

12:59 p.m.

On Wed, Apr 27, 2016 at 9:54 PM Markus Kroetzsch < markus.kroetzsch(a)tu-dresden.de> wrote:

...

Yeah, I also don't have a good solution for this case so far. IMDB is also causing issues. I think creating a service on Labs is at least a good solution until we have something better built into Wikibase.

Cheers
Lydia

--
Lydia Pintscher - http://about.me/lydia.pintscher
Product Manager for Wikidata
Wikimedia Deutschland e.V.
Tempelhofer Ufer 23-24
10963 Berlin
www.wikimedia.de

Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
Registered in the register of associations of the Amtsgericht Berlin-Charlottenburg under number 23855 Nz. Recognized as a non-profit by the Finanzamt für Körperschaften I Berlin, tax number 27/029/42207.

Sebastian Burgstaller

10:36 p.m.

Thanks for your replies!

I like the idea of a resolver service on Labs, but I think it cannot solve the issue. For ChEMBL and IUPHAR (International Union of Pharmacology), for example, there is no way to infer from the identifier which sub-domain it belongs to. For ChEMBL, the pattern is a 'CHEMBL' prefix followed by several digits; for IUPHAR IDs, it is digits only.

For a resolver service to work in this case, it would either need to hold all ChEMBL IDs or would need to try all ~5 URLs in order to find the correct one. Neither approach seems sustainable at this point.

So a temporary workaround could be to add the correct URL as a qualifier, which I think would be doable, as most external IDs will not have additional qualifiers anyway.

Cheers,
Sebastian
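The two options weighed above can be sketched as follows. This is a rough illustration only: the sub-resource templates and example.org URLs are placeholders, and `candidate_urls` and `link_for` are hypothetical helpers, not Wikibase functions. The `$1` placeholder is the convention Wikidata formatter URLs use.

```python
import re

# Pattern as described in the message: 'CHEMBL' prefix followed by digits.
CHEMBL_ID = re.compile(r"CHEMBL\d+")

# Hypothetical URL templates, one per part of the resource.
CHEMBL_CANDIDATES = {
    "compound": "https://example.org/chembl/compound/{id}",
    "target": "https://example.org/chembl/target/{id}",
    "assay": "https://example.org/chembl/assay/{id}",
}


def candidate_urls(identifier):
    """Every URL a pattern-based resolver would have to try for one ID."""
    if not CHEMBL_ID.fullmatch(identifier):
        return []
    # The ID alone does not say which part it belongs to, so all remain.
    return [t.format(id=identifier) for t in CHEMBL_CANDIDATES.values()]


def link_for(identifier, formatter_url, qualifier_url=None):
    """Prefer a per-statement URL qualifier, else use the formatter URL."""
    if qualifier_url:
        return qualifier_url
    return formatter_url.replace("$1", identifier)
```

With the qualifier in place, the link is known without guessing; without it, the resolver is left probing every candidate.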


Markus Kroetzsch

11:06 p.m.

On 28.04.2016 22:36, Sebastian Burgstaller wrote:

...

One should also point out to the authorities maintaining these IDs that they should spend some effort on producing a workable solution for this. It seems they should be the first to provide a resolver service (or maybe it would be an "ID search engine" if it is so complicated). With the qualifiers in place, Wikidata can also be used to achieve this, of course, but it seems we are just manually reverse engineering something that should be done at the site of whoever is controlling the ID registration. Cheers, Markus


Julie McMurry

11:49 p.m.

...

"One should also point out to the authorities maintaining these IDs that they should spend some effort on producing a workable solution for this. It seems they should be the first to provide a resolver service (or maybe it would be an "ID search engine" if it is so complicated). With the qualifiers in place, Wikidata can also be used to achieve this, of course, but it seems we are just manually reverse engineering something that should be done at the site of whoever is controlling the ID registration."

Well said, Markus. Hearty agreement from my side; it is a point colleagues and I have been trying to raise awareness of for a long time now (http://bit.ly/id-guidance).

One of the challenges is that databases are already being asked to do more with less. They can see the utility of such a service to others, but when I've asked DBs before (not naming names), traction has been limp (I've yet to ask ChEMBL). Sometimes it works out though. For instance, KEGG used to have 12 different type-specific URLs, corresponding to: kegg.compound, kegg.disease, kegg.drug, kegg.environ, kegg.genes, kegg.genome, kegg.glycan, kegg.metagenome, kegg.module, kegg.orthology, kegg.pathway, kegg.reaction. Thankfully, they've collapsed those to a single URL pattern.

The databases that find it the toughest are not those that simply don't embed typing, but rather those that don't embed typing AND ALSO have local identifiers that would otherwise collide. For instance, a prominent bio database is in this boat (not naming names) and would like to make things better, but it is hard and messy due to the collisions.

FYI, 345 of the 560+ records in the identifiers.org corpus are type-specific at the level of identifiers.org's namespace; these roll up to ~300 providers.

The question though is what Wikidata is trying to accomplish. Say you encounter the ChEMBL ID CHEMBL308052 <http://linkedchemistry.info/chembl/chemblid/CHEMBL308052>: do you need to retrieve the type of the entity for reasons other than determining what URL to use? How are you representing entity labels / IDs to users?

Best,
Julie

Jerven Tjalling Bolleman

29 Apr

3:29 p.m.

Could I be so bold as to suggest that in Wikidata we should strive to use external URIs for identifiers, not strings?

For example, Wikidata holds a lot of UniProt accessions, e.g. behind the property https://www.wikidata.org/wiki/P352, and there is a formatter for a URL. I think this is the wrong way round, there should be an URL/URI there and a formatter to generate a local string for display purposes.

And of course for ChEMBL the URL/URI to use would be <http://rdf.ebi.ac.uk/resource/chembl/molecule/CHEMBL101690>.

There are two advantages to this. First, it allows easier federated queries from the source databases into Wikidata (no URI conversions etc.). Second, these URIs are clearly not ambiguous.

Regards,
Jerven


--
Jerven Bolleman, Jerven.Bolleman(a)sib.swiss
SIB Swiss Institute of Bioinformatics, Tel: +41 (0)22 379 58 85
CMU, rue Michel Servet 1, Fax: +41 (0)22 379 58 58
1211 Geneve 4, Switzerland
www.sib.swiss - www.uniprot.org
Follow us at https://twitter.com/#!/uniprot

Markus Kroetzsch

10:58 p.m.

I tend to agree with Jerven. He is right to say that URIs work best as identifiers. However, some things should still be kept in mind:

* The strings we are talking about are in fact IDs and not ambiguous: no string ID identifies multiple objects.
* The problem is in finding the right web page to refer a user to for each ID. URIs are often distinct from the URLs that users would like to read. It is even possible that there are already official URIs for some of the datasets we were talking about, and that these URIs do not help us in finding the right URL either.

In some datasets, the problem might be solved by switching to URIs, but this requires working content negotiation to redirect users when they open the URI in their browser. I have some doubts that we can find this for the problematic cases, given that they don't even have a simple redirection service for finding their URLs.

Moreover, there is the technical problem that the design that has been selected for distinguishing external IDs in Wikidata is such that these IDs must be of type string.

In a perfect world, Jerven's approach would still be the cleanest, I believe, but it might be impractical at the moment.

Cheers,
Markus


-- Markus Kroetzsch Faculty of Computer Science Technische Universität Dresden +49 351 463 38486 http://korrekt.org/

Julie McMurry

30 Apr

1:13 a.m.

...

"I think this is the wrong way round, there should be an URL/URI there and a formatter to generate a local string for display purposes." -JB

Where this URL/URI-to-display is possible, I agree with Jerven; however, I'm (so far) completely naive about Wikidata's ingest process (apologies). Sometimes that accession is all you've got to work with at the start. It is in some third party's database cross-reference tables and that's pretty much it. Even worse, sometimes the type of the encountered accession is implied by context but never stated outright (e.g. is it a gene? genotype? allele?). Many providers of such data have type-specific URLs, and integration fails catastrophically (either collides undetected or 404s) if the wrong type is assumed. However, if Wikidata is ingesting solely data that is already well documented with HTTP URIs, then I agree; start with the HTTP URI.

However, let's first unpack what we mean by 'display', as it is rather loaded.

Scenario one: whatever users 'see' baked into the URLs of the Wikidata interface (or other 3rd party #researchparasites like us at Monarch).

Scenario two: whatever is actually displayed most prominently in the 3rd party application web page itself.

Scenario three: whatever is represented in the land of 3rd party APIs and graph stores (RDF).

For scenario one, the main objective is stability and identifiability, rather than immediate human recognition. You could bake in an HTTP URI, for example: wikidata.org/record/uri=[http uri]. This works OK if:

A) you're more or less in control of the data you're "consuming", and
B) there's one primary URI, and
C) that URI is not volatile.

eagle-i federated search uses this approach, for instance: [image: Inline image 1]

However, 'control', 'uncontested URI' and 'permanence' occur infrequently and they almost never co-occur. If 3rd party consumers/integrators don't have a well-planned strategy a priori for collapsing distributed identifier equivalents, they may find it easy to *aggregate*, but difficult to *integrate*. This is especially so for highly accessed resources (like ChEMBL) with lots of relevant resources that are distributed and that reference the same ChEMBL entity in endlessly different ways.

For scenario two, the main objective is human recognition. On this I'm for whatever works (labels, names, pictures, whatever), provided that the anchor text is also linked to HTTP URIs and is at some point in the link chain transparent about its origins.

For scenario three, the main consumer of that format is the tech elite. Here, the HTTP URI is a first-class citizen; however, even in self-documenting formats like RDF/JSON, CURIE identifiers can still make things easier on the human eye. Manu Sporny has a candid and funny take on this <http://bit.ly/case-for-curies>.

In PrefixCommons <http://bit.ly/biocontext>, we adopt a JSON-LD context approach (a cornerstone of schema.org that is already in use in over 10 million websites in the web-at-large). It is efficient and scalable for handling these distributed identifiers, especially for scenarios 1 and 3 above.

This is the approach we use in Monarch. Wherever possible we display human-readable labels in the UI; however, what we embed in our URLs is always the location-independent CURIE version of the ID. I've added to our identifier documentation marked-up screenshots of our identifiers in action <http://bit.ly/monarch-ids> in our site, using NCBIGene:6622 as an example. [image: Inline image 2]

I've omitted the painful parts about the state of the identifiers when we find them in ingested data, but just as a quick illustration, see this (38 distinct shortform representations, and 14 distinct HTTP URIs) <http://bit.ly/ncbi-identifier-permutations>. Note that all these permutations correspond to a single identifier, NCBIGene:6622, not to a class of identifiers, or a class of providers, or to similar entities spread across disparate resources. The state of affairs in database lookups is much the same as what you see in the web-at-large, but documenting that would require a lot more effort.

The bottom line is that it doesn't matter how messy the identifiers are when you find them, or how many different ways you find them. You can always create a CURIE form that links to the right place, and you can use JSON-LD to provide context-specific direction. For instance: in the following contexts expand the CURIE to the ChEMBL webpage, in these other contexts, expand to the (sometimes unresolvable) HTTP URI, in these other contexts, expand to the Wikidata page for the entity. This approach is simple and it works.

Hope this was helpful?

Best wishes,
Julie

P.S. The heterogeneity of identifiers also has major implications for data citation in journals; however, I've not touched that subject here as it is quite different (there's an archived PDF that cannot be dynamically updated to change the identifiers or links).
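The context-specific CURIE expansion described above can be sketched in a few lines. This is an illustration under stated assumptions, not the Monarch or PrefixCommons implementation: the `CONTEXTS` maps, the context names "web" and "rdf", the `expand_curie` helper and the example.org base URLs are all hypothetical stand-ins for what a JSON-LD context would supply.

```python
# Hypothetical per-context prefix maps, loosely modelled on the prefix
# definitions a JSON-LD @context carries. The base URLs are placeholders.
CONTEXTS = {
    "web": {"NCBIGene": "https://example.org/gene/"},
    "rdf": {"NCBIGene": "https://example.org/rdf/ncbigene/"},
}


def expand_curie(curie, context):
    """Expand 'prefix:local_id' to a full URL using the chosen context."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not local_id:
        raise ValueError(f"not a CURIE: {curie!r}")
    base = CONTEXTS[context][prefix]  # KeyError if the prefix is unknown
    return base + local_id
```

The stored value stays the same short, location-independent CURIE; only the prefix map changes between contexts, so pointing a context at a different provider is a one-line edit.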

Julie McMurry

2 May

5:56 p.m.

...

"For instance: in the following contexts expand the CURIE to the ChEMBL webpage, in these other contexts, expand to the (sometimes unresolvable) HTTP URI, in these other contexts, expand to the Wikidata page for the entity."

This context-specific expansion is of course not to preclude the use of content negotiation, when that is actually provided. However, it is often useful to point, for instance, to HTML from provider A versus from provider B.

Was this helpful? We welcome any feedback or questions about the overall approach. Also, we are collecting evidence of identifier pain <http://bit.ly/evidence-of-identifier-pain>; we welcome anyone to add their scenarios to it.

Best wishes,
Julie

Egon Willighagen

8 p.m.

Hi Jerven, all On Fri, Apr 29, 2016 at 3:29 PM, Jerven Tjalling Bolleman <jerven.bolleman(a)sib.swiss> wrote:

...

What would you suggest for identifiers that do not have an official RDF serialization? Egon


-- E.L. Willighagen Department of Bioinformatics - BiGCaT Maastricht University (http://www.bigcat.unimaas.nl/) Homepage: http://egonw.github.com/ LinkedIn: http://se.linkedin.com/in/egonw Blog: http://chem-bla-ics.blogspot.com/ PubList: http://www.citeulike.org/user/egonw/tag/papers ORCID: 0000-0001-7542-0286 ImpactStory: https://impactstory.org/EgonWillighagen

Jerven Tjalling Bolleman

3 May

11:21 a.m.

Hi Egon, all,

If something is identifiable by a string but there is no official RDF serialization, then I would make the identifier property in question a subproperty of https://www.wikidata.org/wiki/Property:P973. The value should then be the "most" official URL of a web page having the thing identified by the string as its topic.

Let's take the example of Property:P351 (Entrez Gene ID). There is no official RDF for Entrez Gene, so I would have made it

    p:P351 instanceof p:P973 .

The formatter URL would then need a capturing group, because it is the inverse of the current logic:

    p:P351 p:P1630 "https://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=retrieve&dopt=default&list_uids=(.+)"

Then the example

    wd:Q14911732 p:P351 "1017"

would instead be recorded in the backend as

    wd:Q14911732 p:P351 <https://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=retrieve&dopt=default&list_uids=1017>

I do believe the current UI view is correct; I just think URLs/URIs should be the preferred solution for identifiers in Wikidata.

Regards,
Jerven
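The inverse formatter proposed above (URL stored, display string derived through a capturing group) might look like the sketch below. One caveat it surfaces: if the formatter is interpreted as a regular expression, characters such as `?` and `.` in the URL are metacharacters, so the literal part has to be escaped before the capturing group is appended. The names `URL_PREFIX`, `PATTERN` and `display_id` are hypothetical; only the URL itself comes from the message.

```python
import re

# Literal part of the Entrez Gene URL quoted in the message.
URL_PREFIX = (
    "https://www.ncbi.nlm.nih.gov/entrez/query.fcgi"
    "?db=gene&cmd=retrieve&dopt=default&list_uids="
)

# Escape the literal prefix ('?' and '.' are regex metacharacters),
# then append the capturing group that extracts the local ID.
PATTERN = re.compile(re.escape(URL_PREFIX) + r"(.+)")


def display_id(uri):
    """Derive the short display string from a stored URL, or None."""
    match = PATTERN.fullmatch(uri)
    return match.group(1) if match else None
```

Applied to the stored value for wd:Q14911732, `display_id` would recover the string "1017" that the UI shows today.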


Jerven Tjalling Bolleman

11:29 a.m.

Or in other words, I would like a formatter for Property:P1019, and most instances of wiki:Q19847637 should be instances of Property:P1019 instead.

Regards,
Jerven



participants (6)

Egon Willighagen
Jerven Tjalling Bolleman
Julie McMurry
Lydia Pintscher
Markus Kroetzsch
Sebastian Burgstaller