One way to achieve this is to fetch all labels and aliases for all chemical compounds in one query and store them locally in your web application. This is certainly only feasible as long as the number of compounds in Wikidata does not get too big. Currently, the query takes ~6 sec.
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?cmpnd ?label WHERE {
  { ?cmpnd wdt:P279 wd:Q11173 . } UNION { ?cmpnd wdt:P31 wd:Q11173 . }
  ?cmpnd rdfs:label ?label .
}
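To illustrate the local-cache idea, here is a minimal Python sketch. It assumes the public endpoint at https://query.wikidata.org/sparql returns standard SPARQL JSON results; the helper names (`fetch_bindings`, `build_index`, `autocomplete`) and the User-Agent string are made up for this example, and only `fetch_bindings` touches the network.

```python
import json
import urllib.parse
import urllib.request
from bisect import bisect_left

ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?cmpnd ?label WHERE {
  { ?cmpnd wdt:P279 wd:Q11173 . } UNION { ?cmpnd wdt:P31 wd:Q11173 . }
  ?cmpnd rdfs:label ?label .
}
"""


def fetch_bindings(query=QUERY):
    """Run the query once and return the raw result rows (needs network access)."""
    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    # Hypothetical User-Agent value; replace it with one identifying your application.
    request = urllib.request.Request(url, headers={"User-Agent": "compound-autocomplete-sketch/0.1"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)["results"]["bindings"]


def build_index(bindings):
    """Turn result rows into a sorted list of (lowercased label, label, language, entity URI)."""
    rows = []
    for binding in bindings:
        label = binding["label"]
        rows.append((label["value"].lower(), label["value"],
                     label.get("xml:lang", ""), binding["cmpnd"]["value"]))
    rows.sort()
    return rows


def autocomplete(index, prefix, limit=10):
    """Return up to `limit` rows whose label starts with `prefix`, ignoring case."""
    prefix = prefix.lower()
    # Binary search for the first row whose lowercased label is >= the prefix.
    start = bisect_left(index, (prefix,))
    matches = []
    for row in index[start:]:
        if len(matches) >= limit or not row[0].startswith(prefix):
            break
        matches.append(row)
    return matches
```

A lookup is then `autocomplete(build_index(fetch_bindings()), "benz")`, served from memory after the one-off download. Since the query does not restrict the label language, the index covers labels in every language that Wikidata has for these items.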
Best, Sebastian (sebotic)
Hi all! I'm building a web application where users can search for protein/compound/etc. names and view their 3D structure using WebGL. I'm currently using the PubChem (chemical compounds database) API to provide some autocomplete data, but I found that Wikidata also has many chemical compound names with PubChem indices! The most important reason to try to autocomplete compound names via Wikidata is to allow users to search in different languages; PubChem generally only provides English names. However, I could not find a suitable API for this. I tried building a SPARQL query, but that quickly became very slow. I also could not find an option to limit full-text searches to a specific subclass in the search API provided by: https://www.wikidata.org/w/api.php?action=help&modules=wbsearchentities. Do you have any ideas? The only option I see for now is iterating over each response entity and looking up its subclass of/instance of property.
On Mon, Apr 25, 2016 at 7:23 PM, Sebastian Burgstaller sebastian.burgstaller@gmail.com wrote:
One way to achieve this is to fetch all labels and aliases for all chemical compounds in one query and store them locally in your web application. This is certainly only feasible as long as the number of compounds in Wikidata does not get too big. Currently, the query takes ~6 sec.
But the search time goes down when you have something to search on, it seems... the following query takes <1.5s:
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?cmpnd ?label WHERE {
  { ?cmpnd wdt:P279 wd:Q11173 . } UNION { ?cmpnd wdt:P31 wd:Q11173 . }
  ?cmpnd rdfs:label ?label .
  FILTER (strstarts(?label, "a"))
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
BTW, like Magnus said... if you only want to find things with the PubChem compound identifier, you could take that route:
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?cmpnd ?label ?pubchemid WHERE {
  ?cmpnd wdt:P662 ?pubchemid .
  ?cmpnd rdfs:label ?label .
  FILTER (strstarts(?label, "a"))
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
But I am not sure that is a lot faster...
Also keep in mind that the query service seems to do a reasonable job at caching search results...
Egon
Hmm, with a contains filter this is actually perfect! I must have been doing something wrong, because my query took about 20 seconds. (I'm new to SPARQL; I think it was slow because I copied `?compound wdt:P31/wdt:P279* wd:Q11173` from somewhere.) An alternative I came up with is using https://www.wikidata.org/w/api.php?action=wbsearchentities&search=benzeen&language=nl and filtering out the entries that have 'organic compound' in their description etc., but this is much cleaner.
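For comparison, the wbsearchentities alternative could look roughly like the Python sketch below. The function names and the default keyword tuple are invented for illustration, and the description check is only a heuristic: descriptions may be missing, and they may come back in the search language, so the keywords probably need adjusting per language.

```python
import urllib.parse

API = "https://www.wikidata.org/w/api.php"


def search_url(term, language):
    """Build a wbsearchentities request URL for `term` in `language`."""
    params = {
        "action": "wbsearchentities",
        "search": term,
        "language": language,
        "format": "json",
    }
    return API + "?" + urllib.parse.urlencode(params)


def keep_compounds(response, keywords=("compound",)):
    """Keep (id, label) pairs of hits whose description mentions one of `keywords`.

    `response` is the decoded JSON object returned by the API; hits without a
    description are dropped because there is nothing to test them against.
    """
    kept = []
    for hit in response.get("search", []):
        description = hit.get("description", "").lower()
        if any(keyword in description for keyword in keywords):
            kept.append((hit["id"], hit.get("label", "")))
    return kept
```

The request itself is left out on purpose: fetch `search_url("benzeen", "nl")` with whatever HTTP client the application already uses, decode the JSON, and pass it to `keep_compounds`.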
On Mon, Apr 25, 2016 at 19:53, Egon Willighagen <egon.willighagen@gmail.com> wrote:
-- E.L. Willighagen Department of Bioinformatics - BiGCaT Maastricht University (http://www.bigcat.unimaas.nl/) Homepage: http://egonw.github.com/ LinkedIn: http://se.linkedin.com/in/egonw Blog: http://chem-bla-ics.blogspot.com/ PubList: http://www.citeulike.org/user/egonw/tag/papers ORCID: 0000-0001-7542-0286 ImpactStory: https://impactstory.org/EgonWillighagen
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
What you really want then is a case-insensitive regex; otherwise you will only match labels whose first letter has the same case as your search string. You should also remove the wikibase:label service, as it may slow down the query by about 0.7 sec. The query takes 1.4 sec for me now.
SELECT DISTINCT ?cmpnd ?label WHERE {
  { ?cmpnd wdt:P279 wd:Q11173 . } UNION { ?cmpnd wdt:P31 wd:Q11173 . }
  ?cmpnd rdfs:label ?label .
  FILTER(regex(str(?label), "^a", "i"))
}
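If the prefix comes from what a user types, it has to be escaped twice before it goes into that FILTER: once as a regex, once as a SPARQL string literal. A minimal Python sketch, with hypothetical helper names and an escape list chosen for this example (it covers the usual regex metacharacters, not every corner of the regex dialect SPARQL uses):

```python
import re

QUERY_TEMPLATE = """SELECT DISTINCT ?cmpnd ?label WHERE {
  { ?cmpnd wdt:P279 wd:Q11173 . } UNION { ?cmpnd wdt:P31 wd:Q11173 . }
  ?cmpnd rdfs:label ?label .
  FILTER(regex(str(?label), "^%s", "i"))
}"""

# Characters given a leading backslash so they match literally in the regex.
_REGEX_SPECIAL = re.compile(r"([\\|.\-^?*+{}()\[\]$])")


def regex_escape(text):
    """Backslash-escape regex metacharacters in `text`."""
    return _REGEX_SPECIAL.sub(r"\\\1", text)


def sparql_string_escape(text):
    """Escape backslashes, double quotes and line breaks for a "..." literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def prefix_query(user_input):
    """Return the case-insensitive prefix query for an arbitrary user string."""
    return QUERY_TEMPLATE % sparql_string_escape(regex_escape(user_input))
```

With this, `prefix_query("a")` reproduces the query above, while a name such as "1,2-diol (x)" is sent with its hyphen and parentheses matched literally rather than interpreted as regex syntax.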
cheers, Sebastian
On Mon, Apr 25, 2016 at 12:16 PM, Herman Bergwerf hermanbergwerf@gmail.com wrote: