This query times out:
SELECT ?item ?label WHERE {
  ?item wdt:P31 ?instance ;
        rdfs:label ?label ;
        rdfs:label ?enLabel .
  FILTER(CONTAINS(lcase(?label), "Soriano")).
  FILTER(?instance != wd:Q5).
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
I have a feeling that it isn't actually using an index, or isn't even asking the right question, and so is slow and times out?
However, the MediaWiki wbsearchentities API does seem to use an index and is performant for label searching: https://www.wikidata.org/w/api.php?action=wbsearchentities&search=sorian...
How can I make my SPARQL query more performant, or get it to ask the right question?
(BTW, once I have some answers, I intend to make this page a bit more useful with that information for other users ... https://www.wikidata.org/wiki/Wikidata:Data_access)
On Sat, Jul 11, 2020 at 7:12 PM Thad Guidry thadguidry@gmail.com wrote:
This query times out:
SELECT ?item ?label WHERE { ?item wdt:P31 ?instance ; rdfs:label ?label ; rdfs:label ?enLabel . FILTER(CONTAINS(lcase(?label), "Soriano")). FILTER(?instance != wd:Q5). SERVICE wikibase:label {bd:serviceParam wikibase:language "en".} } LIMIT 100
I have this feeling that it's not actually using an index or even asking the right question and so is slow and times out?
Indeed, none of the criteria in your query lets the triple store pick an index to follow so that it can extract the results in a timely manner. The only non-negative criterion is FILTER(CONTAINS(lcase(?label), "Soriano")), but because it sits in a FILTER and is moreover a function call, it cannot be used to choose an index to work on. The only way to speed up your query would be to introduce a discriminating "matching" criterion.
However the MediaWiki wbsearchentities API does seem to use an index and is
performant for label searching:
https://www.wikidata.org/w/api.php?action=wbsearchentities&search=sorian...
wbsearchentities is backed by Elasticsearch, which is optimized for such lookups.
How can I get my SPARQL query to be more performant or asking the right
question?
Unfortunately, I don't see an obvious way to adapt your SPARQL query and keep exactly the same semantics, but to illustrate the problem:
SELECT ?item ?label WHERE {
  ?item wdt:P31 ?instance ;
        rdfs:label "Soriano"@en .
  FILTER(?instance != wd:Q5).
}
LIMIT 100
will return results in a timely manner, only because we helped the graph traversal with an initial path on ?item rdfs:label "Soriano"@en.
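To make the difference concrete, here is a minimal sketch in Python. It is a toy in-memory "triple store", not Blazegraph, and the item IDs, labels, and function names are invented for illustration. It only shows why an exact label can be answered from an index keyed on whole values, while a substring test has to visit every label.

```python
from collections import defaultdict

# (subject, predicate, object) triples; IDs and labels are made up.
TRIPLES = [
    ("Q1", "rdfs:label", "Soriano"),
    ("Q2", "rdfs:label", "Alfonso Soriano"),
    ("Q3", "rdfs:label", "Montevideo"),
    ("Q1", "wdt:P31", "Q100"),
]

# Index keyed on (predicate, object): an exact label is a direct lookup.
PO_INDEX = defaultdict(set)
for s, p, o in TRIPLES:
    PO_INDEX[(p, o)].add(s)


def exact_label(label):
    """Analogue of `?item rdfs:label "Soriano"@en`: one index lookup."""
    return sorted(PO_INDEX.get(("rdfs:label", label), set()))


def contains_label(needle):
    """Analogue of FILTER(CONTAINS(LCASE(?label), ...)): every label is
    visited, because the index is keyed on whole values, not substrings."""
    scanned = 0
    hits = set()
    for s, p, o in TRIPLES:
        if p != "rdfs:label":
            continue
        scanned += 1
        if needle.lower() in o.lower():
            hits.add(s)
    return sorted(hits), scanned


print(exact_label("Soriano"))
print(contains_label("soriano"))
```

With three labels the scan is harmless; the point is only that its cost grows with the total number of labels, whereas the exact lookup does not.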
But by combining the query service and the Wikidata API[0] backed by Elasticsearch, I think you can extract what you want:
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 ?instance .
  FILTER(?instance != wd:Q5).
  SERVICE wikibase:mwapi {
    bd:serviceParam wikibase:endpoint "www.wikidata.org";
                    wikibase:api "EntitySearch";
                    mwapi:search "soriano";
                    mwapi:language "en".
    ?item wikibase:apiOutputItem mwapi:item.
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
This query first contacts EntitySearch (an alias for wbsearchentities), which passes the items it found to the triple store; the triple store can then query the graph in a timely manner. Obviously, this solution only works if the number of items returned by wbsearchentities remains reasonable.
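The same "search first, then query the graph" idea can also be done by hand on the client side. The sketch below is not the wikibase:mwapi service itself but a swapped-in, two-step equivalent: it builds a wbsearchentities URL, and then builds a SPARQL query restricted to the returned IDs with a VALUES clause. The function names and the User-Agent string are hypothetical; the HTTP call is defined but never run here.

```python
import json
import urllib.parse
import urllib.request

API = "https://www.wikidata.org/w/api.php"
USER_AGENT = "label-search-sketch/0.1 (example only)"  # hypothetical value


def search_url(text, language="en", limit=50):
    """Build the wbsearchentities request URL (no network access)."""
    params = {
        "action": "wbsearchentities",
        "search": text,
        "language": language,
        "limit": limit,
        "format": "json",
    }
    return API + "?" + urllib.parse.urlencode(params)


def search_ids(text, language="en", limit=50):
    """Step 1: ask the search API for matching item IDs (does HTTP)."""
    req = urllib.request.Request(search_url(text, language, limit),
                                 headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        payload = json.load(resp)
    return [hit["id"] for hit in payload.get("search", [])]


def restricted_query(ids, excluded_class="Q5"):
    """Step 2: build a SPARQL query that only looks at the given items."""
    values = " ".join("wd:" + qid for qid in ids)
    return (
        "SELECT ?item ?itemLabel WHERE {\n"
        "  VALUES ?item { " + values + " }\n"
        "  ?item wdt:P31 ?instance .\n"
        "  FILTER(?instance != wd:" + excluded_class + ")\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}"
    )
```

A caller would pass the result of search_ids("soriano") to restricted_query() and submit the resulting text to the query service. As with the mwapi version, this only stays fast while the list of IDs is small.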
0: https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual/MWAPI
-- David C.
Thank you so much David!
This was such a great example that I had to add it to our SPARQL Examples page in a new section, "Mediawiki API": https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples#Mediawiki_API
The community thanks you sincerely!
Thad https://www.linkedin.com/in/thadguidry/
Hi David Causse,
Curious why https://www.wikidata.org/wiki/Q24033349 is not being returned in the below SPARQL?
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 ?instance .
  FILTER(?instance != wd:Q13442814).
  SERVICE wikibase:mwapi {
    bd:serviceParam wikibase:endpoint "www.wikidata.org";
                    wikibase:api "EntitySearch";
                    mwapi:search "front matter";
                    mwapi:language "en".
    ?item wikibase:apiOutputItem mwapi:item.
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
Thad https://www.linkedin.com/in/thadguidry/
Hi Thad,
I think it's not being returned because Q24033349 has no wdt:P31 statement, and thus the part:
?item wdt:P31 ?instance . FILTER(?instance != wd:Q13442814).
while excluding all items that are wd:Q13442814, will also exclude all items that have no P31 statement. You can rewrite your query to take this into account using the MINUS keyword: https://w.wiki/Yzt .
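The effect can be simulated on toy data. The sketch below does not reproduce the query behind the short link; it assumes a MINUS rewrite of the form MINUS { ?item wdt:P31 wd:Q13442814 } and uses invented item IDs. It models the two patterns as plain Python set logic.

```python
# Toy data, not real Wikidata: item -> set of P31 ("instance of") values.
P31 = {
    "QA": {"Q13442814"},          # only the excluded class
    "QB": {"Q5"},                 # some other class
    "QC": set(),                  # no P31 statement at all
    "QD": {"Q13442814", "Q5"},    # excluded class plus another one
}
EXCLUDED = "Q13442814"


def join_then_filter(items):
    """Model of: ?item wdt:P31 ?instance . FILTER(?instance != wd:EXCLUDED)"""
    kept = set()
    for item, classes in items.items():
        for instance in classes:      # no P31 value means no row to filter
            if instance != EXCLUDED:
                kept.add(item)
    return sorted(kept)


def minus_pattern(items):
    """Model of: MINUS { ?item wdt:P31 wd:EXCLUDED } (assumed rewrite)"""
    return sorted(item for item, classes in items.items()
                  if EXCLUDED not in classes)


print(join_then_filter(P31))
print(minus_pattern(P31))
```

In this model the join-then-filter form drops QC (no P31 at all) and keeps QD through its other P31 value, while the MINUS form keeps QC and drops QD.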
Hope it helps,
David.
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Very nice David!
1. Does the MINUS actually utilize ElasticSearch indexes or just Blazegraph?
I'd like to help the community by writing up somewhat better documentation on our SPARQL pages that covers FILTER() versus MINUS(), if no one already has this information floating around. The only footnote I saw was: "MINUS lets you select results that *don’t* fit some graph pattern. FILTER NOT EXISTS is mostly equivalent (see the SPARQL spec for an example where they differ), but – at least on WDQS – usually slower by quite a bit." It is at the bottom of the SPARQL tutorial
https://www.wikidata.org/wiki/Wikidata:SPARQL_tutorial and the wiki page for the SPARQL query service https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries#Excluding_subsets has:
Excluding subsets
SPARQL has three different idioms for excluding subsets:
- OPTIONAL { ... ?x ... } FILTER(!bound(?x)),
- FILTER NOT EXISTS { ... }
- MINUS { ... }
Currently, in almost all circumstances, Blazegraph resolves all of these to the same query plan.
2. Is it still true that the three idioms above currently resolve to the same query plan?
Some answers inline,
On Fri, Aug 7, 2020 at 6:07 PM Thad Guidry thadguidry@gmail.com wrote:
Very nice David!
- Does the MINUS actually utilize ElasticSearch indexes or just
Blazegraph?
No, Elasticsearch is used only during the call to the wikibase:mwapi SERVICE. Everything that happens outside this call is handled by Blazegraph.
- Is that still a true statement that those 3 above use the same query
plan currently?
I think they do serve the same purpose but may vary in subtle ways. For MINUS vs FILTER NOT EXISTS, the SPARQL specification states that they can produce different solutions: https://www.w3.org/TR/sparql11-query/#neg-notexists-minus. As to which approach is better, I can't answer clearly. I tend to prefer MINUS, as I find it easier to read and understand. I also tend to avoid a plain FILTER(constraint on ?x) when possible, as such filters tend to be rather slow (here, though, the FILTER(!bound(?x)) should be pretty fast).
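The case where the two differ, as described in the linked section of the specification, is when the excluded pattern shares no variable with the rest of the query. The sketch below is a simplified Python model of solution mappings, not how Blazegraph evaluates anything. The function names are invented, and FILTER NOT EXISTS is modelled only for a plain graph pattern with no nested filters.

```python
# A solution is a dict mapping variable name -> value.


def compatible(a, b):
    """Two solutions are compatible if they agree on every shared variable."""
    return all(a[v] == b[v] for v in a.keys() & b.keys())


def minus(left, right):
    """MINUS: drop a left solution only if some right solution is
    compatible with it AND shares at least one variable with it."""
    return [m for m in left
            if not any(m.keys() & r.keys() and compatible(m, r)
                       for r in right)]


def filter_not_exists(left, right):
    """FILTER NOT EXISTS (simplified): drop a left solution if the inner
    pattern still has a match after substituting the left bindings."""
    return [m for m in left if not any(compatible(m, r) for r in right)]


# One triple in the data, :a :b :c, matched by two differently named patterns.
outer = [{"s": "a", "p": "b", "o": "c"}]        # ?s ?p ?o
inner_disjoint = [{"x": "a", "y": "b", "z": "c"}]  # ?x ?y ?z
inner_shared = [{"s": "a"}]                     # a pattern that binds ?s

print(minus(outer, inner_disjoint))
print(filter_not_exists(outer, inner_disjoint))
print(minus(outer, inner_shared))
```

With no shared variable, MINUS removes nothing while FILTER NOT EXISTS removes every solution; once a variable is shared, both behave the same way in this model.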
David.
David,
Thank you so much. This is very helpful and I've improved the wiki docs in a few places with this new information.
Thad https://www.linkedin.com/in/thadguidry/