Differences in label searching with SPARQL and MediaWiki API

Thad Guidry

11 Jul 2020

5:11 p.m.

This query times out:

SELECT ?item ?label WHERE {
  ?item wdt:P31 ?instance ;
        rdfs:label ?label ;
        rdfs:label ?enLabel .
  FILTER(CONTAINS(lcase(?label), "Soriano")).
  FILTER(?instance != wd:Q5).
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100

I have this feeling that it's not actually using an index or even asking the right question and so is slow and times out?

However the MediaWiki wbsearchentities API does seem to use an index and is performant for label searching: https://www.wikidata.org/w/api.php?action=wbsearchentities&search=sorian...

How can I get my SPARQL query to be more performant or ask the right question?

(BTW, once I have some answers, I intend to make this page a bit more useful with that information for other users ... https://www.wikidata.org/wiki/Wikidata:Data_access)

Thad https://www.linkedin.com/in/thadguidry/

David Causse

13 Jul 2020

7:25 a.m.

On Sat, Jul 11, 2020 at 7:12 PM Thad Guidry thadguidry@gmail.com wrote:

I have this feeling that it's not actually using an index or even asking the right question and so is slow and times out?

Indeed, none of the criteria in your query lets the triple store pick an index to follow so that it can extract the results in a timely manner. The sole non-negative criterion is FILTER(CONTAINS(lcase(?label), "Soriano")), but because it sits in a FILTER, and is moreover a function call, it cannot be used to choose an index to work on. The only way to speed up your query would be to introduce a discriminating "matching" criterion.

However the MediaWiki wbsearchentities API does seem to use an index and is performant for label searching: https://www.wikidata.org/w/api.php?action=wbsearchentities&search=sorian...

wbsearchentities is backed by Elasticsearch, which is optimized for such lookups.

How can I get my SPARQL query to be more performant or ask the right question?

Unfortunately I don't see an obvious way to adapt your SPARQL query and keep exactly the same semantics, but to illustrate the problem:

SELECT ?item ?label WHERE {
  ?item wdt:P31 ?instance ;
        rdfs:label "Soriano"@en .
  FILTER(?instance != wd:Q5).
}
LIMIT 100

will return results in a timely manner, only because we helped the graph traversal with an initial path on ?item rdfs:label "Soriano"@en.
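To make the point about an initial path concrete, here is a minimal sketch that keeps the exact-match pattern but lets a few label variants drive the lookup through a VALUES clause. The language tags listed are illustrative assumptions, not taken from the thread, and the query relies on the prefixes that WDQS predeclares. Exact matching remains case- and language-sensitive, so this does not reproduce a substring search.

```sparql
SELECT ?item ?label WHERE {
  # Assumed label variants: each is an exact, language-tagged literal
  # that the triple store can look up directly.
  VALUES ?label { "Soriano"@en "Soriano"@es "Soriano"@it }
  ?item rdfs:label ?label ;
        wdt:P31 ?instance .
  # Same negative criterion as in the thread; it now only filters
  # the small set of items found through the label lookup.
  FILTER(?instance != wd:Q5)
}
LIMIT 100
```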

But by combining the query service and the Wikidata API[0] backed by Elasticsearch, I think you can extract what you want:

SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 ?instance .
  FILTER(?instance != wd:Q5).
  SERVICE wikibase:mwapi {
    bd:serviceParam wikibase:endpoint "www.wikidata.org";
                    wikibase:api "EntitySearch";
                    mwapi:search "soriano";
                    mwapi:language "en".
    ?item wikibase:apiOutputItem mwapi:item.
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100

This query first contacts EntitySearch (an alias for wbsearchentities), which passes the items it found to the triple store, which in turn can now query the graph in a timely manner. Obviously this solution only works if the number of items returned by wbsearchentities remains reasonable.

[0] https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual/MWAPI

-- David C.

Thad Guidry

17 Jul 2020

5:37 p.m.

Thank you so much David!

This was such a great example that I had to add this to our SPARQL Examples page in a new section "Mediawiki API": https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples#Mediawiki_API

The community thanks you sincerely!

Thad Guidry

6 Aug 2020

9:26 p.m.

Hi David Causse,

Curious why https://www.wikidata.org/wiki/Q24033349 is not being returned in the below SPARQL?

https://w.wiki/YwL

SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 ?instance .
  FILTER(?instance != wd:Q13442814).
  SERVICE wikibase:mwapi {
    bd:serviceParam wikibase:endpoint "www.wikidata.org";
                    wikibase:api "EntitySearch";
                    mwapi:search "front matter";
                    mwapi:language "en".
    ?item wikibase:apiOutputItem mwapi:item.
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100

David Causse

7 Aug 2020

3:38 p.m.

Hi Thad,

I think it's not being returned because Q24033349 has no wdt:P31 statement, so the part:

?item wdt:P31 ?instance . FILTER(?instance != wd:Q13442814).

not only filters out the wd:Q13442814 matches but also excludes every item that has no P31 statement at all. You can rewrite your query to take this into account using the MINUS keyword: https://w.wiki/Yzt .
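The query behind the short link is not reproduced in the thread. A minimal sketch of what such a MINUS rewrite might look like, assuming the intent is to drop only items that have wd:Q13442814 as a P31 value while keeping items with no P31 at all (prefixes are the ones WDQS predeclares):

```sparql
SELECT ?item ?itemLabel WHERE {
  # Let the search API bind ?item first.
  SERVICE wikibase:mwapi {
    bd:serviceParam wikibase:endpoint "www.wikidata.org";
                    wikibase:api "EntitySearch";
                    mwapi:search "front matter";
                    mwapi:language "en".
    ?item wikibase:apiOutputItem mwapi:item.
  }
  # Remove only the items for which this pattern matches.
  # Items without any P31 statement are left untouched.
  MINUS { ?item wdt:P31 wd:Q13442814 . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
```

Note that the MINUS has to come after the pattern that binds ?item, since it subtracts from the solutions built so far. It also behaves slightly differently from the FILTER version for items with several P31 values: an item that has wd:Q13442814 alongside another P31 value is removed here, whereas the FILTER keeps it through its other value.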

Hope it helps,

David.

Thad Guidry

4:06 p.m.

Very nice David!

1. Does the MINUS actually utilize ElasticSearch indexes or just Blazegraph?

I'd like to help the community by writing up somewhat better documentation on our SPARQL pages that talks about FILTER() versus MINUS(), if no one has this info floating around. The only footnote I saw was: "MINUS lets you select results that *don’t* fit some graph pattern. FILTER NOT EXISTS is mostly equivalent (see the SPARQL spec for an example where they differ), but – at least on WDQS – usually slower by quite a bit." at the bottom of the SPARQL tutorial https://www.wikidata.org/wiki/Wikidata:SPARQL_tutorial

The wiki page SPARQL query service https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries#Excluding_subsets has:

Excluding subsets

SPARQL has three different idioms for excluding subsets:

- OPTIONAL { ... ?x ... } FILTER(!bound(?x)),
- FILTER NOT EXISTS { ... }
- MINUS { ... }

Currently, in almost all circumstances, Blazegraph resolves all of these to the same query plan.

2. Is it still true that the three idioms above currently use the same query plan?

David Causse

5:30 p.m.

Some answers inline,

On Fri, Aug 7, 2020 at 6:07 PM Thad Guidry thadguidry@gmail.com wrote:

Does the MINUS actually utilize ElasticSearch indexes or just Blazegraph?

No, Elasticsearch is used only during the call to the wikibase:mwapi SERVICE. Everything that happens outside this call is handled by Blazegraph.

Is that still a true statement that those 3 above use the same query plan currently?

I think they do serve the same purpose but can differ in subtle ways: for MINUS vs FILTER NOT EXISTS, the SPARQL spec states that they can produce different solutions (https://www.w3.org/TR/sparql11-query/#neg-notexists-minus). As to which approach is better, I can't answer clearly; I tend to prefer MINUS as I find it easier to read and understand. I also tend to avoid a plain FILTER(constraint on ?x) when possible, as these tend to be rather slow (here the FILTER(!bound(?x)) should be pretty fast though).
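For reference, the three idioms can be written against the same exclusion used earlier in the thread. This is a sketch only: the variable name ?match is made up for illustration, the prefixes are the ones WDQS predeclares, and nothing here says which form Blazegraph executes fastest. Idiom 1 is active; the other two are shown as commented-out alternatives that could replace it.

```sparql
SELECT ?item ?itemLabel WHERE {
  # Bind ?item through the search API, as in the earlier queries.
  SERVICE wikibase:mwapi {
    bd:serviceParam wikibase:endpoint "www.wikidata.org";
                    wikibase:api "EntitySearch";
                    mwapi:search "front matter";
                    mwapi:language "en".
    ?item wikibase:apiOutputItem mwapi:item.
  }

  # Idiom 1: OPTIONAL + FILTER(!bound(...)).
  # ?match is bound only when the item has the unwanted P31 value.
  OPTIONAL { ?item wdt:P31 ?match . FILTER(?match = wd:Q13442814) }
  FILTER(!bound(?match))

  # Idiom 2 (alternative to idiom 1):
  # FILTER NOT EXISTS { ?item wdt:P31 wd:Q13442814 . }

  # Idiom 3 (alternative to idiom 1):
  # MINUS { ?item wdt:P31 wd:Q13442814 . }

  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
```

Because the inner pattern shares ?item with the outer query, all three forms should select the same items here. The difference described in the linked spec section shows up when the inner pattern shares no variable with the rest of the query: MINUS then removes nothing, while FILTER NOT EXISTS can remove every solution.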

David.

Thad Guidry

6:40 p.m.

David,

Thank you so much. This is very helpful and I've improved the wiki docs in a few places with this new information.
