Hi David and team,
In Yi Liu's tool, Wikidata Property Explorer, I noticed that the query performance could ideally be better. Currently the query takes about 9 seconds, and I'm asking if there might be anything that could reduce that considerably. Refactoring the query for optimization, backend changes, anything you can think of, David?
SELECT DISTINCT ?prop ?label ?desc ?type
       (GROUP_CONCAT(DISTINCT ?alias; SEPARATOR = " | ") AS ?aliases)
WHERE {
  ?prop wdt:P31/wdt:P279* wd:Q18616576;
        wikibase:propertyType ?type.
  OPTIONAL { ?prop rdfs:label ?label. FILTER(LANG(?label) = "en") }
  OPTIONAL { ?prop schema:description ?desc. FILTER(LANG(?desc) = "en") }
  OPTIONAL { ?prop skos:altLabel ?alias. FILTER(LANG(?alias) = "en") }
}
GROUP BY ?prop ?label ?desc ?type
Thad https://www.linkedin.com/in/thadguidry/ https://calendly.com/thadguidry/
The hint:Prior optimization on the property path might help a bit?
https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/query_optimizati...
Jeff
From: Thad Guidry thadguidry@gmail.com
Date: Friday, October 29, 2021 at 10:12 AM
To: David Causse dcausse@wikimedia.org, Discussion list for the Wikidata project. wikidata@lists.wikimedia.org, stevenliuyi@gmail.com
Subject: [External] [Wikidata] Help make this Property Query faster
You can drop the « (wdt:P31/(wdt:P279*)) wd:Q18616576; » part, it's useless. « ?prop wikibase:propertyType ?type. » is enough: https://w.wiki/4Kmp and it's fast.

What seems to be really expensive is the label part: just adding the label (alone) at least triples or quadruples the query time, taking us from less than a second to 3 to 4 seconds.
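Following that suggestion, a trimmed version of the query (a sketch only; I have not benchmarked this exact form against the live endpoint) would keep everything except the P31/P279 path:

```sparql
SELECT DISTINCT ?prop ?label ?desc ?type
       (GROUP_CONCAT(DISTINCT ?alias; SEPARATOR = " | ") AS ?aliases)
WHERE {
  # The class-membership path is dropped: every subject of
  # wikibase:propertyType is already a property.
  ?prop wikibase:propertyType ?type.
  OPTIONAL { ?prop rdfs:label ?label. FILTER(LANG(?label) = "en") }
  OPTIONAL { ?prop schema:description ?desc. FILTER(LANG(?desc) = "en") }
  OPTIONAL { ?prop skos:altLabel ?alias. FILTER(LANG(?alias) = "en") }
}
GROUP BY ?prop ?label ?desc ?type
```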
On Fri, Oct 29, 2021 at 4:12 PM, Thad Guidry thadguidry@gmail.com wrote:
Wikidata mailing list -- wikidata@lists.wikimedia.org
To unsubscribe send an email to wikidata-leave@lists.wikimedia.org
On 10/29/21 10:11 AM, Thad Guidry wrote:
Hi Thad,
Don't know what your expectations are, but here are results from our Wikidata instance:
* Query Solution Page with "Anytime Query" Feature Enabled:
  https://wikidata.demo.openlinksw.com/sparql?default-graph-uri=http%3A%2F%2Fwww.wikidata.org%2F&query=SELECT+DISTINCT+%3Fprop+%3Flabel+%3Fdesc+%3Ftype+%28GROUP_CONCAT%28DISTINCT+%3Falias%3B+SEPARATOR+%3D+%22+%7C+%22%29+AS+%3Faliases%29+WHERE+%7B%0D%0A++%3Fprop+%28wdt%3AP31%2F%28wdt%3AP279*%29%29+wd%3AQ18616576%3B%0D%0A++++wikibase%3ApropertyType+%3Ftype.%0D%0A++OPTIONAL+%7B%0D%0A++++%3Fprop+rdfs%3Alabel+%3Flabel.%0D%0A++++FILTER%28%28LANG%28%3Flabel%29%29+%3D+%22en%22%29%0D%0A++%7D%0D%0A++OPTIONAL+%7B%0D%0A++++%3Fprop+schema%3Adescription+%3Fdesc.%0D%0A++++FILTER%28%28LANG%28%3Fdesc%29%29+%3D+%22en%22%29%0D%0A++%7D%0D%0A++OPTIONAL+%7B%0D%0A++++%3Fprop+skos%3AaltLabel+%3Falias.%0D%0A++++FILTER%28%28LANG%28%3Falias%29%29+%3D+%22en%22%29%0D%0A++%7D%0D%0A%7D%0D%0AGROUP+BY+%3Fprop+%3Flabel+%3Fdesc+%3Ftype%0D%0A&format=text%2Fx-html%2Btr&timeout=360000&signal_void=on&signal_unconnected=on
* Query Solution Page with "Anytime Query" Feature Disabled:
  https://wikidata.demo.openlinksw.com/sparql?default-graph-uri=http%3A%2F%2Fwww.wikidata.org%2F&query=SELECT+DISTINCT+%3Fprop+%3Flabel+%3Fdesc+%3Ftype+%28GROUP_CONCAT%28DISTINCT+%3Falias%3B+SEPARATOR+%3D+%22+%7C+%22%29+AS+%3Faliases%29+WHERE+%7B%0D%0A++%3Fprop+%28wdt%3AP31%2F%28wdt%3AP279*%29%29+wd%3AQ18616576%3B%0D%0A++++wikibase%3ApropertyType+%3Ftype.%0D%0A++OPTIONAL+%7B%0D%0A++++%3Fprop+rdfs%3Alabel+%3Flabel.%0D%0A++++FILTER%28%28LANG%28%3Flabel%29%29+%3D+%22en%22%29%0D%0A++%7D%0D%0A++OPTIONAL+%7B%0D%0A++++%3Fprop+schema%3Adescription+%3Fdesc.%0D%0A++++FILTER%28%28LANG%28%3Fdesc%29%29+%3D+%22en%22%29%0D%0A++%7D%0D%0A++OPTIONAL+%7B%0D%0A++++%3Fprop+skos%3AaltLabel+%3Falias.%0D%0A++++FILTER%28%28LANG%28%3Falias%29%29+%3D+%22en%22%29%0D%0A++%7D%0D%0A%7D%0D%0AGROUP+BY+%3Fprop+%3Flabel+%3Fdesc+%3Ftype%0D%0A&format=text%2Fx-html%2Btr&timeout=0&signal_void=on&signal_unconnected=on
Hope this helps.
Thanks Kingsley, Thomas, Jeff,
From what I see, the live query is never sub-second, and that's likely because of 2 things:
1. indexing not prioritizing this kind of query (which David Causse might know whether that could be changed); essentially it's metadata about Wikidata (its available properties)
2. it's 2.2 MB of data
I think that Yi Liu's Wikidata Property Explorer service then might want to instead cache the results for 24 hours for the best of both worlds.
To be fair, the raw amount of data requested seems to be approximately 2.2 MB and so probably should be locally cached by his tool for some determined time (like 24 hours).
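The 24-hour local cache suggested above could look something like the following sketch. The `fetch` callback is a hypothetical stand-in for the tool's real SPARQL request; only the freshness logic is the point here.

```python
import json
import os
import time

CACHE_TTL = 24 * 60 * 60  # 24 hours, as suggested above


def is_fresh(path, ttl=CACHE_TTL, now=None):
    """True if the cache file exists and is younger than ttl seconds."""
    if not os.path.exists(path):
        return False
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) < ttl


def get_properties(path, fetch):
    """Return cached results if fresh, otherwise re-fetch and store them.

    fetch is a zero-argument callable standing in for the expensive
    (~9 second) SPARQL query; it must return JSON-serializable data.
    """
    if is_fresh(path):
        with open(path) as f:
            return json.load(f)
    data = fetch()
    with open(path, "w") as f:
        json.dump(data, f)
    return data
```

With this in place the slow query runs at most once per day per instance, and every other page load is served from local disk.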
Thad https://www.linkedin.com/in/thadguidry/ https://calendly.com/thadguidry/
Hi Thad,
I looked at this query and I have nothing to add to what was already suggested to make it run faster. I think the main issue is the size of the intermediate results that the language filter has to be applied to; sadly, almost every time a FILTER is used on a string literal, Blazegraph might have to fetch its representation from its lexicon, which incurs a huge slowdown. Regarding indices and ordering, I believe the right indices are being used, otherwise the query would certainly time out; I doubt it can filter all English labels before joining them to the property labels.

The criterion ?prop wdt:P31/wdt:P279* wd:Q18616576 does indeed seem useless to me, and it is pulling a couple of false positives[1] into the join (totally harmless for query performance, but perhaps something to clean up in Wikidata?).

So filtering & fetching the textual data is indeed what makes this query slow. I tried various combinations but could not come up with reasonable & stable sub-second response times. Fetching the textual data (possibly lazily) from another service might help, but that would be a substantial rewrite of the client relying on this query.

Caching is definitely going to help, especially if this data is not subject to rapid/frequent changes. The WDQS infrastructure has a caching layer, but its retention might not be long enough to be useful for this particular tool. The JSON output does seem quite big (almost 5 MB); while not enormous, it's still substantial, and if this data is relatively stable there might be value in refreshing it deliberately (daily, as you suggest) and making it available on static storage.

Another note about response times: you may see varying response times from the query service, for one of the following reasons:
- the result is cached in the query service caching layer (generally sub-100 ms response time)
- the server the query hits is heavily loaded
- the server the query hits is old-generation hardware (we have 2 different kinds of hardware setups in the cluster at the moment, which might explain some of the variance you see)
Hope it helps a bit,
Regards,
David.
On Wed, Nov 3, 2021 at 11:39 PM Thad Guidry thadguidry@gmail.com wrote:
Thank you so much for the insight David!
Thad https://www.linkedin.com/in/thadguidry/ https://calendly.com/thadguidry/
On Fri, Nov 5, 2021 at 5:55 AM David Causse dcausse@wikimedia.org wrote:
Wikidata has a huge number of labels in a large number of languages. Could indexing strategies based on the language of the string literal be a good thing? It's an RDF choice to encode the language in the literal, and it might indeed not be the best choice for performance. But shouldn't a query planner/rewriter be able to detect a pattern like « FILTER(LANG(?label) = "en") » and take advantage of such an index?

Retrieving labels is important in general, and doing it efficiently might be something that makes a difference…
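As an aside, WDQS already offers one way to avoid explicit lang() filters for labels and descriptions: the label service. A sketch of the query using it (note this covers labels and descriptions only; the GROUP_CONCAT over aliases would still need the explicit skos:altLabel pattern, so this is not a drop-in replacement):

```sparql
SELECT ?prop ?propLabel ?propDescription ?type WHERE {
  ?prop wikibase:propertyType ?type.
  # ?propLabel and ?propDescription are bound automatically by the
  # label service based on the ?prop variable name.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
```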
On Fri, Nov 5, 2021 at 11:55 AM, David Causse dcausse@wikimedia.org wrote:
On Fri, Nov 5, 2021 at 3:46 PM Thomas Douillard thomas.douillard@gmail.com wrote:
[...]
But a query planner/rewriter should be able to detect a pattern like «
filter lang() = "en" » to take advantage of such an index ?
With how Blazegraph works, it is hard to apply filters on literals unless the data to filter is "inlined" in the B-tree. Unfortunately, what Blazegraph offers is inlining the whole literal, not just the language tag, and inlining all string literals is definitely not possible for Wikidata: the B-tree would explode. Perhaps there are other triple stores that can inline solely the language tag and would make this query faster?
While I agree that labels are important, they are rarely used as criteria to traverse the graph; they are more often used as a first step to "search" or as a last step to "display". This is why I wonder whether it would not be a better approach to federate multiple data APIs:
- the query service for pure graph traversal
- the Wikidata search APIs for label/alias/description searching
- the upcoming Wikidata REST API for label/alias/description fetching
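The client-side shape of that federation might look like the sketch below: keep the SPARQL query purely structural (property IDs and types), then attach labels fetched separately. The function name and data shapes are hypothetical; in practice the label map would come from the search or REST APIs mentioned above.

```python
def merge_labels(structural_rows, labels):
    """Join structural query results with separately fetched labels.

    structural_rows: list of {"prop": <property id>, "type": <type>} dicts,
        as returned by a labels-free SPARQL query.
    labels: dict mapping property id -> English label; may be incomplete.
    """
    merged = []
    for row in structural_rows:
        merged.append({
            "prop": row["prop"],
            "type": row["type"],
            # A missing label stays None instead of dropping the row,
            # mirroring the OPTIONAL blocks in the original query.
            "label": labels.get(row["prop"]),
        })
    return merged


rows = [{"prop": "P31", "type": "WikibaseItem"},
        {"prop": "P279", "type": "WikibaseItem"}]
labels = {"P31": "instance of"}  # deliberately incomplete
print(merge_labels(rows, labels))
```

The structural query stays cheap and cacheable, and the expensive textual data can be fetched lazily or in bulk from whichever API suits the tool best.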
David.