Thank you so much for the insight, David!
On Fri, Nov 5, 2021 at 5:55 AM David Causse wrote:
Hi Thad,
I looked at this query and I have nothing to add to what was already suggested to make it run faster. I think the main issue is the size of the intermediate results that must have the language filter applied. Sadly, almost every time a FILTER is used on a string literal, Blazegraph might have to fetch its representation from its lexicon, which incurs a huge slowdown. Regarding indices and ordering, I believe the right indices are being used, otherwise the query would certainly time out; I doubt it can filter all English labels before joining them to the property labels.
The criterion ?prop wdt:P31/wdt:P279* wd:Q18616576 does indeed seem useless to me and is pulling a couple of false positives[1] into the join (totally harmless regarding query performance, but they should perhaps be cleaned up in Wikidata?).
So filtering & fetching the textual data is indeed what makes this query slow. I tried various combinations but could not come up with reasonable & stable sub-second response times. Fetching the textual data (possibly lazily) from another service might help, but this would certainly be a substantial rewrite of the client relying on this query.
Caching is definitely going to help, especially if this data is not subject to rapid or frequent changes. The WDQS infrastructure has a caching layer, but retention might not be long enough to be useful for this particular tool. The JSON output does seem quite big (almost 5 MB); while not enormous, it is still sizeable, and if this data is relatively stable there might be value in refreshing it deliberately (daily, as you suggest) and making it available on static storage.
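A minimal sketch of such a daily refresh, in Python, assuming the client can read the result from a local JSON file. The names `load_cached` and `fetch_from_wdqs`, the cache file path and the User-Agent string are hypothetical and are not taken from the Property Explorer tool; the SPARQL query itself is passed in as a parameter and is not reproduced here.

```python
import json
import time
import urllib.parse
import urllib.request
from pathlib import Path
from typing import Callable

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"
MAX_AGE_SECONDS = 24 * 60 * 60  # assumed refresh interval: once a day


def fetch_from_wdqs(query: str) -> dict:
    """Run a SPARQL query against WDQS and return the parsed JSON result."""
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(url, headers={
        "Accept": "application/sparql-results+json",
        # Hypothetical identifier; replace with your tool name and contact.
        "User-Agent": "property-cache-sketch/0.1 (hypothetical)",
    })
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


def load_cached(
    path: Path,
    fetch: Callable[[], dict],
    max_age: float = MAX_AGE_SECONDS,
    now: Callable[[], float] = time.time,
) -> dict:
    """Return the cached result, refetching when the file is missing or stale."""
    if path.exists() and now() - path.stat().st_mtime < max_age:
        return json.loads(path.read_text(encoding="utf-8"))
    data = fetch()
    # Write to a temporary file first, then rename, so that a reader
    # never sees a partially written cache file.
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(data), encoding="utf-8")
    tmp.replace(path)
    return data
```

A caller would use it roughly as `load_cached(Path("properties.json"), lambda: fetch_from_wdqs(query))`. Passing `fetch` as a callable keeps the caching logic independent of the query service, so the same function could also sit behind a scheduled job that publishes the file to static storage.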
Another note about response times: you may see varying response times from the query service, and the reason might be one of the following:
- it's cached on the query service caching layer (generally sub-100 ms response time)
- the server the query hits is heavily loaded
- the server the query hits is from an older hardware generation (we have 2 different kinds of hardware setup in the cluster at the moment, which might explain some of the variance you see).
Hope it helps a bit,
On Wed, Nov 3, 2021 at 11:39 PM Thad Guidry wrote:
Thanks Kingsley, Thomas, Jeff,
From what I see, the live query is never sub-second, and that's likely because of 2 things:
1. indexing not prioritizing this kind of query and aligning to it (David Causse might know if that could be changed); essentially it's metadata about Wikidata (its available properties).
2. it's 2.2 MB of data
I think Yi Liu's Wikidata Property Explorer service might then want to cache the results for 24 hours instead, for the best of both worlds.
To be fair, the raw amount of data requested seems to be approximately 2.2 MB and so probably should be locally cached by his tool for some determined time (like 24 hours).
Wikidata mailing list -- To unsubscribe send an email to