On Wed, 22 Feb 2023 at 00:03, Kingsley Idehen via Wikidata wrote:
> On 2/21/23 4:05 PM, Guillaume Lederrey wrote:
> > The exposed SPARQL endpoint is at the moment a direct exposition of
> > the Blazegraph endpoint, so it does expose all the Blazegraph specific
> > features and quirks.
>
> Is there a Query Service that's separated from the Blazegraph endpoint?
> The crux of the matter here is that WDQS benefits more by being
> loosely-bound to endpoints rather than tightly-bound to the Blazegraph
> endpoint.
>
> > What we would like to do at some point (this is not more than a rough
> > idea at this point) is to add a proxy in front of the SPARQL endpoint,
> > that would filter specific SPARQL features, so that we limit what is
> > available to a standard set of features available across most
> > potential backends. This would help reduce the coupling of queries
> > with the backend. Of course, this would have the drawback of limiting
> > the feature set.

I have to say I am a bit concerned by this talk, since some of
Blazegraph's "features and quirks" can be exceedingly useful.
In particular I would highlight **named subqueries** and **Blazegraph's
bd:sample service** as two "features and quirks" which should not be
suppressed lightly.
Use of named subqueries (i.e. queries that include an "INCLUDE %subquery"
line) is consistently popular in the "query of the week" example queries
featured in the weekly summary, and for good reasons:
* they can make complex long queries far more readable
* they can make optimisation of complex long queries a lot easier and a
lot more transparent (or even possible at all)
* they can be essential to the performance of some queries, when a
query retrieves a particular set of results once and then reuses it in
more than one way.
The Blazegraph syntax for this is elegant. Ideally the dev teams of
candidate replacements should be encouraged to support it. Failing that,
at the very least, a preprocessor should be written to adapt queries
that use an INCLUDE directive, so that existing queries can continue
to run.
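
A minimal sketch of what such a preprocessor might look like, in Python. It assumes the Blazegraph form `WITH { ... } AS %name` followed by `INCLUDE %name`, and rewrites each INCLUDE into an ordinary SPARQL subquery. The function names and the sample query are hypothetical, chosen only for illustration. The brace scanner does not understand string literals or comments, so a real tool would need a proper SPARQL parser. Inlining also means the subquery may be evaluated once per INCLUDE rather than once overall, so the performance benefit described above could be lost.

```python
import re

WITH_RE = re.compile(r"\bWITH\s*\{", re.IGNORECASE)
AS_RE = re.compile(r"\s*AS\s+%(\w+)", re.IGNORECASE)
INCLUDE_RE = re.compile(r"\bINCLUDE\s+%(\w+)", re.IGNORECASE)


def _matching_brace(text: str, open_idx: int) -> int:
    """Return the index of the '}' that closes the '{' at open_idx."""
    depth = 0
    for i in range(open_idx, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return i
    raise ValueError("unbalanced braces in query")


def inline_named_subqueries(query: str) -> str:
    """Replace 'WITH {...} AS %x' / 'INCLUDE %x' with plain subqueries."""
    subqueries = {}
    # Step 1: cut every 'WITH { body } AS %name' out of the query text.
    while True:
        m = WITH_RE.search(query)
        if m is None:
            break
        open_idx = m.end() - 1
        close_idx = _matching_brace(query, open_idx)
        as_match = AS_RE.match(query, close_idx + 1)
        if as_match is None:
            raise ValueError("expected 'AS %name' after WITH { ... }")
        subqueries[as_match.group(1)] = query[open_idx + 1:close_idx].strip()
        query = query[:m.start()] + query[as_match.end():]

    def expand(match):
        name = match.group(1)
        if name not in subqueries:
            raise ValueError(f"INCLUDE of unknown subquery %{name}")
        return "{ " + subqueries[name] + " }"

    # Step 2: expand INCLUDEs, repeating so that subqueries which
    # themselves INCLUDE another named subquery are expanded too.
    for _ in range(len(subqueries) + 1):
        if not INCLUDE_RE.search(query):
            return query
        query = INCLUDE_RE.sub(expand, query)
    raise ValueError("named subqueries include each other in a cycle")


# Hypothetical example query, for illustration only.
EXAMPLE = """SELECT ?item WITH {
  SELECT ?item WHERE { ?item wdt:P31 wd:Q146 } LIMIT 10
} AS %cats
WHERE {
  INCLUDE %cats
}"""

if __name__ == "__main__":
    print(inline_named_subqueries(EXAMPLE))
```

Running the script prints the example query with the WITH clause removed and the subquery placed inside braces at the point where INCLUDE appeared.
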
In contrast, bd:sample is perhaps under-used, under-appreciated and
not so well known, but it can also be very valuable.
It allows a query writer to get a genuinely random sample of the
uses of a particular triple pattern.
For example, here's a query (https://w.wiki/6NHo) that I was asked for
recently, which finds the most common classes of items used as values
for P180 'depicts' statements on Commons.
Sampling is essential here because there are now in excess of 19.8
million P180 statements on Commons. It becomes even more so because
of the federated nature of the query, which means that only a few tens
of thousands of values at most can be passed for analysis into any
subquery to be run on wdqs against wikidata.
A feature like bd:sample is the only way to do this kind of
analysis of structured data statements on Commons.
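
As a rough illustration of the shape of such a query (not the query linked above), the sketch below builds a bd:sample clause from Python. The `bd:sample.limit` and `bd:sample.sampleType` parameters are written as I understand them from the Blazegraph documentation, and the `bd:` and `wdt:` prefixes are assumed to be predefined by the endpoint. The function names and the default sample size are hypothetical.

```python
import textwrap


def sample_block(triple_pattern: str, limit: int) -> str:
    """Wrap one triple pattern in a Blazegraph bd:sample SERVICE clause."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return (
        "SERVICE bd:sample {\n"
        f"  {triple_pattern} .\n"
        f"  bd:serviceParam bd:sample.limit {limit} .\n"
        '  bd:serviceParam bd:sample.sampleType "RANDOM" .\n'
        "}"
    )


def depicts_sample_query(limit: int = 10000) -> str:
    """Count depicted values over a random sample of P180 statements."""
    inner = textwrap.indent(sample_block("?file wdt:P180 ?value", limit), "  ")
    return (
        "SELECT ?value (COUNT(*) AS ?uses) WHERE {\n"
        + inner
        + "\n}\n"
        "GROUP BY ?value\n"
        "ORDER BY DESC(?uses)"
    )


if __name__ == "__main__":
    print(depicts_sample_query())
```

The limit is what keeps the sampled set within the few tens of thousands of values that can be handed on to a federated subquery.
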
I regard named subqueries and bd:sample as particularly important. But
beyond them, we need to make sure that any 'filter' does not remove
Blazegraph optimiser directives. If those don't get through to
Blazegraph, many queries that rely on them simply will not run
(especially if named subqueries have also been made unavailable).
Ways also need to be found to make sure that the geographical services
wikibase:around() and wikibase:box() continue to be available, along
with the distance function geof:distance() and the mwapi and labelling
services.
Best regards,
James.