[Wikidata] Re: Inconsistencies on WDQS data - data reload on WDQS

23 Feb 2023

On 2/23/23 12:19 PM, James Heald wrote:
...
  On Wed, 22 Feb 2023 at 00:03, Kingsley Idehen via Wikidata wrote:

  On 2/21/23 4:05 PM, Guillaume Lederrey wrote:

 The exposed SPARQL endpoint is at the moment a direct exposition of 
 the Blazegraph endpoint, so it does expose all the Blazegraph 
 specific features and quirks.

 Is there a Query Service that's separated from the Blazegraph
 endpoint? The crux of the matter here is that WDQS benefits more from
 being loosely bound to endpoints than from being tightly bound to the
 Blazegraph endpoint.

>
> What we would like to do at some point (this is not more than a 
> rough idea at this point) is to add a proxy in front of the SPARQL 
> endpoint, that would filter specific SPARQL features, so that we 
> limit what is available to a standard set of features available 
> across most potential backends. This would help reduce the coupling 
> of queries with the backend. Of course, this would have the drawback 
> of limiting the feature set.
>
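
To make the proxy idea concrete, here is a minimal sketch of the filtering step, assuming it works by pattern-matching on the query text. The function names and the pattern list are hypothetical and far from exhaustive. A real proxy would more likely parse the query properly, since plain regular expressions can also match inside string literals and comments.

```python
import re

# Hypothetical, non-exhaustive patterns for the Blazegraph-specific
# constructs mentioned in this thread.
NON_STANDARD_PATTERNS = {
    "named subquery": re.compile(r"\bINCLUDE\s+%\w+", re.IGNORECASE),
    "bd: service or parameter": re.compile(r"\bbd:\w+"),
    "optimiser hint": re.compile(r"\bhint:\w+"),
}


def find_backend_specific_features(query: str) -> list[str]:
    """Return the names of backend-specific constructs found in a query."""
    return [
        name
        for name, pattern in NON_STANDARD_PATTERNS.items()
        if pattern.search(query)
    ]


def filter_query(query: str) -> str:
    """Return the query unchanged, or raise if it is not portable."""
    found = find_backend_specific_features(query)
    if found:
        raise ValueError(
            "query uses backend-specific features: " + ", ".join(found)
        )
    return query
```

Rejecting the query outright is only one possible policy. The same detection step could instead log a warning or route the query to a backend that supports the feature.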

Hi James,

...

 I have to say I am a bit concerned by this talk, since some of 
 Blazegraph's "features and quirks" can be exceedingly useful.

That isn't justification for tightly-coupling a Query Tool to a Query 
Service Endpoint, especially when an open standard (in the form of 
SPARQL) exists.

...

 In particular I would highlight **named subqueries** and 
 **Blazegraph's bd:sample service** as two "features and quirks" which 
 should not be suppressed lightly.

See my comment above.

...

 Use of named subqueries (ie queries that include an "INCLUDE 
 %subquery" line) is consistently popular in the "query of the week" 
 example queries featured in the weekly summary, and for good reasons:

 * they can make complex long queries far more readable
 * they can make optimisation of complex long queries a lot easier and 
 a lot more transparent (or even possible at all)
 * they can be essential to the performance of some queries, if there 
 is a particular retrieved set that those queries then recall to reuse 
 in more than one way.

 The Blazegraph syntax for this is elegant.

See my comments above, which are about architecture fundamentals and the 
virtues of loose-coupling.

...
 Ideally the dev teams of candidate replacements should be encouraged
 to support it.  Failing that at the very least a preprocessor should 
 be written to suitably adapt queries with an INCLUDE directive, so 
 that existing queries can continue to run.
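
A preprocessor of the kind described above could look roughly like the following sketch. It assumes the Blazegraph form `WITH { ... } AS %name` for the declaration and `INCLUDE %name` for each use, and it rewrites every use into an ordinary inline subquery. The function names are hypothetical. The brace matching is naive: it does not account for braces inside string literals or comments.

```python
import re

WITH_RE = re.compile(r"\bWITH\s*\{", re.IGNORECASE)
AS_RE = re.compile(r"\s*AS\s+%(\w+)", re.IGNORECASE)


def _matching_brace(text: str, open_pos: int) -> int:
    """Return the index of the '}' that closes the '{' at open_pos."""
    depth = 0
    for i in range(open_pos, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return i
    raise ValueError("unbalanced braces in query")


def inline_named_subqueries(query: str) -> str:
    """Rewrite WITH { ... } AS %name / INCLUDE %name into inline subqueries."""
    while True:
        with_match = WITH_RE.search(query)
        if with_match is None:
            return query
        open_pos = with_match.end() - 1
        close_pos = _matching_brace(query, open_pos)
        as_match = AS_RE.match(query, close_pos + 1)
        if as_match is None:
            raise ValueError("WITH block without 'AS %name'")
        name = as_match.group(1)
        body = query[open_pos:close_pos + 1]  # subquery including its braces
        # Remove the declaration, then substitute the body at each use.
        query = query[:with_match.start()] + query[as_match.end():]
        include_re = re.compile(
            r"\bINCLUDE\s+%" + re.escape(name) + r"\b", re.IGNORECASE
        )
        # A function replacement avoids backslash processing in the body.
        query = include_re.sub(lambda _m: body, query)
```

Inlining keeps existing queries runnable, but a backend may then evaluate the subquery once per use, so the performance benefit of computing a result set once and reusing it is not guaranteed to survive the rewrite.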

 In contrast, bd:sample is perhaps under-used and under-appreciated and 
 not so well known, but can also be very valuable.

 It allows a query writer to get a genuinely random sample of the
 uses of a particular triple pattern.

 For example, here's a query https://w.wiki/6NHo that I was asked for 
 recently, that finds the most common classes of items used as values 
 for P180 'depicts' statements on Commons.

 Sampling is essential here because there are now in excess of 19.8
 million P180 statements on Commons, and even more so because of the
 federated nature of the query, which means that only a few tens of
 thousands of results at most can be passed for analysis into any
 subquery to be run on WDQS against Wikidata.

 A feature like bd:sample is the only way to be able to do this kind of 
 analysis of structured data statements on Commons.
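
The reasoning behind sampling can be illustrated outside SPARQL. The sketch below is not how bd:sample is implemented. It only shows, on made-up data, that a uniform random sample that is much smaller than the full set is usually enough to rank the most common values. The function name and the example data are hypothetical.

```python
import random
from collections import Counter


def estimate_top_classes(values, sample_size, top_n=3, seed=None):
    """Estimate the most common values from a uniform random sample.

    Returns (value, estimated_share) pairs, most common first.
    """
    rng = random.Random(seed)
    sample = rng.sample(values, min(sample_size, len(values)))
    counts = Counter(sample)
    return [
        (value, count / len(sample))
        for value, count in counts.most_common(top_n)
    ]


# Made-up population standing in for the classes of 'depicts' values.
population = ["human"] * 7000 + ["painting"] * 2000 + ["building"] * 1000
top = estimate_top_classes(population, sample_size=500, seed=42)
```

The estimated shares only approximate the true ones, and the error shrinks as the sample grows, which is why a limit of a few tens of thousands of rows can still support this kind of analysis.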

 I regard named subqueries and bd:sample as particularly important. But
 beyond them, we need to make sure that any 'filter' does not remove
 Blazegraph optimiser directives, because if those do not reach
 Blazegraph, many queries that rely on them simply will not run
 (especially if named subqueries have also been made unavailable).

 Ways also need to be found to make sure that the geographical services
 wikibase:around() and wikibase:box(), the distance function
 geof:distance(), and the mwapi and labelling services continue to be
 available.

 Best regards,

    James.

Please digest my comments above, since they have nothing to do with how 
Blazegraph implements its Query Service Endpoint :)

-- 
Regards,

Kingsley Idehen	
Founder & CEO
OpenLink Software
Home Page: http://www.openlinksw.com
Community Support: https://community.openlinksw.com
