Re: [Wikidata-tech] Thoughts on (not) exposing a SPARQL endpoint

10 Mar 2015

Hi Daniel,

I can understand your thoughts to some extent, but they seem to apply to 
any potential solution. Committing to a primary query interface will 
always be, well, a commitment. Because of this, I think the big 
advantage of SPARQL is exactly that it is a technology standard that 
does not depend on a specific tool. If you want to minimize lock-in and be 
maximally future-safe, this seems to be a good thing.

No decision should be taken lightly, but I do not see any viable 
alternatives right now short of not offering any public query service 
(and keeping queries restricted to in-Wikipedia use). I would certainly 
not support the use of a tool-specific query language that is not 
specified anywhere but in running code. With SPARQL, you have a lot of 
tutorials, textbooks, and free (both in an accessibility and in an IP 
sense) standards. Moreover, you have broad support by a range of 
applications, organisations and companies. Similar advantages exist for 
other languages, but none I can think of now would be as appropriate 
(XQuery and SQL come to mind, but SPARQL would be much closer to our 
data model, even if not a perfect fit). WDQ is great but it is a custom 
API of a single tool rather than a query language.

I see easy ways to address your two main concerns:

* "WDQ would go away": That's not a worry I have at all. It will be easy 
to write an adaptor for WDQ to SPARQL and to keep up the service as it 
is for a long time. In fact, Magnus recently mentioned this as a plan of 
his on the mailing list. SPARQL has all features that WDQ has (most 
interesting here are regular path queries, which are called "TREE" by WDQ).

* "SPARQL would be too expressive, or could have non-standard extensions 
that are hard to support in the future": This can be addressed in two 
ways. The soft way is to document clearly which features are supported, 
and to maintain backwards compatibility only wrt these. The hard (as in 
firm, not as in difficult) way is to restrict queries to use only such a 
limited set of "safe" features. This is easy to do, since SPARQL query 
parsing and reformulation is already part of any DBMS that supports such 
queries, and it would be easy to hook into this process to restrict 
queries without any notable performance overhead. This would minimize 
vendor lock-in, since one would only commit to (a subset of) the fully 
standardized features.
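
To illustrate the "hard" restriction, here is a minimal sketch in Python. It is only a keyword-level approximation: a real deployment would hook into the SPARQL parser of the DBMS, as described above. The deny-list and the function name find_violations are assumptions made for this example, not an agreed feature set, and triple-quoted string literals are not handled.

```python
import re

# Hypothetical deny-list: features we might decide not to commit to.
FORBIDDEN_KEYWORDS = {"SERVICE", "LOAD", "CLEAR", "DROP", "CREATE", "INSERT", "DELETE"}

# Spans that may contain keyword-like text without being keywords.
_SKIP = re.compile(
    r'"(?:[^"\\\n]|\\.)*"'      # double-quoted string literal
    r"|'(?:[^'\\\n]|\\.)*'"     # single-quoted string literal
    r"|<[^<>\s]*>"              # IRI reference such as <http://example.org/>
    r"|#[^\n]*"                 # comment up to end of line
)

# Bare words only: skips variables (?x, $x) and both halves of prefixed names (a:b).
_WORD = re.compile(r"(?<![?$:\w])[A-Za-z]+(?![\w:])")


def find_violations(query):
    """Return a list of violations; an empty list means the query is accepted."""
    bare = _SKIP.sub(" ", query)
    violations = []
    # "<<" left over after removing IRIs indicates RDR/SPARQL* syntax.
    if "<<" in bare:
        violations.append("RDR/SPARQL* quoted triple")
    used = {word.upper() for word in _WORD.findall(bare)}
    violations += [f"keyword {k}" for k in sorted(used & FORBIDDEN_KEYWORDS)]
    return violations


if __name__ == "__main__":
    print(find_violations("SELECT ?item WHERE { ?item wdt:P31 wd:Q5 }"))
    print(find_violations("SELECT ?x WHERE { SERVICE <http://example.org/sparql> { ?x ?p ?o } }"))
```

The first call prints an empty list and the second reports the SERVICE keyword. A check of this kind runs once per query before execution, so its cost is negligible compared to query evaluation.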

With both of these in place, your concerns should be addressed without 
us having to build our own query language from scratch (including 
parsers, preprocessors, optimizers, user documentation, ...). Moreover, 
both of these can be added at any stage of the project, so we are not 
blocked now by having to decide all of these details. Right now the main 
priority should be to get something running rather than to go back to 
the drawing board.
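
To make the adaptor from the first point above a bit more concrete, here is a rough sketch in Python of how two WDQ constructs could be rewritten to SPARQL. It rests on several assumptions: my reading of WDQ's CLAIM[property:item] and TREE[root][forward properties][reverse properties] syntax, a TREE semantics of "everything reachable from the root", and the placeholder prefixes wd: and wdt:, which stand in for whatever RDF mapping is eventually used. The function name wdq_to_sparql is hypothetical, and a real adaptor would need to cover the rest of WDQ (AND/OR, multiple root items, and so on).

```python
import re

# Placeholder prefixed names; the actual RDF vocabulary is an assumption.
ITEM = "wd:Q{}"
PROP = "wdt:P{}"

_CLAIM = re.compile(r"^CLAIM\[(\d+):(\d+)\]$", re.IGNORECASE)
_TREE = re.compile(r"^TREE\[(\d+)\]\[([\d,]*)\]\[([\d,]*)\]$", re.IGNORECASE)


def _ids(text):
    """Split a comma-separated list of numeric ids, ignoring empty parts."""
    return [part for part in text.split(",") if part]


def wdq_to_sparql(wdq):
    """Translate a single CLAIM or TREE expression into a SPARQL SELECT query."""
    wdq = wdq.strip()
    claim = _CLAIM.match(wdq)
    tree = _TREE.match(wdq)
    if claim:
        prop, item = claim.groups()
        pattern = f"?item {PROP.format(prop)} {ITEM.format(item)} ."
    elif tree:
        root, forward, reverse = tree.groups()
        # Forward properties are followed as-is, reverse ones via the inverse path "^".
        steps = [PROP.format(p) for p in _ids(forward)]
        steps += ["^" + PROP.format(p) for p in _ids(reverse)]
        if not steps:
            raise ValueError("TREE needs at least one property to follow")
        path = "|".join(steps)
        # A SPARQL 1.1 property path replaces WDQ's TREE traversal.
        pattern = f"{ITEM.format(root)} ({path})* ?item ."
    else:
        raise ValueError(f"unsupported WDQ expression: {wdq!r}")
    return f"SELECT DISTINCT ?item WHERE {{ {pattern} }}"


if __name__ == "__main__":
    print(wdq_to_sparql("CLAIM[31:5]"))
    print(wdq_to_sparql("TREE[30][150][17,131]"))
```

The point is only that TREE maps onto a standard property path, so the translation is a purely syntactic rewrite that existing WDQ-based tools would not need to know about.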

Cheers,

Markus

On 10.03.2015 15:31, Daniel Kinzler wrote:
...
  Hi all!

 After the initial enthusiasm, I have grown increasingly wary of the prospect of
 exposing a SPARQL endpoint as Wikidata's canonical query interface. I decided to
 share my (personal and unfinished) thoughts about this on this list, as food for
 thought and a basis for discussion.

 Basically, I fear that exposing SPARQL will lock us in with respect to the
 backend technology we use. Once it's there, people will rely on it, and taking
 it away would be very harsh. That would make it practically impossible to move
 to, say, Neo4J in the future. This is even more true if we expose
 vendor-specific extensions like RDR/SPARQL*.

 Also, exposing SPARQL as our primary query interface probably means abruptly
 discontinuing support for WDQ. It's pretty clear that the original WDQ service
 is not going to be maintained once the WMF offers infrastructure for wikidata
 queries. So, when SPARQL appears, WDQ would go away, and dozens of tools will
 need major modifications, or would just die.

 So, my proposal is to expose a WDQ-like service as our primary query interface.
 This follows the general principle of having narrow interfaces to make it easy to
 swap out the implementation.

 But the power of SPARQL should not be lost: A (sandboxed) SPARQL endpoint could
 be exposed to Labs, just like we provide access to replicated SQL databases
 there: on Labs, you get "raw" access, with added performance and flexibility,
 but no guarantees about interface stability.

 In terms of development resources and timeline, exposing WDQ may actually get us
 a public query endpoint more quickly: sandboxing full SPARQL may well turn out
 to be a lot harder than sandboxing the more limited set of queries WDQ allows.

 Finally, why WDQ and not something else, say, MQL? Because WDQ is specifically
 tailored to our domain and use case, and there already is an ecosystem of tools
 that use it. We'd want to refine it a bit I suppose, but by and large, it's
 pretty much exactly what we need, because it was built around the actual demand
 for querying wikidata.

 Those are my current thoughts so far. Note that this is not a decision or recommendation
 by the Wikidata team, just my personal take.

 -- daniel
