Thoughts on (not) exposing a SPARQL endpoint


Daniel Kinzler

10 Mar 2015

2:31 p.m.

Hi all!

After the initial enthusiasm, I have grown increasingly wary of the prospect of exposing a SPARQL endpoint as Wikidata's canonical query interface. I decided to share my (personal and unfinished) thoughts about this on this list, as food for thought and a basis for discussion.

Basically, I fear that exposing SPARQL will lock us in with respect to the backend technology we use. Once it's there, people will rely on it, and taking it away would be very harsh. That would make it practically impossible to move to, say, Neo4J in the future. This is even more true if we expose vendor-specific extensions like RDR/SPARQL*.

Also, exposing SPARQL as our primary query interface probably means abruptly discontinuing support for WDQ. It's pretty clear that the original WDQ service is not going to be maintained once the WMF offers infrastructure for Wikidata queries. So, when SPARQL appears, WDQ would go away, and dozens of tools would need major modifications, or would just die.

So, my proposal is to expose a WDQ-like service as our primary query interface. This follows the general principle of having narrow interfaces to make it easy to swap out the implementation.

But the power of SPARQL should not be lost: a (sandboxed) SPARQL endpoint could be exposed to Labs, just like we provide access to replicated SQL databases there. On Labs, you get "raw" access, with added performance and flexibility, but no guarantees about interface stability.

In terms of development resources and timeline, exposing WDQ may actually get us a public query endpoint more quickly: sandboxing full SPARQL may well turn out to be a lot harder than sandboxing the more limited set of queries WDQ allows.

Finally, why WDQ and not something else, say, MQL? Because WDQ is specifically tailored to our domain and use case, and there already is an ecosystem of tools that use it. We'd want to refine it a bit, I suppose, but by and large it's pretty much exactly what we need, because it was built around the actual demand for querying Wikidata.

Those are my current thoughts. Note that this is not a decision or recommendation by the Wikidata team, just my personal take.

-- daniel

--
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.
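The narrow-interface argument can be illustrated with a minimal sketch. Everything named here (`QueryBackend`, `items_with_claim`, the two backends, the sample claims) is hypothetical and not part of WDQ or any Wikidata service; the point is only that callers which depend on one small operation keep working when the storage behind it is replaced.

```python
from abc import ABC, abstractmethod


class QueryBackend(ABC):
    """Hypothetical narrow interface: the only operation callers may rely on."""

    @abstractmethod
    def items_with_claim(self, prop, value):
        """Return the numeric item IDs that have the claim P<prop> = Q<value>."""


class InMemoryBackend(QueryBackend):
    """Stand-in for one store: scans a flat set of (item, property, value)."""

    def __init__(self, claims):
        self._claims = set(claims)

    def items_with_claim(self, prop, value):
        return {i for (i, p, v) in self._claims if p == prop and v == value}


class IndexedBackend(QueryBackend):
    """Stand-in for a different store: same answers, different internals."""

    def __init__(self, claims):
        self._index = {}
        for item, prop, value in claims:
            self._index.setdefault((prop, value), set()).add(item)

    def items_with_claim(self, prop, value):
        return set(self._index.get((prop, value), set()))


def humans(backend):
    # Client code sees only the narrow interface, never the backend's
    # own query language or data mapping.
    return backend.items_with_claim(31, 5)  # P31 "instance of" = Q5 "human"


# Sample data for illustration only: (item, property, value) as numeric IDs.
CLAIMS = {(42, 31, 5), (80, 31, 5), (64, 31, 515)}
```

Swapping `InMemoryBackend(CLAIMS)` for `IndexedBackend(CLAIMS)` does not change what `humans()` returns. A client written directly against one backend's full query language has no such guarantee.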


Markus Krötzsch

10 Mar

3:47 p.m.

Hi Daniel,

I can understand your thoughts to some extent, but they seem to apply to any potential solution. Committing to a primary query interface will always be, well, a commitment. Because of this, I think the big advantage of SPARQL is exactly that it is a technology standard that does not depend on a specific tool. If you want to minimize lock-in and be maximally future-safe, this seems to be a good thing.

No decision should be taken lightly, but I do not see any viable alternatives right now short of not offering any public query service (and keeping queries restricted to in-Wikipedia use). I would certainly not support the use of a tool-specific query language that is not specified anywhere but in running code. With SPARQL, you have a lot of tutorials, textbooks, and free (both in an accessibility and in an IP sense) standards. Moreover, you have broad support by a range of applications, organisations and companies. Similar advantages exist for other languages, but none I can think of now would be as appropriate (XQuery and SQL come to mind, but SPARQL would be much closer to our data model, even if not a perfect fit). WDQ is great but it is a custom API of a single tool rather than a query language.

I see easy ways to address your two main concerns:

* "WDQ would go away": That's not a worry I have at all. It will be easy to write an adaptor for WDQ to SPARQL and to keep up the service as it is for a long time. In fact, Magnus recently mentioned this as a plan of his on the mailing list. SPARQL has all features that WDQ has (most interesting here are regular path queries, which are called "TREE" by WDQ).

* "SPARQL would be too expressive, or could have non-standard extensions that are hard to support in the future": This can be addressed in two ways. The soft way is to document clearly which features are supported, and to maintain backwards compatibility only with respect to these. The hard (as in firm, not as in difficult) way is to restrict queries to use only such a limited set of "safe" features. This is easy to do, since SPARQL query parsing and reformulation is already part of any DBMS that supports such queries, and it would be easy to hook into this process to restrict queries without any notable performance overhead. This would minimize vendor lock-in, since one would only commit to (a subset of) the fully standardized features.

With both of these in place, your concerns should be addressed without us having to build our own query language from scratch (including parsers, preprocessors, optimizers, user documentation, ...). Moreover, both of these can be added at any stage of the project, so we are not blocked now by having to decide all of these details. Right now the main priority should be to get something running rather than to go back to the drawing board.

Cheers,

Markus

On 10.03.2015 15:31, Daniel Kinzler wrote:

...
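The "hard" restriction Markus describes can be pictured with a rough sketch. His suggestion is to hook into the query parser the DBMS already has; the token-level check below is only a stand-in for that, and the whitelist itself (`ALLOWED`) is a hypothetical policy, not an agreed feature set. It rejects any bare word that is not whitelisted, so function names and words inside `#` comments are rejected as well, which errs on the strict side.

```python
import re

# Hypothetical policy: bare keywords outside this set are rejected.
# "A" is the SPARQL shorthand for rdf:type.
ALLOWED = {
    "PREFIX", "SELECT", "DISTINCT", "WHERE", "FILTER", "OPTIONAL",
    "UNION", "ORDER", "BY", "LIMIT", "OFFSET", "A",
}

# String literals and IRIs are blanked out first so that words inside
# them are not mistaken for keywords.
_STRING = re.compile(r'"(?:[^"\\]|\\.)*"' + r"|'(?:[^'\\]|\\.)*'")
_IRI = re.compile(r"<[^<>\s]*>")

# A bare word is an identifier that is not a variable (?x, $x), not part
# of a prefixed name (wd:Q5), and not a language tag (@en).
_BARE_WORD = re.compile(r"(?<![?$:@\w])[A-Za-z_]\w*(?![\w:])")


def disallowed_keywords(query):
    """Return the sorted bare words in `query` that are not whitelisted."""
    text = _STRING.sub(" ", query)
    text = _IRI.sub(" ", text)
    words = {w.upper() for w in _BARE_WORD.findall(text)}
    return sorted(words - ALLOWED)


def is_safe(query):
    return not disallowed_keywords(query)
```

A query using only `PREFIX`, `SELECT`, `WHERE` and `LIMIT` passes, while one containing a `SERVICE` clause is rejected with that keyword reported. A production check would work on the parsed query rather than on text, as the message suggests.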

Daniel Kinzler

4:06 p.m.

On 10.03.2015 at 16:47, Markus Krötzsch wrote:

...

Committing to the broadest possible interface, even if it's a standard, is the problem I see, because it makes swapping out the backend close to impossible. I propose committing to an interface that is as narrow as it can be for our use case. That's general best practice in system design, I believe. Note that we are not only committing to a (standardized, but very complex) query language, but also to our data mapping. WDQ would abstract from that, and give us wiggle room to adjust the mapping later.

...

I would certainly not support the use of a tool-specific query language that is not specified anywhere but in running code.

Of course the language would need to be well specified, and modified in places. We'd want a production grammar and a decent parser (recursive descent, probably).

...

WDQ is great but it is a custom API of a single tool rather than a query language.

It would be our Domain Specific Language. There's a lot to be said for DSLs, if they are well documented.

...

* "WDQ would go away": That's not a worry I have at all. It will be easy to write an adaptor for WDQ to SPARQL and to keep up the service as it is for a long time.

That is exactly what I'm proposing. I'd just say that the WDQ version would be the canonical one, while the SPARQL one would be considered raw/unstable, like the SQL databases on Labs.

...

* "SPARQL would be too expressive, or could have non-standard extensions that are hard to support in the future": This can be addressed in two ways. The soft way is to document clearly which features are supported, and to maintain backwards compatibility only wrt these.

This documentation is unlikely to be complete, and people will use whatever "works now", and complain when it breaks. They *will* use vendor-specific features and optimizations, even if you tell them they shouldn't. And there will be trouble when those break.

...

The hard (as in firm, not as in difficult) way is to restrict queries to use only such a limited set of "safe" features. This is easy to do, since SPARQL query parsing and reformulation is already part of any DBMS that supports such queries, and it would be easy to hook into this process to restrict queries without any notable performance overhead. This would minimize vendor lock-in, since one would only commit to (a subset of) the fully standardized features.

That is the plan for sandboxing SPARQL. It's doable, but not easy. Implementing "safe" WDQ on top of SPARQL is going to be simpler and quicker, I think. It will give us a public query interface *faster*.

...

With both of these in place, your concerns should be addressed without us having to build our own query language from scratch (including parsers, preprocessors, optimizers, user documentation, ...).

With WDQ on top of SPARQL, we need a parser and a SPARQL emitter, that's it. Documentation is already there (well, to a degree), and optimization is provided by the SPARQL endpoint.
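As a rough idea of what "a parser and a SPARQL emitter" could look like, here is a minimal recursive descent sketch. It handles only a small assumed subset of WDQ (`CLAIM[prop:value]`, `AND`, `OR`, parentheses), and the `wd:`/`wdt:` prefixes stand in for a data mapping that had not been fixed when this was discussed; none of it is the actual WDQ grammar or an existing translator.

```python
import re

# Assumed prefixes for illustration; the real RDF mapping may differ.
PREFIXES = (
    "PREFIX wd: <http://www.wikidata.org/entity/> "
    "PREFIX wdt: <http://www.wikidata.org/prop/direct/> "
)

_TOKEN = re.compile(r"\s*(CLAIM|AND|OR|\d+|[\[\]():])")


def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if not match:
            raise ValueError(f"unexpected input at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    """Grammar: expr := term (OR term)*, term := factor (AND factor)*,
    factor := CLAIM [ int : int ] | ( expr )"""

    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise ValueError(f"expected {expected!r}, got {tok!r}")
        self.pos += 1
        return tok

    def number(self):
        tok = self.take()
        if not tok.isdigit():
            raise ValueError(f"expected a number, got {tok!r}")
        return tok

    def expr(self):
        parts = [self.term()]
        while self.peek() == "OR":
            self.take()
            parts.append(self.term())
        if len(parts) == 1:
            return parts[0]
        # OR becomes a UNION of group graph patterns.
        return " UNION ".join("{ " + p + " }" for p in parts)

    def term(self):
        parts = [self.factor()]
        while self.peek() == "AND":
            self.take()
            parts.append(self.factor())
        # AND is plain conjunction: patterns are written side by side.
        return " ".join(parts)

    def factor(self):
        if self.peek() == "(":
            self.take("(")
            inner = self.expr()
            self.take(")")
            return "{ " + inner + " }"
        self.take("CLAIM")
        self.take("[")
        prop = self.number()
        self.take(":")
        value = self.number()
        self.take("]")
        return f"?item wdt:P{prop} wd:Q{value} ."


def wdq_to_sparql(wdq):
    parser = Parser(tokenize(wdq))
    pattern = parser.expr()
    if parser.peek() is not None:
        raise ValueError(f"unexpected token {parser.peek()!r}")
    return PREFIXES + "SELECT ?item WHERE { " + pattern + " }"
```

Because clients only ever send the WDQ text, the emitted SPARQL (prefixes, property mapping, even the target language) can change without breaking them, which is the "wiggle room" argued for above.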

...

Moreover, both of these can be added at any stage of the project, so we are not blocked now by having to decide all of these details. Right now the main priority should be to get something running rather than to go back to the drawing board.

Magnus Manske

5:01 p.m.

Some thoughts:

* Either way, there will be a WDQ-like wrapper around SPARQL, maybe as the official interface, maybe only at the current WDQ URL (and I'll have to read up on SPARQL to write that, so if someone else writes it for me, all the better!)

* WDQ syntax is very limited (no references, no variables, etc.), but it covers a large number of use cases at this point in time

* A WDQ wrapper could add some sought-after functionality quite easily (regular expression label matching comes to mind), but it is probably not a long-term solution, given its limitations

* A WDQ-syntax interface would be a great proof of concept that the new solution can, at the very least, do what the current one does

On Tue, Mar 10, 2015 at 4:06 PM Daniel Kinzler <daniel.kinzler(a)wikimedia.de> wrote:

...

On 10.03.2015 at 16:47, Markus Krötzsch wrote:


Moreover, both of these can be added at any stage of the project, so we are not blocked now by having to decide all of these details. Right now the main priority should be to get something running rather than to go back to the drawing board.

Yes, absolutely, but what we make available publicly

1) has to be safe - I believe this is easier and faster to do with WDQ.
2) should be future proof - again, easier with WDQ, because it's more restrictive and domain specific. It allows us to change the underlying mapping or technology. SPARQL doesn't allow that easily.

In any case, I'm not saying we shouldn't make a SPARQL endpoint available at all. I'm saying it should not be the canonical query interface, but rather a "raw" query interface. That would give us a lot more headroom to change things later, without breaking a lot of 3rd party code.

-- daniel

--
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

_______________________________________________
Wikidata-tech mailing list
Wikidata-tech(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech

Thomas Tanon

5:22 p.m.

I support Magnus's point of view. WDQ is a very good proof of concept, but it is, I think, too limited to be the primary language of the Wikidata query system.

A possible solution is to support two query languages "as primary":

1. WDQ, at first, in order to have something working quickly.
2. A safe subset of SPARQL (if that is possible), implemented later using the experience gained from deploying the first version of the query system. Or, if that is not possible, an improved version of WDQ that overcomes its current limitations.

With that, I think we get the best of both worlds:

1. A simple language (WDQ) that allows a short road to production and keeps compatibility with previous systems.
2. A powerful language for advanced uses.
3. The assumption, built in from the start, that more than one query language may be used. That may be very useful in the future if we want to change again.

Thomas

...

On 10 March 2015 at 18:01, Magnus Manske <magnusmanske(a)googlemail.com> wrote:

Some thoughts:

* Either way, there will be a WDQ-like wrapper around SPARQL, maybe as the official interface, maybe only at the current WDQ URL (and I'll have to read up on SPARQL to write that, so if someone else writes it for me, all the better!)

* WDQ syntax is very limited (no references, no variables, etc.), but it covers a large number of use cases at this point in time

* A WDQ wrapper could add some sought-after functionality quite easily (regular expression label matching comes to mind), but it is probably not a long-term solution, given its limitations

* A WDQ-syntax interface would be a great proof of concept that the new solution can, at the very least, do what the current one does
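Thomas's third point, designing from the start for more than one query language, could be sketched as a small registry that dispatches on a language name. The names (`register`, `translate`, the `"wdq"` and `"sparql-subset"` keys) and the placeholder handlers are hypothetical; real handlers would parse WDQ or validate SPARQL against the agreed subset.

```python
# Maps a language name to a function that turns query text into whatever
# form the backend accepts (a plain string in this sketch).
_TRANSLATORS = {}


def register(language):
    """Decorator that registers a handler for one query language."""
    def decorator(handler):
        _TRANSLATORS[language.lower()] = handler
        return handler
    return decorator


@register("wdq")
def translate_wdq(text):
    # Placeholder: a real handler would parse WDQ and emit a backend query.
    return "/* from WDQ */ " + text.strip()


@register("sparql-subset")
def translate_sparql_subset(text):
    # Placeholder: a real handler would first check the safe-subset rules.
    return text.strip()


def supported_languages():
    return sorted(_TRANSLATORS)


def translate(language, text):
    try:
        handler = _TRANSLATORS[language.lower()]
    except KeyError:
        raise ValueError(f"unsupported query language: {language!r}") from None
    return handler(text)
```

Adding or retiring a language then means registering or removing one handler, without touching the callers of `translate()`.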