Hi all!
After the initial enthusiasm, I have grown increasingly wary of the prospect of exposing a SPARQL endpoint as Wikidata's canonical query interface. I decided to share my (personal and unfinished) thoughts about this on this list, as food for thought and a basis for discussion.
Basically, I fear that exposing SPARQL will lock us in with respect to the backend technology we use. Once it's there, people will rely on it, and taking it away would be very harsh. That would make it practically impossible to move to, say, Neo4J in the future. This is even more true if we expose vendor-specific extensions like RDR/SPARQL*.
Also, exposing SPARQL as our primary query interface probably means abruptly discontinuing support for WDQ. It's pretty clear that the original WDQ service is not going to be maintained once the WMF offers infrastructure for wikidata queries. So, when SPARQL appears, WDQ would go away, and dozens of tools will need major modifications, or would just die.
So, my proposal is to expose a WDQ-like service as our primary query interface. This follows the general principle of having narrow interfaces to make it easy to swap out the implementation.
But the power of SPARQL should not be lost: A (sandboxed) SPARQL endpoint could be exposed to Labs, just like we provide access to replicated SQL databases there: on Labs, you get "raw" access, with added performance and flexibility, but no guarantees about interface stability.
In terms of development resources and timeline, exposing WDQ may actually get us a public query endpoint more quickly: sandboxing full SPARQL may well turn out to be a lot harder than sandboxing the more limited set of queries WDQ allows.
Finally, why WDQ and not something else, say, MQL? Because WDQ is specifically tailored to our domain and use case, and there already is an ecosystem of tools that use it. We'd want to refine it a bit I suppose, but by and large, it's pretty much exactly what we need, because it was built around the actual demand for querying wikidata.
So far my current thoughts. Note that this is not a decision or recommendation by the Wikidata team, just my personal take.
-- daniel
Hi Daniel,
I can understand your thoughts to some extent, but they seem to apply to any potential solution. Committing to a primary query interface will always be, well, a commitment. Because of this, I think the big advantage of SPARQL is exactly that it is a technology standard that does not depend on a specific tool. If you want to minimize lock-in and be maximally future-proof, this seems to be a good thing.
No decision should be taken lightly, but I do not see any viable alternatives right now short of not offering any public query service (and keeping queries restricted to in-Wikipedia use). I would certainly not support the use of a tool-specific query language that is not specified anywhere but in running code. With SPARQL, you have a lot of tutorials, textbooks, and free (both in an accessibility and in an IP sense) standards. Moreover, you have broad support by a range of applications, organisations and companies. Similar advantages exist for other languages, but none I can think of now would be as appropriate (XQuery and SQL come to mind, but SPARQL would be much closer to our data model, even if not a perfect fit). WDQ is great but it is a custom API of a single tool rather than a query language.
I see easy ways to address your two main concerns:
* "WDQ would go away": That's not a worry I have at all. It will be easy to write an adaptor for WDQ to SPARQL and to keep up the service as it is for a long time. In fact, Magnus recently mentioned this as a plan of his on the mailing list. SPARQL has all features that WDQ has (most interesting here are regular path queries, which are called "TREE" by WDQ).
* "SPARQL would be too expressive, or could have non-standard extensions that are hard to support in the future": This can be addressed in two ways. The soft way is to document clearly which features are supported, and to maintain backwards compatibility only wrt these. The hard (as in firm, not as in difficult) way is to restrict queries to use only such a limited set of "safe" features. This is easy to do, since SPARQL query parsing and reformulation is already part of any DBMS that supports such queries, and it would be easy to hook into this process to restrict queries without any notable performance overhead. This would minimize vendor lock-in, since one would only commit to (a subset of) the fully standardized features.
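The "hard way" can be made concrete. Below is a minimal sketch of such a feature gate in Python (illustrative only: a real deployment would hook into the store's own SPARQL parser rather than scanning query text, and the keyword list here is an assumption, not a vetted safe subset):

```python
import re

# Hypothetical blocked-feature list for the sketch; a production gate
# would work on the parsed query algebra, not on raw text.
BLOCKED_KEYWORDS = {
    "SERVICE",                                    # federated queries
    "LOAD", "INSERT", "DELETE", "CLEAR", "DROP",  # update operations
    "CREATE", "COPY", "MOVE", "ADD",
}

def strip_literals_and_comments(query: str) -> str:
    """Remove string literals and comments so that keywords appearing
    inside them don't trigger false positives."""
    query = re.sub(r'"(?:[^"\\]|\\.)*"', '""', query)
    query = re.sub(r"'(?:[^'\\]|\\.)*'", "''", query)
    query = re.sub(r"#[^\n]*", "", query)
    return query

def is_safe(query: str) -> bool:
    """True if the query uses none of the blocked features."""
    tokens = re.findall(r"[A-Za-z_]+", strip_literals_and_comments(query))
    return not any(t.upper() in BLOCKED_KEYWORDS for t in tokens)
```

A query like `SELECT ?x WHERE { ?x wdt:P31 wd:Q5 }` passes, while anything using SERVICE or an update operation is rejected before it ever reaches the engine.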
With both of these in place, your concerns should be addressed without us having to build our own query language from scratch (including parsers, preprocessors, optimizers, user documentation, ...). Moreover, both of these can be added at any stage of the project, so we are not blocked now by having to decide all of these details. Right now the main priority should be to get something running rather than to go back to the drawing board.
Cheers,
Markus
On 10.03.2015 15:31, Daniel Kinzler wrote:
On 10.03.2015 at 16:47, Markus Krötzsch wrote:
Hi Daniel,
I can understand your thoughts to some extent, but they seem to apply to any potential solution. Committing to a primary query interface will always be, well, a committment. Because of this, I think the big advantage of SPARQL is exactly that it is a technology standard that is not depending on a specific tool. If you want to minimize lock-in and be maximally future-safe, this seems to be a good thing.
Committing to the broadest possible interface, even if it's a standard, is the problem I see, because it makes swapping out the backend close to impossible. I propose committing to an interface that is as narrow as it can be for our use case. That's general best practice in system design, I believe.
Note that we are not only committing to a (standardized, but very complex) query language, but also to our data mapping. WDQ would abstract from that, and give us wiggle room to adjust the mapping later.
I would certainly not support the use of a tool-specific query language that is not specified anywhere but in running code.
Of course the language would need to be well specified, and modified in places. We'd want a production grammar, and a decent parser (recursive descent, probably).
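For a sense of scale: a toy recursive-descent parser for a small WDQ-like fragment fits in a few dozen lines. The grammar and tuple AST below are assumptions for the sketch, not the actual WDQ grammar:

```python
import re

# Illustrative grammar for a tiny WDQ-like fragment:
#   expr  := "claim" "[" NUM ":" value "]"
#          | "tree" "[" NUM? "]" "[" NUM? "]" "[" NUM? "]"
#   value := NUM | "(" expr ")"

class Parser:
    def __init__(self, text):
        # Tokenize into numbers, keywords, and punctuation.
        self.tokens = re.findall(r"\d+|[a-z]+|[\[\]():]", text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, tok):
        if self.peek() != tok:
            raise SyntaxError(f"expected {tok!r}, got {self.peek()!r}")
        self.pos += 1

    def number(self):
        tok = self.peek()
        if tok is None or not tok.isdigit():
            raise SyntaxError(f"expected number, got {tok!r}")
        self.pos += 1
        return int(tok)

    def expr(self):
        if self.peek() == "claim":
            self.expect("claim")
            self.expect("[")
            prop = self.number()
            self.expect(":")
            val = self.value()
            self.expect("]")
            return ("claim", prop, val)
        if self.peek() == "tree":
            self.expect("tree")
            args = []
            for _ in range(3):          # tree takes three bracket groups
                self.expect("[")
                if self.peek() is not None and self.peek().isdigit():
                    args.append(self.number())
                else:
                    args.append(None)   # empty bracket group
                self.expect("]")
            return ("tree", args[0], args[1], args[2])
        raise SyntaxError(f"unexpected token {self.peek()!r}")

    def value(self):
        if self.peek() == "(":
            self.expect("(")
            inner = self.expr()
            self.expect(")")
            return inner
        return self.number()

def parse(text):
    parser = Parser(text)
    tree = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"trailing input: {parser.peek()!r}")
    return tree
```

For example, parse("claim[31:(tree[12280][][279])]") yields the nested tuple ("claim", 31, ("tree", 12280, None, 279)).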
WDQ is great but it is a custom API of a single tool rather than a query language.
It would be our Domain Specific Language. There's a lot to be said for DSLs, if they are well documented.
- "WDQ would go away": That's not a worry I have at all. It will be easy to write an adaptor for WDQ to SPARQL and to keep up the service as it is for a long time.
That is exactly what I'm proposing. I'd just say that the WDQ version would be the canonical one, while the SPARQL one would be considered raw/unstable, like the SQL databases on Labs.
- "SPARQL would be too expressive, or could have non-standard extensions that are hard to support in the future": This can be addressed in two ways. The soft way is to document clearly which features are supported, and to maintain backwards compatibility only wrt these.
This documentation is unlikely to be complete, and people will use whatever "works now", and complain when it breaks. They *will* use vendor-specific features and optimizations, even if you tell them they shouldn't. And there will be trouble when those break.
The hard (as in firm, not as in difficult) way is to restrict queries to use only such a limited set of "safe" features. This is easy to do, since SPARQL query parsing and reformulation is already part of any DBMS that supports such queries, and it would be easy to hook into this process to restrict queries without any notable performance overhead. This would minimize vendor lock-in, since one would only commit to (a subset of) the fully standardized features.
That is the plan for sandboxing SPARQL. It's doable, but not easy. Implementing "safe" WDQ on top of SPARQL is going to be simpler and quicker, I think. It will give us a public query interface *faster*.
With both of these in place, your concerns should be addressed without us having to build our own query language from scratch (including parsers, preprocessors, optimizers, user documentation, ...).
With WDQ on top of SPARQL, we need a parser and a SPARQL emitter, that's it. Documentation already exists (well, to a degree), and optimization is provided by the SPARQL endpoint.
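To illustrate the emitter half: a minimal sketch that turns a pre-parsed WDQ-like claim into a SPARQL graph pattern. The tuple AST, the `wdt:`/`wd:` prefixes, and the one-triple-per-statement mapping are all assumptions of the sketch, since the final RDF mapping was still being settled:

```python
def to_sparql(expr, var="?item"):
    """Emit a SPARQL graph pattern for a tiny WDQ-like fragment.
    ASTs are tuples: ("claim", prop, value), where value is either an
    item id or a nested ("tree", root, forward, backward) expression."""
    kind = expr[0]
    if kind == "claim":
        _, prop, value = expr
        if isinstance(value, tuple) and value[0] == "tree":
            _, root, _forward, backward = value
            # claim[P:(tree[root][][B])] ~ things whose P value reaches
            # root via zero or more B steps, i.e. a SPARQL property path.
            return f"{var} wdt:P{prop}/wdt:P{backward}* wd:Q{root} ."
        return f"{var} wdt:P{prop} wd:Q{value} ."
    raise ValueError(f"unsupported expression: {expr!r}")
```

Under these assumptions, to_sparql(("claim", 31, ("tree", 12280, None, 279))) emits the property-path pattern `?item wdt:P31/wdt:P279* wd:Q12280 .`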
Moreover, both of these can be added at any stage of the project, so we are not blocked now by having to decide all of these details. Right now the main priority should be to get something running rather than to go back to the drawing board.
Yes, absolutely, but what we make available publicly
- has to be safe - I believe this is easier and faster to do with WDQ.
- should be future-proof - again, easier with WDQ, because it's more restrictive and domain-specific. It allows us to change the underlying mapping or technology; SPARQL doesn't easily allow that.
In any case, I'm not saying we shouldn't make a SPARQL endpoint available at all. I'm saying it should not be the canonical query interface, but rather a "raw" query interface. That would give us a lot more headroom to change things later, without breaking a lot of 3rd party code.
-- daniel
Some thoughts:
- Either way, there will be a WDQ-like wrapper around SPARQL, maybe as the official interface, maybe only at the current WDQ URL (and I'll have to read up on SPARQL to write that, so if someone else writes it for me, all the better!)
- WDQ syntax is very limited (no references, no variables, etc.), but it covers a large amount of use cases at this point in time
- A WDQ wrapper could add some sought-after functionality quite easily (regular-expression label matching comes to mind), but it is probably not a long-term solution, given its limitations
- A WDQ-syntax interface would be a great proof of concept that the new solution can, at the very least, do what the current one does
On Tue, Mar 10, 2015 at 4:06 PM Daniel Kinzler daniel.kinzler@wikimedia.de wrote:
I support Magnus's point of view. WDQ is a very good proof of concept but is, I think, too limited to be the primary language of the Wikidata query system.
A possible solution is maybe to support two query languages "as primary":
1. WDQ, at first, in order to have something working quickly.
2. A safe subset of SPARQL (if it is possible), implemented later using the experience gained from the deployment of the first version of the query system. Or, if that is not possible, an improved version of WDQ that would break its current limitations.
With that I think we have the best of both worlds:
1. A simple language (WDQ) that allows a short road to production and keeps compatibility with previous systems.
2. A powerful language for advanced uses.
3. The assumption, from the start, that more than one query language may be used - an assumption that may be very useful in the future if we want to change again.
Thomas
On 10 March 2015 at 18:01, Magnus Manske magnusmanske@googlemail.com wrote:
On 10.03.2015 at 18:22, Thomas Tanon wrote:
I support Magnus's point of view. WDQ is a very good proof of concept but is, I think, too limited to be the primary language of the Wikidata query system.
It can be extended. What I want is a limited domain specific language tailored to our primary use cases. Having it largely compatible with WDQ would be great.
I did not mean to imply that we have to accept the current limitations of WDQ. I'm arguing that we should impose sensible limitations on queries, instead of committing to support everything that is possible with SPARQL.
A possible solution is maybe to support two query languages "as primary": 1. WDQ, at first, in order to have something working quickly. 2. A safe subset of SPARQL (if it is possible), implemented later using the experience gained from the deployment of the first version of the query system. Or, if that is not possible, an improved version of WDQ that would break its current limitations.
Absolutely. I'd like to avoid any commitment to keeping the SPARQL interface stable, though. That's why I'd limit it to labs-based usage.
-- daniel
TL;DR: No concrete issues with SPARQL have been mentioned so far; OTOH many *simple* SPARQL queries are not possible in WDQ; there is still time to restrict ourselves -- let's give SPARQL a chance before going back.
Hi Daniel,
This discussion is way too abstract. I am missing hard facts about the claimed problems with SPARQL. Nik and Stas have made a careful analysis of the options, while your concerns are mostly high-level worries.
You say that SPARQL would lock us into one tool. As I recall, however, there were at least four different open source SPARQL processors in the list considered by Nik and Stas (BlazeGraph, Virtuoso, 4Store, Jena). Moreover, these are partially based on free libraries that do part of the task (such as query parsing). Of all options considered, this is clearly the most widely supported one with most tools available. It's far from perfect, but this is the same with any option we considered.
You also say that SPARQL is too complex. It is true that SPARQL is a full-featured query language, and that such languages tend to be complex. Nevertheless, this is hardly an effective argument, given that we are all using many extremely complex technologies in our day-to-day work. We need to bring this to a technical level here. If you think a particular feature or group of features is too complex to support, please let us know and we can discuss this.
In order to support WDQ, we would already need many features of SPARQL. If we start with a simplified one-triple-per-statement graph, then I don't actually see how SPARQL is more complex than WDQ. It is more verbose (you would have to write something like "P31" or even "wd:P31" instead of just "31") but this is actually quite useful if you ever want to be able to query over multiple Wikidata instances (such as future Commons and Wikidata in one store).
WDQ also has complex features, for good reasons. For example, to find all things that are instances of subclasses of bridge, you could use the following query patterns:
WDQ:    claim[31:(tree[12280][][279])]
SPARQL: ?X (P31/P279*) Q12280
WDQ can be more concise than SPARQL where it uses pre-defined query patterns, such as "between" to specify an interval with a single construct, but this seems to be a rather syntactic matter. The real big difference between SPARQL and WDQ is that the former has variables in queries while the latter has not. Both have their merits, but as far as query languages go, the version with variables is by far the more common.
This difference leads to real restrictions. Here are some examples of things that you cannot find in WDQ but that are easy to find in SPARQL:
- People who died in the same city that they were born in.
- "Legitimate children" (children of parents who are married to each other).
- People who are their own father (likely an error).
- Cycles in subclass hierarchies.
Already those queries I find worth going to SPARQL for. Most of these examples do not need any feature other than (AND) pattern matching (the cycle query needs property path expressions: "?X (P279*) ?X"). Besides this, I think we need UNION (or), some of the common FILTERs (range comparisons for dates and numbers), and a geo extension (maybe the most critical part). Is that really so complex?
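Spelled out in the same abbreviated notation as the example above (property IDs per Wikidata: P19 place of birth, P20 place of death, P22 father, P26 spouse, P40 child, P279 subclass of; the exact surface syntax would depend on the final RDF mapping, so treat these as sketches):

```sparql
# People who died in the same city that they were born in:
SELECT ?p WHERE { ?p P19 ?city . ?p P20 ?city . }

# "Legitimate children": children whose parents are married to each other:
SELECT ?c WHERE { ?m P40 ?c . ?f P40 ?c . ?m P26 ?f . }

# People who are their own father (likely an error):
SELECT ?p WHERE { ?p P22 ?p . }

# Cycles in the subclass hierarchy (a property path of one or more steps):
SELECT ?x WHERE { ?x P279+ ?x . }
```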
Now you can reply: "People can always go to labs to run these queries." I am not convinced. If we see good use of a technology and have the means to support it, then we don't serve our users well by restricting it to an experimental service on labs.
Best wishes,
Markus
On 10.03.2015 18:32, Daniel Kinzler wrote:
To be fair, the discussion is not "what will we do till the end of time", but rather "what do we start with".
Since we know neither SPARQL nor the data storage engine terribly well, it would not be good if the service could be DoSed by innocent-looking queries, intentional or not. Exposing only a subset of SPARQL (in this case, via a WDQ wrapper) initially would be a way to test the waters. A proper SPARQL API can be exposed at any time later, once we're confident it will hold up.
This seems more like a technical decision in terms of "operational security", rather than a philosophical one about the merits of query languages (where SPARQL is undoubtedly more powerful than WDQ).
On Tue, Mar 10, 2015 at 10:17 PM Markus Krötzsch <markus@semantic-mediawiki.org> wrote:
TL;DR: No concrete issues with SPARQL were mentioned so far; OTOH many *simple* SPARQL queries are not possible in WDQ; there is still time to restrict ourselves -- let's give SPARQL a chance before going back.
Hi Daniel,
This discussion is way too abstract. I am missing hard facts about the claimed problems with SPARQL. Nik and Stas have made a careful analysis of the options, while your concerns are mostly high-level worries.
You say that SPARQL would lock us into one tool. As I recall, however, there were at least four different open source SPARQL processors in the list considered by Nik and Stas (BlazeGraph, Virtuoso, 4Store, Jena). Moreover, these are partially based on free libraries that do part of the task (such as query parsing). Of all options considered, this is clearly the most widely supported one with most tools available. It's far from perfect, but this is the same with any option we considered.
You also say that SPARQL is too complex. It is true that SPARQL is a full-featured query language, and that such languages tend to be complex. Nevertheless, this is hardly an effective argument, given that we are all using many extremely complex technologies in our day-to-day work. We need to bring this to a technical level here. If you think a particular feature or group of features is too complex to support, please let us know and we can discuss this.
In order to support WDQ, we would already need many features of SPARQL. If we start with a simplified one-triple-per-statement graph, then I don't actually see how SPARQL is more complex than WDQ. It is more verbose (you would have to write something like "P31" or even "wd:P31" instead of just "31") but this is actually quite useful if you ever want to be able to query over multiple Wikidata instances (such as future Commons and Wikidata in one store).
WDQ also has complex features, for good reasons. For example, to find all things that are instances of subclasses of bridge, you could use the following query patterns:
WDQ: claim[31:(tree[12280][][279])] SPARQL: ?X (P31/P279*) Q12280
WDQ can be more concise than SPARQL where it uses pre-defined query patterns, such as "between" to specify an interval with a single construct, but this seems to be rather syntactic. The real big difference between SPARQL and WDQ is that the former has variables in queries while the latter has not. Both has its merits, but as far as query languages go, the version with variables is by far the most common.
This difference leads to real restrictions. Here are some examples of things that you cannot find in WDQ but that are easy to find in SPARQL:
- People who died in the same city that they were born in.
- "Legitimate children" (children of parents who are married to each other)
- People who are their own father (likely an error).
- Cycles in subclass hierarchies.
Those queries alone I find worth going to SPARQL for. Most of these examples do not need any feature other than (AND) pattern matching (the cycle query needs property path expressions: "?X (P279*) ?X"). Besides this, I think we need UNION (or), some of the common FILTERs (range comparisons for dates and numbers), and a geo extension (maybe the most critical part). Is that really so complex?
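For illustration, the first example needs nothing beyond a shared variable (property IDs hedged: assuming P19 = place of birth and P20 = place of death, with a one-triple-per-statement mapping):

```sparql
# People who died in the same city they were born in.
# Assumes P19 = place of birth, P20 = place of death; the prefixes reflect
# a hypothetical direct-triple mapping, not a finalized one.
PREFIX wd:  <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT ?person ?city WHERE {
  ?person wdt:P31 wd:Q5 .   # instance of: human
  ?person wdt:P19 ?city .   # place of birth
  ?person wdt:P20 ?city .   # place of death, bound to the same city
}
```

WDQ cannot express this, because there is no way to say "the same city" across two claims without a variable.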
Now you can reply: "People can always go to labs to run these queries." I am not convinced. If we see good use of a technology and have the means to support it, then we don't serve our users well by restricting it to an experimental service on labs.
Best wishes,
Markus
On 10.03.2015 18:32, Daniel Kinzler wrote:
On 10.03.2015 18:22, Thomas Tanon wrote:
I support Magnus's point of view. WDQ is a very good proof of concept but is, I think, too limited to be the primary language of the Wikidata query system.
It can be extended. What I want is a limited domain-specific language tailored to our primary use cases. Having it largely compatible with WDQ would be great.
I did not mean to imply that we have to accept the current limitations of WDQ.
I'm arguing that we should impose sensible limitations on queries, instead of committing to support everything that is possible with SPARQL.
A possible solution is maybe to support two query languages "as primary":
1. WDQ, at first, in order to have something working quickly.
2. A safe subset of SPARQL (if it is possible), implemented later using the experience gained from the deployment of the first version of the query system. Or, if it is not possible, an improved version of WDQ that would break its current limitations.
Absolutely. I'd like to avoid any commitment to keeping the SPARQL interface stable, though. That's why I'd limit it to labs-based usage.
-- daniel
Wikidata-tech mailing list Wikidata-tech@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
On 11.03.2015 00:44, Magnus Manske wrote:
To be fair, the discussion is not "what will we do till the end of time", rather "what do we start with".
Knowing neither SPARQL nor the data storage engine terribly well: it would not be helpful if the service could be DoSed by innocent-looking queries, intentional or not. Exposing only a subset of SPARQL (in this case, via a WDQ wrapper) initially would be a way to test the waters. A proper SPARQL API can be exposed at any time later, once we're confident it will hold up.
This seems more like a technical decision in terms of "operational security", rather than a philosophical one about the merits of query languages (where SPARQL is undoubtedly more powerful than WDQ).
Sure, but my point is that there is zero evidence right now that such a WDQ wrapper would be more robust against intentional DOS. As I explained in my email, such a wrapper would still use a significant amount of SPARQL features in the back. I am sure there will be cases when the new service will go down (we have seen it happening to WDQ and, more generally, to Wikipedia, in the past). What I don't see is how the use of a WDQ API on top of SPARQL would make the overall setup any less vulnerable; it mainly introduces an additional component on top of SPARQL, and we can have a simpler SPARQL-based filter component there if we want, which is likely to be more effective in controlling usage. The only thing that could really lead to a more robust setup would be the use of a more robust backend engine, and I don't see what this should be.
The discussion here is not about which query language we should use. What Daniel proposes is to give up on supporting a standard query language and restrict ourselves to a special-purpose API. This is a big deal. If we really want a special-purpose query language for ourselves, we would need to have a discussion about it. WDQ is a useful baseline, but it is the result of an evolution of ideas and features over time. One would probably come up with a few different decisions when seeing the whole picture from the start. There is a huge cost to designing a query API from scratch, and I would really like to avoid this. Supporting WDQ on top of SPARQL would retain WDQ in its current form and still support standards -- if we want to develop an official custom API, we will give up on both of these benefits, and at the same time push the ETA for Wikidata queries far into the future.
All of this has been discussed and considered in the past. I don't see why one would be kicking off discussions now that question everything decided in meetings and telcos over the past weeks. There is absolutely no new information compared to what has led to the consensus that we all (including Daniel) had reached.
Regards,
Markus
On 11.03.2015 10:08, Markus Krötzsch wrote:
What I don't see is how the use of a WDQ API on top of SPARQL would make the overall setup any less vulnerable; it mainly introduces an additional component on top of SPARQL, and we can have a simpler SPARQL-based filter component there if we want, which is likely to be more effective in controlling usage.
I disagree on both points: I believe it would be neither simpler, nor more effective. That's pretty much the core of it.
However, I admit that this is currently a gut feeling, a concern I want to share and discuss. It should be investigated before making a decision.
There is a huge cost to designing a query API from scratch, and I would really like to avoid this.
Which is why I want to use one that already exists (WDQ), and back it by something that already exists (SPARQL).
Supporting WDQ on top of SPARQL would retain WDQ in its current form and still support standards --
That's exactly what I propose.
if we want to develop an official custom API, we will give up on both of these benefits, and at the same time push the ETA for Wikidata queries far into the future.
I disagree. If, as I believe, sandboxing WDQ is simpler than sandboxing SPARQL, using WDQ would allow us to have a public query API sooner. But whether my belief is correct needs to be investigated, of course.
All of this has been discussed and considered in the past. I don't see why one would be kicking off discussions now that question everything decided in meetings and telcos over the past weeks. There is absolutely no new information compared to what has led to the consensus that we all (including Daniel) had reached.
The consensus as I remember it was "we should be able to expose SPARQL safely, if we invest enough time to sandbox it". The issue of lock-in was mentioned but not really assessed. The relative cost of sandboxing WDQ vs SPARQL, and the impact on the ETA, was not discussed much. The ad-hoc evaluation spreadsheet shows WDQ as second to SPARQL (before MQL and ASK), mainly because SPARQL is more powerful.
The downside of that power doesn't factor into the evaluation, nor does the factor of lock-in. Shifting the relative weight in the spreadsheet from power to sustainability makes WDQ come out on top.
After the initial enthusiasm, this has made me increasingly uneasy over the last weeks. Hence my mail to this list.
On Tue, Mar 10, 2015 at 6:17 PM, Markus Krötzsch < markus@semantic-mediawiki.org> wrote:
TL;DR: No concrete issues with SPARQL were mentioned so far; OTOH many *simple* SPARQL queries are not possible in WDQ; there is still time to restrict ourselves -- let's give SPARQL a chance before going back.
TLDR, so SPARQL is the one true way.
Nik and Stas have made a careful analysis of the options, ...
citation please
Tom
On 11.03.2015 05:59, Tom Morris wrote:
On Tue, Mar 10, 2015 at 6:17 PM, Markus Krötzsch <markus@semantic-mediawiki.org> wrote:
TL;DR: No concrete issues with SPARQL were mentioned so far; OTOH many *simple* SPARQL queries are not possible in WDQ; there is still time to restrict ourselves -- let's give SPARQL a chance before going back.
TLDR, so SPARQL is the one true way.
That's the danger of giving a TL;DR: people can misunderstand it and then use it as a strawman in arguments. My bad. I suggest you read the rest of the email and comment on this. The discussion is too complex and too important to be reduced to three lines.
Nik and Stas have made a careful analysis of the options, ...
citation please
I was referring to the investigations that have led to this spreadsheet:
https://docs.google.com/a/wikimedia.org/spreadsheets/d/1MXikljoSUVP77w7JKf9E...
The choice for SPARQL was not made by me or by anyone who has a special interest in pushing this particular formalism (in fact Nik and Stas can confirm that I have been quite sceptical about the feasibility of using BlazeGraph at first). It was the result of an open-minded discussion among people with very different backgrounds, in search of the most promising technology for our problem. I agree that one could continue this discussion and analysis, but we need to have a balance between theoretical discussions and practical work. It might well happen that we will give up on BlazeGraph and/or SPARQL as the result of practical experiences, but it would be foolish to give up now without even trying.
Markus
On 11.03.2015 10:43, Markus Krötzsch wrote:
I was referring to the investigations that have led to this spreadsheet:
https://docs.google.com/a/wikimedia.org/spreadsheets/d/1MXikljoSUVP77w7JKf9E...
That's the backend evaluation spreadsheet. I'm not arguing against BlazeGraph as a backend at all.
I'm questioning the outcome of the public query language evaluation as shown in this sheet:
https://docs.google.com/a/wikimedia.de/spreadsheets/d/16bbifhuoAiO7bRQ2-0mYU...
Have a look at the weights, and at the comments, especially Gabriel's.
-- daniel
On 11.03.2015 11:26, Daniel Kinzler wrote:
On 11.03.2015 10:43, Markus Krötzsch wrote:
I was referring to the investigations that have led to this spreadsheet:
https://docs.google.com/a/wikimedia.org/spreadsheets/d/1MXikljoSUVP77w7JKf9E...
That's the backend evaluation spreadsheet. I'm not arguing against BlazeGraph as a backend at all.
I'm questioning the outcome of the public query language evaluation as shown in this sheet:
https://docs.google.com/a/wikimedia.de/spreadsheets/d/16bbifhuoAiO7bRQ2-0mYU...
Have a look at the weights, and at the comments, especially Gabriel's.
Right, but the overall conclusion still was to use SPARQL there, and this made further discussion of particular scores irrelevant. As it is, the sheet wildly mis-estimates the relative prominence of SPARQL and WDQ (e.g., "documentation" and "support from people"). Search for "SPARQL" on Amazon to get a rough idea. There are a number of free and commercial products implementing it. I have been teaching SPARQL to computer science students for at least 5 years, and I know many other people who do. The DBpedia community is using it on Wikipedia-based data. If you have a SPARQL-related question, ask at public-sparql-dev@w3.org; there is usually good support there.
This is really comparing apples and oranges, and it would not do justice to Magnus's work to put him up against an established technology standard. WDQ is great for what it does, but if we go "official" we should move towards what people outside of the Wikidata cosmos are using. After all, this is the main target group for a public query endpoint.
Markus
Hi!
The choice for SPARQL was not made by me or by anyone who has a special interest in pushing this particular formalism (in fact Nik and Stas can confirm that I have been quite sceptical about the feasibility of using BlazeGraph at first). It was the result of an open-minded discussion
We all had some skepticism, not specifically about BlazeGraph but about RDF in general as an underlying data model, due to the significant complexity in Wikidata's own data, which requires some work to fit into the triples model. After constructing the big spreadsheet, analyzing all the options, and thinking a bit more about the data model and its usage, we changed our opinion, however, and decided that the problems we face are solvable and that solving them would be the way to go.
BlazeGraph specifically emerged as the best available solution due to a combination of features, extensibility, openness, and the support provided by their team. The fact that we are building on existing technology (RDF/SPARQL) with developed practices was a factor, but not the only or overriding one.
We cannot claim we know with absolute certainty the one best way to proceed. We can, however, make an honest effort to evaluate all available options and choose the one that we perceive to be the best at the moment. That's what we did. Of course, as we gain more experience and as environments change, we may add another option or even arrive at the conclusion that we were mistaken. There's no guarantee against that. But for now we're proceeding with what we have as the best.
As for WDQ, being a simple language it is probably not hard to translate to SPARQL. I'm not sure it would be good SPARQL, but I hope the query optimizer will take care of that (yeah, I know it's not magic, but we'll see). I'll try to put something together pretty soon and see how it behaves. Some of the WDQ features, such as wide branching with "OR" options, may be quite inefficient in SPARQL, but we can generate it anyway. I'll update when I have something interesting (probably next week).
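As a rough sketch of what such a translation could produce, a WDQ query like `claim[31:5] AND claim[19:64]` (humans born in Berlin, assuming those IDs) might come out as:

```sparql
# WDQ: claim[31:5] AND claim[19:64]
# Sketch only; the prefixes and exact property URIs depend on the final
# RDF mapping, which is still being worked out.
PREFIX wd:  <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT ?item WHERE {
  ?item wdt:P31 wd:Q5 .    # claim[31:5]
  ?item wdt:P19 wd:Q64 .   # claim[19:64]
}
```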
My basic worries with exposing powerful query languages like SPARQL publicly are that a) there is a large attack surface in the query processing backend, and b) a client can request very expensive operations on the server without performing much work itself. Timeouts can limit the damage, but if they are set reasonably low (<1 min) they will also eliminate some of the supposed power of SPARQL, especially if the data set grows at the rate we all hope for. When reaching the timeout, the client needs to switch to iterative processing and paging. How well does BlazeGraph support paging of complex SPARQL queries without re-calculating the entire result set?
One of the things I like about the MQL design is that they are careful about identifying a couple of main hierarchies (typeOf, geographical containment, taxonomies, ...) that they can efficiently flatten into denormalized plain index lookups. These are very fast and easy to page.
From what I have seen so far, they also seem to directly cover most use cases that people have come up with so far. While perhaps too limiting in the longer term, I think such a limited 80/20 design would be a better starting point for a high-volume public API with strong availability and response time guarantees. The efficient subset of the API could then be enriched with more expensive endpoints over time, but those would explicitly not have the same performance guarantees as the core API. Those expensive queries could be executed on a separate cluster / set of machines to avoid interference with the core API.
Another aspect that I think warrants serious attention for an API is the complexity and reliability of constructing queries programmatically. As witnessed by the many issues around seemingly simple languages like SQL, building up query strings from user-supplied values is easy to get wrong. It is always possible to build friendly query languages on top of a JSON API, but it would IMHO be a waste of developer time to repeatedly have to deal with encoding issues and bugs in each client. This doesn't rule out SPARQL (it has a JSON encoding), but I think it's a significant disadvantage of using a custom string syntax like WDQ in the API.
Gabriel
On Wed, Mar 11, 2015 at 4:50 PM, Gabriel Wicke gwicke@wikimedia.org wrote:
My basic worries with exposing powerful query languages like SPARQL publicly are that a) there is a large attack surface in the query processing backend, and b) a client can request very expensive operations on the server without performing much work itself. Timeouts can limit the damage, but if they are set reasonably low (<1 min) they will also eliminate some of the supposed power of SPARQL, especially if the data set grows at the rate we all hope for. When reaching the timeout, the client needs to switch to iterative processing and paging. How well does BlazeGraph support paging of complex SPARQL queries without re-calculating the entire result set?
One of the things I like about the MQL design is that they are careful about identifying a couple of main hierarchies (typeOf, geographical containment, taxonomies, ...) that they can efficiently flatten into denormalized plain index lookups. These are very fast and easy to page. From what I have seen so far, they also seem to directly cover most use cases that people have come up with so far. While perhaps too limiting in the longer term, I think such a limited 80/20 design would be a better starting point for a high-volume public API with strong availability and response time guarantees. The efficient subset of the API could then be enriched with more expensive endpoints over time, but those would explicitly not have the same performance guarantees as the core API. Those expensive queries could be executed on a separate cluster / set of machines to avoid interference with the core API.
Another aspect that I think warrants serious attention for an API is the complexity and reliability of constructing queries programmatically. As witnessed by the many issues around seemingly simple languages like SQL, building up query strings from user-supplied values is easy to get wrong. It is always possible to build friendly query languages on top of a JSON API, but it would IMHO be a waste of developer time to repeatedly have to deal with encoding issues and bugs in each client. This doesn't rule out SPARQL (it has a JSON encoding), but I think it's a significant disadvantage of using a custom string syntax like WDQ in the API.
My take is that we should shoot for SPARQL. In fact, that is the plan, and I'm not convinced changing it now is a good idea. That isn't to say that changing it in the future isn't possible.
Regarding vendor lock-in: there are three known workable implementations of SPARQL (Virtuoso, BlazeGraph, Jena). There is a SPARQL implementation on top of Gremlin-compatible databases as well, but I doubt it's efficient. In other words, if BlazeGraph goes the way of Titan, we have other options. Some would be way more work than others, but we have options. If we can't use Virtuoso or BlazeGraph for some reason, we'll be upset and have to do lots of work, but I can live with that risk.
Regarding attack surface: yeah, that is a risk. It's one of the first risk areas we're going to have to address. I doubt we'll address it in the next month, but we'll get there. And if it's unsecurable, then we'll have to change course. We'll come back to the mailing list and do this all again. But Virtuoso already does this, so we know it's possible. The fact that we'd have to contribute this to BlazeGraph is the most compelling argument for using Virtuoso, but I won't get into that here. It's an important point but a distraction from this topic, I think.
Regarding timeouts: I don't think they limit the power, especially when we can time out the user and then give them instructions on how to run their query in a way without timeouts, like by building a SPARQL endpoint in Labs or locally. This needs to be something that is easy for users to set up anywhere. We had a proof of concept of just that with Titan. It's very possible. I want users to be able to take the same query that they'd run against a public endpoint and run it against a local endpoint.
Regarding hierarchy flattening: This is right in the wheelhouse of inference rules you can drop into triple stores.
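For instance, rather than maintaining a denormalized typeOf index, a triple store can evaluate the closure at query time with a property path (or precompute it with an inference rule); a query-time sketch, with the usual caveat that the prefixes and IDs are assumed:

```sparql
# Query-time "flattening" of the class hierarchy via a property path,
# instead of a precomputed denormalized index.
PREFIX wd:  <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT ?item WHERE {
  ?item wdt:P31/wdt:P279* wd:Q12280 .  # anything under "bridge", transitively
}
```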
Regarding programs building queries: this objection makes sense to me. Sesame has pretty good support for building SPARQL queries. The extension to build RDR queries is about 80 lines of Java including comments, but that only works if you are in Java. MQL is going to be easier to generate. I think, though, that we (the community of people that want to use this thing) can handle this complexity. At worst we'll encourage people to do horrible string manipulation for queries. At best we'll end up with better RDF libraries in more languages. I imagine that's less work all around than us implementing MQL, but I can surely be convinced I'm wrong.
All in all, I think we should stick with SPARQL as the goal and only change course if it looks like it won't work. I think that is a healthy thing to do. We (now I mean the folks working on the query service) just have to be vigilant about ways in which SPARQL won't work, particularly around the attack surface issue.
Nik
Hi!
I wrote a small translation tool from WDQ to SPARQL, which can be seen here: http://tools.wmflabs.org/wdq2sparql/
Currently it supports only one model of data and only a subset of WDQ syntax, but this can be extended. I wrote it just as a PoC to see how hard it would be (not too hard) and to see which kinds of queries would be produced in SPARQL. Don't put too much trust in the exact names of the properties and entities; they are just used as examples now, but eventually (if the tool proves useful) will be replaced with real ones.
BTW, speaking of the worries about being able to produce "heavy" SPARQL queries, WDQ has tree and web operators which can produce very expensive queries, and so can simple OR clauses, since they require unions which can be very expensive.
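For example, a WDQ disjunction like `claim[31:5] OR claim[31:146]` would presumably come out as a UNION, and each branch adds a full pattern to evaluate (the IDs here are just illustrative):

```sparql
# Sketch: WDQ "claim[31:5] OR claim[31:146]" as a SPARQL UNION.
PREFIX wd:  <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT ?item WHERE {
  { ?item wdt:P31 wd:Q5 . }     # one branch of the OR
  UNION
  { ?item wdt:P31 wd:Q146 . }   # the other branch
}
```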
If you play with it and notice some syntax that is supposed to work (see the supported list on the page) please tell me. Other comments/thoughts/suggestions also welcome.
Thanks, that looks great!
On Tue, Mar 17, 2015 at 5:59 AM Stas Malyshev smalyshev@wikimedia.org wrote:
Hi!
I wrote a small translation tool from WDQ to SPARQL, which can be seen here: http://tools.wmflabs.org/wdq2sparql/
Currently it supports only one model of data and only a subset of WDQ syntax, but this can be extended. I wrote it just as a PoC to see how hard it would be (not too hard) and to see which kinds of queries would be produced in SPARQL. Don't put too much trust in the exact names of the properties and entities; they are just used as examples now, but eventually (if the tool proves useful) will be replaced with real ones.
BTW, speaking of the worries about being able to produce "heavy" SPARQL queries, WDQ has tree and web operators which can produce very expensive queries, and so can simple OR clauses, since they require unions which can be very expensive.
If you play with it and notice some syntax that is supposed to work (see the supported list on the page) please tell me. Other comments/thoughts/suggestions also welcome. -- Stas Malyshev smalyshev@wikimedia.org
Hi!
So, my proposal is to expose a WDQ-like service as our primary query interface. This follows the general principle of having narrow interfaces to make it easy to swap out the implementation.
The WDQ query language is somewhat limited, as I understand it. We can of course write a WDQ->SPARQL translator, I imagine; I even think it's a good idea to do this for BC/transition reasons. But I'm not sure that restricting ourselves to only this syntax, created with a different tool in mind, is the best idea. People will ask us for full SPARQL as soon as they know we're running a SPARQL db.
In terms of development resources and timeline, exposing WDQ may actually get us a public query endpoint more quickly: sandboxing full SPARQL may likely turn out to be a lot harder than sandboxing the more limited set of queries WDQ allows.
Well, there's the other side of implementing the actual WDQ feature set in SPARQL, which may take some time, since these languages are a bit different (esp. in tree traversal aspects). But I think WDQ language support should be on the agenda; I'm just not sure it should be the first item.
Am 10.03.2015 um 21:09 schrieb Stas Malyshev:
People will ask us for full SPARQL as soon as they know we're running a SPARQL db.
Sure. And I'd tell them: "you can use SPARQL on Labs, but beware that it may change or go away".
How long has WDQ been in service? What proportion of the total aggregate lifetime of Wikidata apps, presuming Wikidata survives, do the current (as of March 2015) apps represent?
Should the question of premature optimization (or optimisation) be considered?
Tom
p.s. Since your opinion doesn't represent the official team position, what, exactly, *IS* the official team position?
p.p.s. I don't disagree that there are strong negative aspects to using SPARQL, but you weaken your argument by saying that the status quo is the only way forward
On Tue, Mar 10, 2015 at 10:31 AM, Daniel Kinzler < daniel.kinzler@wikimedia.de> wrote:
Hi all!
After the initial enthusiasm, I have grown increasingly wary of the prospect of exposing a SPARQL endpoint as Wikidata's canonical query interface. I decided to share my (personal and unfinished) thoughts about this on this list, as food for thought and a basis for discussion.
Basically, I fear that exposing SPARQL will lock us in with respect to the backend technology we use. Once it's there, people will rely on it, and taking it away would be very harsh. That would make it practically impossible to move to, say, Neo4J in the future. This is even more true if we expose vendor-specific extensions like RDR/SPARQL*.
Also, exposing SPARQL as our primary query interface probably means abruptly discontinuing support for WDQ. It's pretty clear that the original WDQ service is not going to be maintained once the WMF offers infrastructure for Wikidata queries. So, when SPARQL appears, WDQ would go away, and dozens of tools would need major modifications, or would just die.
So, my proposal is to expose a WDQ-like service as our primary query interface. This follows the general principle of having narrow interfaces to make it easy to swap out the implementation.
But the power of SPARQL should not be lost: A (sandboxed) SPARQL endpoint could be exposed to Labs, just like we provide access to replicated SQL databases there: on Labs, you get "raw" access, with added performance and flexibility, but no guarantees about interface stability.
In terms of development resources and timeline, exposing WDQ may actually get us a public query endpoint more quickly: sandboxing full SPARQL may likely turn out to be a lot harder than sandboxing the more limited set of queries WDQ allows.
Finally, why WDQ and not something else, say, MQL? Because WDQ is specifically tailored to our domain and use case, and there already is an ecosystem of tools that use it. We'd want to refine it a bit, I suppose, but by and large it's pretty much exactly what we need, because it was built around the actual demand for querying Wikidata.
So far my current thoughts. Note that this is not a decision or recommendation by the Wikidata team, just my personal take.
-- daniel
-- Daniel Kinzler Senior Software Developer
Wikimedia Deutschland Gesellschaft zur Förderung Freien Wissens e.V.
On Wed, Mar 11, 2015 at 4:52 AM Tom Morris tfmorris@gmail.com wrote:
How long has WDQ been in service?
Before September 2013. So, 1.5-2 years.