On Wed, Mar 11, 2015 at 4:50 PM, Gabriel Wicke <gwicke@wikimedia.org> wrote:
My basic worries with exposing powerful query languages like SPARQL publicly are that a) there is a large attack surface in the query processing backend, and b) a client can request very expensive operations on the server without performing much work itself. Timeouts can limit the damage, but if they are set reasonably low (<1 min) they will also eliminate some of the supposed power of SPARQL, especially if the data set grows at the rate we all hope for. When it hits the timeout, the client needs to switch to iterative processing and paging. How well does Blazegraph support paging of complex SPARQL queries without re-calculating the entire result set?

One of the things I like about the MQL design is that they are careful about identifying a couple of main hierarchies (typeOf, geographical containment, taxonomies, ?) that they can efficiently flatten into denormalized plain index lookups. These are very fast and easy to page. From what I have seen, they also seem to directly cover most of the use cases that people have come up with so far. While perhaps too limiting in the longer term, I think such a limited 80/20 design would be a better starting point for a high-volume public API with strong availability and response time guarantees. The efficient subset of the API could then be enriched with more expensive endpoints over time, but those would explicitly not have the same performance guarantees as the core API. Those expensive queries could be executed on a separate cluster / set of machines to avoid interference with the core API.

Another aspect that I think warrants serious attention for an API is the complexity and reliability of constructing queries programmatically. As witnessed by the many issues around seemingly simple languages like SQL, building up query strings from user-supplied values is easy to get wrong. It is always possible to build friendly query languages on top of a JSON API, but it would IMHO be a waste of developer time to repeatedly have to deal with encoding issues and bugs in each client. This doesn't rule out SPARQL (it has a JSON encoding), but I think it's a significant disadvantage of using a custom string syntax like WDQ in the API.



My take is that we should shoot for SPARQL.  In fact that is the plan, and I'm not convinced changing it now is a good idea.  That isn't to say that changing it in the future isn't possible.

Regarding vendor lock-in: There are three known workable implementations of SPARQL (Virtuoso, BlazeGraph, Jena).  There is a SPARQL implementation on top of Gremlin-compatible databases as well, but I doubt it's efficient.  In other words, if BlazeGraph goes the way of Titan we have other options.  Some would be way more work than others, but we have options.  If we can't use Virtuoso or BlazeGraph for some reason we'll be upset and have to do lots of work, but I can live with that risk.


Regarding attack surface: yeah.  That is a risk.  It's one of the first risk areas we're going to have to address.  I doubt we'll address it in the next month, but we'll get there.  And if it's unsecurable then we'll have to change course.  We'll come back to the mailing list and do this all again.  But Virtuoso already does this, so we know it's possible.  The fact that we'd have to contribute this to BlazeGraph is the most compelling argument for using Virtuoso, but I won't get into that here.  It's an important point but a distraction from this topic, I think.

Regarding timeouts:  I don't think they limit the power.  Especially when we can time out the user and then give them instructions on how to run their query without timeouts.  Like by building a SPARQL endpoint in labs or locally.  This needs to be something that is easy for users to set up anywhere.  We had a proof of concept of just that with Titan.  It's very possible.  I want users to be able to take the same query that they'd run against a public endpoint and run it against a local endpoint.
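
To make that concrete, here's a rough sketch with Sesame of what I mean (both endpoint URLs are placeholders, not real services): the query text stays identical, and only the repository URL changes.

    import org.openrdf.query.QueryLanguage;
    import org.openrdf.query.TupleQueryResult;
    import org.openrdf.repository.Repository;
    import org.openrdf.repository.RepositoryConnection;
    import org.openrdf.repository.sparql.SPARQLRepository;

    public class EndpointDemo {
        public static void main(String[] args) throws Exception {
            // Placeholder URLs: a hypothetical public endpoint and a local one.
            String[] endpoints = {
                "http://query.example.org/sparql",
                "http://localhost:9999/sparql"
            };
            // Exactly the same query text runs against either endpoint.
            String q = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10";
            for (String url : endpoints) {
                Repository repo = new SPARQLRepository(url);
                repo.initialize();
                RepositoryConnection conn = repo.getConnection();
                try {
                    TupleQueryResult result =
                            conn.prepareTupleQuery(QueryLanguage.SPARQL, q).evaluate();
                    while (result.hasNext()) {
                        System.out.println(result.next());
                    }
                    result.close();
                } finally {
                    conn.close();
                    repo.shutDown();
                }
            }
        }
    }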

Regarding hierarchy flattening:  This is right in the wheelhouse of inference rules you can drop into triple stores.
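
For example (a sketch with made-up vocabulary; ex:locatedIn is purely illustrative): with an inference rule that materializes transitive containment, the hierarchy collapses into a flat, index-friendly lookup, and even without inference a SPARQL 1.1 property path can walk the same hierarchy at query time.

    public class HierarchyQueries {
        // Hypothetical vocabulary: ex:locatedIn is a made-up containment
        // property, used here only for illustration.

        // With an inference rule materializing transitive containment,
        // the hierarchy query becomes a flat, index-friendly lookup:
        static final String FLAT =
                "PREFIX ex: <http://example.org/> "
                + "SELECT ?place WHERE { ?place ex:locatedIn ex:Europe }";

        // Without inference, a SPARQL 1.1 property path walks the
        // hierarchy at query time instead:
        static final String TRANSITIVE =
                "PREFIX ex: <http://example.org/> "
                + "SELECT ?place WHERE { ?place ex:locatedIn+ ex:Europe }";
    }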

Regarding programs building queries:  This objection makes sense to me.  Sesame has pretty good support for building SPARQL queries.  The extension to build RDR queries is about 80 lines of Java including comments.  But that only works if you are in Java.  MQL is going to be easier to generate.  I think, though, that we (the community of people that want to use this thing) can handle this complexity.  At worst we'll encourage people to do horrible string manipulation for queries.  At best we'll end up with better RDF libraries in more languages.  I imagine that's less work all around than us implementing MQL, but I can surely be convinced I'm wrong.
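
For the curious, here's a minimal Sesame sketch (the in-memory store and the "Berlin" label are just for illustration): user-supplied values go in as bindings instead of being concatenated into the query string, which sidesteps the escaping bugs entirely.

    import org.openrdf.query.QueryLanguage;
    import org.openrdf.query.TupleQuery;
    import org.openrdf.query.TupleQueryResult;
    import org.openrdf.repository.Repository;
    import org.openrdf.repository.RepositoryConnection;
    import org.openrdf.repository.sail.SailRepository;
    import org.openrdf.sail.memory.MemoryStore;

    public class BindingDemo {
        public static void main(String[] args) throws Exception {
            Repository repo = new SailRepository(new MemoryStore());
            repo.initialize();
            RepositoryConnection conn = repo.getConnection();
            try {
                String q = "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> "
                        + "SELECT ?s WHERE { ?s rdfs:label ?label }";
                TupleQuery query = conn.prepareTupleQuery(QueryLanguage.SPARQL, q);
                // The user-supplied value is passed as a binding, never
                // concatenated into the query text, so there is nothing
                // to escape and nothing to inject.
                query.setBinding("label",
                        conn.getValueFactory().createLiteral("Berlin", "en"));
                TupleQueryResult result = query.evaluate();
                while (result.hasNext()) {
                    System.out.println(result.next());
                }
                result.close();
            } finally {
                conn.close();
                repo.shutDown();
            }
        }
    }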



All in all I think we should stick with SPARQL as the goal and only change course if it looks like it won't work.  I think that is a healthy thing to do.  We (now I mean the folks working on the query service) just have to be vigilant about the ways in which SPARQL won't work.  Particularly around the attack surface issue.

Nik