On Wed, Mar 11, 2015 at 4:50 PM, Gabriel Wicke <gwicke(a)wikimedia.org> wrote:
My basic worries with exposing powerful query
languages like SPARQL
publicly are that a) there is a large attack surface in the query processing
backend, and b) a client can request very expensive operations on the
server without performing much work itself. Timeouts can limit the damage,
but if they are set reasonably low (<1 min) they will also eliminate some
of the supposed power of SPARQL, especially if the data set grows at the
rate we all hope for. When reaching the timeout, the client needs to switch
to iterative processing and paging. How well does blazegraph support paging
of complex SPARQL queries without re-calculating the entire result set?
One of the things I like about the MQL design is that they are careful
about identifying a couple of main hierarchies (typeOf, geographical
containment, taxonomies, …) that they can efficiently flatten into
denormalized plain index lookups. These are very fast and easy to page.
From what I have seen, they also seem to directly cover most of the use
cases that people have come up with so far. While perhaps too limiting in
the longer term, I think such a limited 80/20 design would be a better
starting point for a high-volume public API with strong availability and
response time guarantees. The efficient subset of the API could then be
enriched with more expensive end points over time, but those would
explicitly not have the same performance guarantees as the core API. Those
expensive queries could be executed on a separate cluster / set of machines
to avoid interference with the core API.
Another aspect that I think warrants serious attention for an API is the
complexity and reliability of constructing queries programmatically. As
witnessed by the many issues around seemingly simple languages like SQL,
building up query strings from user-supplied values is easy to get wrong.
It is always possible to build friendly query languages on top of a JSON
API, but it would IMHO be a waste of developer time to repeatedly have to
deal with encoding issues and bugs in each client. This doesn't rule out
SPARQL (it has a JSON encoding), but I think it's a significant
disadvantage of using a custom string syntax like WDQ in the API.
My take is that we should shoot for SPARQL. In fact, that is the plan and
I'm not convinced changing it now is a good idea. That isn't to say that
changing it in the future isn't possible.
Regarding vendor lock in: There are three known workable implementations of
SPARQL (Virtuoso, BlazeGraph, Jena). There is a SPARQL implementation on
top of Gremlin-compatible databases as well, but I doubt it's efficient. In
other words, if BlazeGraph goes the way of Titan we have other options.
Some would be way more work than others, but we have options. If we can't
use Virtuoso or BlazeGraph for some reason we'll be upset and have to do
lots of work but I can live with that risk.
Regarding attack surface: yeah. That is a risk. It's one of the first risk
areas we're going to have to address. I doubt we'll address it in the next
month, but we'll get there. And if it's unsecurable then we'll have to
change course. We'll come back to the mailing list and do this all again.
But Virtuoso already does this, so we know it's possible. The fact that we'll
have to contribute this to BlazeGraph is the most compelling argument for
using Virtuoso, but I won't get into that here. It's an important point but
a distraction from this topic, I think.
Regarding timeouts: I don't think they limit the power, especially when we
can time out the user and then give them instructions on how to run their
query in a way that avoids timeouts, like by building a SPARQL endpoint in
labs or locally. This needs to be something that is easy for users to set
up anywhere. We had a proof of concept of just that with Titan. It's
very possible. I want users to be able to take the same query that they'd
run against a public endpoint and run it against a local endpoint.
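To make that switch to paging concrete, here is a minimal client-side sketch. The page size and the example query are made-up assumptions; the only SPARQL features used are the standard LIMIT/OFFSET solution modifiers, and the same generated queries would run against a public or a local endpoint.

```python
# Sketch of client-side paging for a long-running SPARQL query: instead of
# one expensive request that hits the server timeout, the client walks the
# result set in fixed-size pages using LIMIT/OFFSET. PAGE_SIZE and the
# example query are illustrative, not part of the proposal.

PAGE_SIZE = 500  # assumed page size, tune per endpoint

def paged_queries(base_query, page_size=PAGE_SIZE):
    """Yield successive SPARQL query strings, one per page.

    base_query must not already contain LIMIT/OFFSET, and needs a stable
    ORDER BY so that OFFSET paging returns consistent, non-overlapping pages.
    """
    offset = 0
    while True:
        yield "%s\nLIMIT %d OFFSET %d" % (base_query, page_size, offset)
        offset += page_size

# Example: the first two page queries for a simple triple pattern.
query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } ORDER BY ?s"
pages = paged_queries(query)
first = next(pages)
second = next(pages)
```

One caveat that ties back to Gabriel's question: with naive OFFSET paging most stores re-evaluate the query for each page, which is exactly the recomputation cost he raises, so server-side cursors or keyset-style paging may be needed for deep pages.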
Regarding hierarchy flattening: This is right in the wheelhouse of
inference rules you can drop into triple stores.
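As a toy illustration of what such an inference rule materializes: given direct "contained in" edges, precompute the transitive closure so that hierarchy queries become flat, indexable lookups instead of recursive traversals. The data and function here are invented for the example; a real triple store would do this with forward-chaining rules or property paths, not Python.

```python
# Toy materialization of a geographical-containment hierarchy: compute the
# transitive closure of direct edges so "is X (transitively) inside Y?"
# becomes a single flat lookup, which is fast and easy to page.

def transitive_closure(edges):
    """edges: set of (child, parent) pairs; returns the full closure."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

# Made-up direct containment facts:
contained_in = {("Mission District", "San Francisco"),
                ("San Francisco", "California"),
                ("California", "USA")}
flat = transitive_closure(contained_in)
# After materialization, ("Mission District", "USA") is a direct fact.
```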
Regarding programs building queries: This objection makes sense to me.
Sesame has pretty good support for building SPARQL queries. The extension
to build RDR queries is about 80 lines of Java including comments. But
that only works if you are in Java. MQL is going to be easier to
generate. I think, though, that we (the community of people that want to
use this thing) can handle this complexity. At worst we'll encourage
people to do horrible string manipulation for queries. At best we'll end
up with better RDF libraries in more languages. I imagine that's less work
all around than us implementing MQL, but I can surely be convinced I'm wrong.
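For reference, the string-manipulation hazard (and the minimal escaping needed to avoid it) can be sketched in a few lines; `escape_literal` is a hypothetical helper written for this illustration, not any real client's API, and it follows the SPARQL string-literal escape rules for backslash, quote, and newline.

```python
# Sketch of the injection hazard with hand-built SPARQL query strings.
# escape_literal is a made-up helper for illustration; production code
# should use a proper query builder (e.g. Sesame's, in Java).

def escape_literal(value):
    """Escape a user-supplied string for use as a SPARQL string literal."""
    return '"%s"' % (value.replace("\\", "\\\\")
                          .replace('"', '\\"')
                          .replace("\n", "\\n"))

user_input = 'x" } UNION { ?s ?p ?o } #'   # hostile input

# Naive interpolation lets the input rewrite the query's structure:
unsafe = 'SELECT ?s WHERE { ?s rdfs:label "%s" }' % user_input

# Escaped, the same input stays an inert string literal:
safe = "SELECT ?s WHERE { ?s rdfs:label %s }" % escape_literal(user_input)
```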
All in all, I think we should stick with SPARQL as the goal and only change
course if it looks like it won't work. I think that is a healthy thing to
do. We (now I mean the folks working on the query service) just have to be
vigilant of ways in which SPARQL won't work. Particularly around the
attack surface issue.