On Wed, Mar 11, 2015 at 4:50 PM, Gabriel Wicke <gwicke(a)wikimedia.org> wrote:
My basic worries with exposing powerful query
languages like SPARQL
publicly are that a) there is a large attack surface in the query processing
backend, and b) a client can request very expensive operations on the
server without performing much work itself. Timeouts can limit the damage,
but if they are set reasonably low (<1 min) they will also eliminate some
of the supposed power of SPARQL, especially if the data set grows at the
rate we all hope for. When reaching the timeout, the client needs to switch
to iterative processing and paging. How well does blazegraph support paging
of complex SPARQL queries without re-calculating the entire result set?
One of the things I like about the MQL design is that they are careful
about identifying a couple of main hierarchies (typeOf, geographical
containment, taxonomies, …) that they can efficiently flatten into
denormalized plain index lookups. These are very fast and easy to page.
From what I have seen, they also seem to directly cover most of the use
cases that people have come up with so far. While perhaps too limiting in
the longer term, I think such a limited 80/20 design would be a better
starting point for a high-volume public API with strong availability and
response time guarantees. The efficient subset of the API could then be
enriched with more expensive end points over time, but those would
explicitly not have the same performance guarantees as the core API. Those
expensive queries could be executed on a separate cluster / set of machines
to avoid interference with the core API.
Another aspect that I think warrants serious attention for an API is the
complexity and reliability of constructing queries programmatically. As
witnessed by the many issues around seemingly simple languages like SQL,
building up query strings from user-supplied values is easy to get wrong.
It is always possible to build friendly query languages on top of a JSON
API, but it would IMHO be a waste of developer time to repeatedly have to
deal with encoding issues and bugs in each client. This doesn't rule out
SPARQL (it has a JSON encoding), but I think it's a significant
disadvantage of using a custom string syntax like WDQ in the API.
My take is that we should shoot for SPARQL. In fact, that is the plan and
I'm not convinced changing it now is a good idea. That isn't to say that
changing it in the future isn't possible.
Regarding vendor lock in: There are three known workable implementations of
SPARQL (Virtuoso, BlazeGraph, Jena). There is a SPARQL implementation on
top of Gremlin-compatible databases as well, but I doubt it's efficient. In
other words, if BlazeGraph goes the way of Titan we have other options.
Some would be way more work than others, but we have options. If we can't
use Virtuoso or BlazeGraph for some reason we'll be upset and have to do
lots of work but I can live with that risk.
Regarding attack surface: yeah. That is a risk. It's one of the first risk
areas we're going to have to address. I doubt we'll address it in the next
month, but we'll get there. And if it's unsecurable then we'll have to
change course. We'll come back to the mailing list and do this all again.
But Virtuoso already does this, so we know it's possible. The fact that we'll
have to contribute this to BlazeGraph is the most compelling argument for
using Virtuoso, but I won't get into that here. It's an important point but
a distraction from this topic, I think.
Regarding timeouts: I don't think they limit the power, especially when we
can time out the user and then give them instructions on how to run their
query in a way that avoids timeouts, like by building a SPARQL endpoint in
labs or locally. This needs to be something that is easy for users to set
up anywhere. We had a proof of concept of just that with Titan. It's
very possible. I want users to be able to take the same query that they'd
run against a public endpoint and run it against a local endpoint.
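To make that switch to paging concrete, here is a minimal client-side sketch. The page size and the example query are made-up assumptions; the only SPARQL features used are the standard LIMIT/OFFSET solution modifiers, and the same generated queries would run against a public or a local endpoint.

```python
# Sketch of client-side paging for a long-running SPARQL query: instead of
# one expensive request that hits the server timeout, the client walks the
# result set in fixed-size pages using LIMIT/OFFSET. PAGE_SIZE and the
# example query are illustrative, not part of the proposal.

PAGE_SIZE = 500  # assumed page size, tune per endpoint

def paged_queries(base_query, page_size=PAGE_SIZE):
    """Yield successive SPARQL query strings, one per page.

    base_query must not already contain LIMIT/OFFSET, and needs a stable
    ORDER BY so that OFFSET paging returns consistent, non-overlapping pages.
    """
    offset = 0
    while True:
        yield "%s\nLIMIT %d OFFSET %d" % (base_query, page_size, offset)
        offset += page_size

# Example: the first two page queries for a simple triple pattern.
query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } ORDER BY ?s"
pages = paged_queries(query)
first = next(pages)
second = next(pages)
```

One caveat that ties back to Gabriel's question: with naive OFFSET paging most stores re-evaluate the query for each page, which is exactly the recomputation cost he raises, so server-side cursors or keyset-style paging may be needed for deep pages.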
Regarding hierarchy flattening: This is right in the wheelhouse of
inference rules you can drop into triple stores.
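As a toy illustration of what such an inference rule materializes: given direct "contained in" edges, precompute the transitive closure so that hierarchy queries become flat, indexable lookups instead of recursive traversals. The data and function here are invented for the example; a real triple store would do this with forward-chaining rules or property paths, not Python.

```python
# Toy materialization of a geographical-containment hierarchy: compute the
# transitive closure of direct edges so "is X (transitively) inside Y?"
# becomes a single flat lookup, which is fast and easy to page.

def transitive_closure(edges):
    """edges: set of (child, parent) pairs; returns the full closure."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

# Made-up direct containment facts:
contained_in = {("Mission District", "San Francisco"),
                ("San Francisco", "California"),
                ("California", "USA")}
flat = transitive_closure(contained_in)
# After materialization, ("Mission District", "USA") is a direct fact.
```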
Regarding programs building queries: This objection makes sense to me.
Sesame has pretty good support for building SPARQL queries. The extension
to build RDR queries is about 80 lines of Java including comments. But
that only works if you are in Java. MQL is going to be easier to
generate. I think, though, that we (the community of people that want to
use this thing) can handle this complexity. At worst we'll encourage
people to do horrible string manipulation for queries. At best we'll end
up with better RDF libraries in more languages. I imagine that's less work
all around than us implementing MQL, but I can surely be convinced I'm wrong.
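For reference, the string-manipulation hazard (and the minimal escaping needed to avoid it) can be sketched in a few lines; `escape_literal` is a hypothetical helper written for this illustration, not any real client's API, and it follows the SPARQL string-literal escape rules for backslash, quote, and newline.

```python
# Sketch of the injection hazard with hand-built SPARQL query strings.
# escape_literal is a made-up helper for illustration; production code
# should use a proper query builder (e.g. Sesame's, in Java).

def escape_literal(value):
    """Escape a user-supplied string for use as a SPARQL string literal."""
    return '"%s"' % (value.replace("\\", "\\\\")
                          .replace('"', '\\"')
                          .replace("\n", "\\n"))

user_input = 'x" } UNION { ?s ?p ?o } #'   # hostile input

# Naive interpolation lets the input rewrite the query's structure:
unsafe = 'SELECT ?s WHERE { ?s rdfs:label "%s" }' % user_input

# Escaped, the same input stays an inert string literal:
safe = "SELECT ?s WHERE { ?s rdfs:label %s }" % escape_literal(user_input)
```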
All in all, I think we should stick with SPARQL as the goal and only change
course if it looks like it won't work. I think that is a healthy thing to
do. We (now I mean the folks working on the query service) just have to be
vigilant of ways in which SPARQL won't work. Particularly around the
attack surface issue.