Positive thoughts on exposing a SPARQL endpoint - Wikidata-tech

12 Mar 2015

Dear Wikidata Developers,

I am very pleased to read of your intention to deploy a SPARQL-based API 
for Wikidata. I am one of the developers behind sparql.uniprot.org, a 
public, free-to-use SPARQL endpoint (with content under CC-ND, soon to be 
made more liberal). UniProt can be seen as an encyclopaedia of facts 
about proteins, and we have about 90 million pages of entries, e.g. 
http://www.uniprot.org/uniprot/P49407 (which corresponds to 
http://en.wikipedia.org/wiki/Arrestin). These 90 million entries in 
UniProtKB (and another 200 million or so in our supporting related 
datasets) lead to a bit more than 19 billion unique triples.

We currently use Virtuoso, but I completely understand why you selected 
BlazeGraph; it seems to me a better fit for your mission and deployment 
scenario. I evaluated what was then called BigData 3 years ago and would 
have selected it if we had not needed to go for vertical scaling, having 
just 1U available in our computer room.

I also understand the worries that you have about being able to run a 
service that is resilient to DoS (intentional or not). We have not yet 
needed any mitigation strategy more complex than banning an 
IP+user-agent string. However, we do have a few ideas about how to make 
that automatic, even if we don’t have the time to work on them yet. Our 
query timeouts are generously above what a sparsely used HTTP connection 
supports in practice. You should also not be worried about GRAPH 
traversals; in our experience simple DISTINCT queries are much more 
painful. For example, the common 
SELECT (COUNT(DISTINCT ?subject) AS ?count) WHERE { ?subject ?predicate ?object } 
that some people like to send daily can be a challenge on large 
databases. We could institute draconian timeouts, but we don’t, because 
we want to get the difficult queries: the simple ones our users can do 
on our main site, but the analytical ones require a solution such as 
SPARQL.

Maintaining a SPARQL endpoint for public use requires you to focus on 
client management, not query management. You can have one client whose 
queries are relatively easy, but who happily sends you 40,000 of them 
per second. Others send you a query that looks complicated and expensive 
but is actually very selective and ends up taking next to no time at 
all.
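
To make "client management" a bit more concrete, here is a minimal 
Python sketch of counting requests per client, keyed on the 
IP+user-agent pair mentioned above. It is only an illustration, not 
what runs at sparql.uniprot.org: the class name, the limits and the 
sliding-window approach are all assumptions.

```python
from collections import defaultdict, deque


class ClientRateTracker:
    """Hypothetical per-client limiter: at most `max_requests` per window."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Timestamps of recently accepted requests, per (ip, user_agent).
        self._times = defaultdict(deque)

    def allow(self, ip: str, user_agent: str, now: float) -> bool:
        """Return True if this client may send a request at time `now`."""
        times = self._times[(ip, user_agent)]
        # Forget requests that have fallen out of the sliding window.
        while times and now - times[0] >= self.window_seconds:
            times.popleft()
        if len(times) >= self.max_requests:
            return False  # this client is over its limit; others are unaffected
        times.append(now)
        return True
```

The point of keying on the client rather than the query is that the 
cheap-but-relentless client gets throttled while everybody else, 
including people with genuinely hard queries, is left alone.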

One of the solutions that we will look into long term is a query manager 
in front of our endpoint. I will try to explain the idea behind that, but 
if it’s fuzzy just ask and I will try to explain better ;) Assume you 
have a CPU budget available for hosting your SPARQL endpoint, let’s say 
100 ticks per day. If you have one user, that user should be able to get 
all 100 ticks. If you have one hundred users, each user gets 1 tick. If 
you limit your one user to 1 tick, you have wasted 99 ticks. If you let 
the first user use 100 ticks and the other ninety-nine users their 1 
each, you have broken the budget by 99 ticks. One solution is therefore 
to allow clients to optimistically run a query for longer than their 
ticks, but to kick them off the moment other clients arrive. For 
example, complex queries run OK on Sunday evening but not on Monday at 
10am.
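
The tick arithmetic above can be sketched in a few lines of Python. 
This is just a toy model under my own assumptions: the function names 
are made up, and an equal split of the budget among the clients that 
are active right now is the simplest possible policy, not necessarily 
the one you would deploy.

```python
def tick_allowance(budget: int, active_clients: int) -> int:
    """Fair share of the budget while `active_clients` are competing."""
    if active_clients < 1:
        raise ValueError("need at least one active client")
    return budget // active_clients


def should_preempt(used_ticks: int, budget: int, active_clients: int) -> bool:
    """Kick a client off only if others are waiting AND it is over its share."""
    if active_clients <= 1:
        return False  # nobody else wants the CPU, so let the query run on
    return used_ticks >= tick_allowance(budget, active_clients)
```

With a budget of 100, a lone client may keep going after 50 ticks, 
whereas the same client is stopped as soon as ninety-nine others show 
up and its fair share drops to 1 tick.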

You can do this by having a service in front of your API that takes 
SPARQL queries and queues them for execution in a priority queue. It 
sends back a 303 response with Retry-After and Location headers, so that 
clients poll for the result instead of holding open an HTTP connection 
that dies. The priority queue ensures everyone gets a turn to run their 
queries, by reducing the priority of people who are already running a 
query, etc. The query management service can then see whether a query 
returned within its tick and whether anyone else has scheduled a query. 
If someone else has a query outstanding, the query manager tells the 
SPARQL engine to stop working on the query and send back any results if 
possible. I don’t know if BlazeGraph allows this, but I know how to do 
it for GraphDB and have an idea about how to do it in Virtuoso. I am, 
however, sure that Systap can add this to BlazeGraph on short notice. 
Such a query manager will deal with all common forms of DoS attacks.
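
A minimal Python sketch of the priority queue part of such a query 
manager follows. It covers only the queueing, not the HTTP 303 handling 
or the cancelling of running queries. The class and method names are 
hypothetical, and "priority drops with every query a client has already 
submitted" is one assumed way to implement the idea, not a description 
of an existing system.

```python
import heapq
import itertools
from collections import defaultdict


class QueryQueue:
    """Hypothetical fair queue: busy clients sink behind newcomers."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()        # tie-breaker: arrival order
        self._submitted = defaultdict(int)   # queries seen so far, per client

    def submit(self, client: str, query: str) -> None:
        # Lower number = served earlier; each extra query from the same
        # client gets a worse (higher) priority number.
        priority = self._submitted[client]
        self._submitted[client] += 1
        heapq.heappush(self._heap, (priority, next(self._seq), client, query))

    def has_waiting(self) -> bool:
        """True if someone is waiting, i.e. a running query may be preempted."""
        return bool(self._heap)

    def next_query(self):
        """Return the (client, query) pair to run next, or None if empty."""
        if not self._heap:
            return None
        _, _, client, query = heapq.heappop(self._heap)
        return client, query
```

If client A submits three queries and client B then submits one, B is 
served right after A's first query instead of waiting behind all three.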

Now to the advantages of SPARQL over the other options, Gremlin 
included: it is an open standard that is deployed in the wild for use 
over HTTP by complete strangers, not just in academia but also e.g. at 
http://open-data.europa.eu/en/linked-data, so it’s more likely that you 
can combine data in primary resources with data in Wikidata. Also, you 
won’t be the only ones worrying about attacks on your public endpoint, 
and you will have a larger community to share knowledge and code with.

On the reification business: we use it extensively, and in our case 
about 15% of our triples are in reification quads. But that is the 
surface representation; how it is actually stored on disk (or 
materialised on demand) is completely different and not something you 
should worry about at this time. Most of the time you don’t need 
reification, and it can often be avoided by better modelling. For 
example, in our archive of protein sequences we used to use reification 
to note that a protein sequence can no longer be found in a database 
entry. Now we model such a database entry just like all the active 
entries; we just give it a new unique URI and an rdf:type to say it’s 
obsolete and that it corresponds to a version of an active database 
entry. When teaching SPARQL to scientists I call this way of thinking 
“model the measurement result, not the conclusion”. For example, 
instead of saying

example:Jerven example:length_in_cm 197 .

model

example:Jerven
    example:measured_length
        [ example:length_measurement_result_in_cm 197 ;
          example:measurement_date "2015-02-27"^^xsd:date ] ,
        [ example:length_measurement_result_in_cm 60 ;
          example:measurement_date "1983-06-21"^^xsd:date ] .

As you can see, in the second model you don’t have to worry about 
invalidating data, as each measurement stays correct. You would have 
been forced to model like this in practice with Titan, and you should do 
the same in RDF. Just because reification is available does not mean you 
must use it ;)

I hope that this mail was constructive and that some of your worries 
will be lessened by knowing that there are (possible) solutions for 
maintaining a reliable service.

Regards,
Jerven

PS. There are more open source options for SPARQL, e.g. RDF4J.

-- 
Jerven Tjalling Bolleman
SIB | Swiss Institute of Bioinformatics
CMU - 1, rue Michel Servet - 1211 Geneva 4
t: +41 22 379 58 85 - f: +41 22 379 58 58
Jerven.Bolleman(a)isb-sib.ch - http://www.isb-sib.ch