Dear WikiData Developers,
I am very pleased to read of your intention to deploy a SPARQL-based
API for WikiData. I am one of the developers behind
sparql.uniprot.org, a public, free-to-use SPARQL endpoint (with
content under CC-ND, soon to be made more liberal). UniProt can be
seen as an encyclopaedia of facts about proteins, and we have about
90 million entries, e.g. http://www.uniprot.org/uniprot/P49407
(corresponding to http://en.wikipedia.org/wiki/Arrestin). These 90
million entries in UniProtKB (and another 200 million or so in our
supporting related datasets) lead to a bit more than 19 billion
unique triples.
We currently use Virtuoso, but I completely understand why you
selected BlazeGraph; it seems to me a better fit for your mission and
deployment scenario. I evaluated what was then BigData three years
ago and would have selected it if we had not needed to go for
vertical scaling, having just 1U available in our computer room.
I also understand the worries you have about being able to run a
service that is resilient to DoS (intentional or not). We have not
yet needed any mitigation strategy more complex than banning an
IP + user-agent string combination. We do have a few ideas about how
to make that automatic, though we have not yet had the time to work
on them. Our query timeouts are generously above what a sparsely
used HTTP connection supports in practice. You should also not be
worried about GRAPH traversals; in our experience, simple DISTINCT
queries are much more painful. For example, the common

  SELECT (COUNT(DISTINCT ?subject) AS ?count)
  WHERE { ?subject ?predicate ?object }

that some people like to send daily can be a challenge on a large
database. We could institute draconian timeouts, but we don't,
because we want to receive the difficult queries: the simple ones our
users can run on our main site, but the analytical ones require a
solution such as SPARQL.
Maintaining a SPARQL endpoint for public use requires you to focus on
client management, not query management. You can have one client with
relatively easy queries who happily sends you 40,000 of them per
second. Another sends a query that looks complicated and expensive
but is actually very selective and ends up taking next to no time at
all.
One of the solutions that we will look into long term is a query
manager in front of our endpoint. I will try to explain the idea
behind it, but if it's fuzzy just ask and I will try to explain
better ;) Assume you have a CPU budget available for hosting your
SPARQL endpoint; let's say 100 ticks per day. If you have one user,
that user should be able to get all 100 ticks. If you have one
hundred users, each user gets 1 tick. If you limit your one user to
1 tick, you have wasted 99 ticks. If you let the first user use all
100 ticks and the other ninety-nine users their 1 tick each, you have
broken the budget by 99 ticks. One solution is therefore to allow
clients to optimistically run a query for longer than their ticks,
but the moment other clients arrive they get kicked off: complex
queries run fine on Sunday evening but not on Monday at 10am.
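The tick arithmetic above can be condensed into two tiny functions
(names are mine, for illustration only): the guaranteed fair share,
and the preemption rule for a query that has optimistically run past
it:

```python
def tick_allowance(total_ticks, active_clients):
    """Guaranteed per-client slice of the CPU budget: the full budget
    when a client is alone, an equal share otherwise."""
    return total_ticks / max(1, active_clients)


def must_preempt(used_ticks, total_ticks, active_clients):
    """A query may optimistically run beyond its fair share, but is
    kicked off once other clients actually arrive and it is over its
    reduced allowance."""
    return (active_clients > 1
            and used_ticks > tick_allowance(total_ticks, active_clients))
```

With a 100-tick budget, a lone client keeps all 100 ticks and is
never preempted; as soon as three more clients show up, any query
already past its 25-tick share gets stopped.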
You can do this by having a service in front of your API that takes
SPARQL queries and queues them for execution in a priority queue,
sending back a 303 response with Retry-After and Location headers so
that the HTTP connections don't die. The priority queue ensures
everyone gets a turn to run their queries, by reducing the priority
of people who are already running a query, etc. The query management
service can then see whether a query returned within its ticks and
whether anyone else has scheduled a query. If someone else has a
query outstanding, the query manager tells the SPARQL engine to stop
working on the query and to send back any results if possible. I
don't know if BlazeGraph allows this, but I know how to do it for
GraphDB and have an idea about how to do it in Virtuoso; I am,
however, sure that Systap can add this to BlazeGraph on short notice.
Such a query manager will deal with all common forms of DoS attacks.
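The queueing side of such a manager can be sketched in a few lines.
This is an illustration of the idea only — the class, the result-URL
scheme, and the 30-second Retry-After are assumptions of mine, not a
description of any existing service:

```python
import heapq
import itertools


class QueryManager:
    """Front-end sketch: accepted queries go into a priority queue;
    the caller immediately gets a 303 with Location and Retry-After
    headers and polls the result URL later. Clients that already have
    queries queued are deprioritised so everyone gets a turn."""

    def __init__(self, retry_after=30):
        self.retry_after = retry_after
        self.queue = []               # (priority, ticket, client, query)
        self.pending = {}             # client -> queries already queued
        self.seq = itertools.count()  # tie-breaker / result ticket

    def submit(self, client, query):
        # Priority = how many queries this client already has queued,
        # so a client's second query sorts behind everyone's first.
        priority = self.pending.get(client, 0)
        self.pending[client] = priority + 1
        ticket = next(self.seq)
        heapq.heappush(self.queue, (priority, ticket, client, query))
        return {                      # what the HTTP layer would send
            "status": 303,
            "Location": f"/results/{ticket}",
            "Retry-After": str(self.retry_after),
        }

    def next_query(self):
        """Hand the engine the most deserving queued query."""
        priority, ticket, client, query = heapq.heappop(self.queue)
        self.pending[client] -= 1
        return client, query
```

If client A submits two queries and client B then submits one, the
engine is handed A's first, then B's, then A's second — exactly the
"everyone gets a turn" behaviour described above.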
Now to the advantages of SPARQL over the other options, Gremlin
included: it is an open standard that is deployed in the wild for use
over HTTP by complete strangers, not just in academia but also e.g.
at http://open-data.europa.eu/en/linked-data, so it is more likely
that you can combine data in primary resources with data in WikiData.
Also, you won't be the only ones worrying about attacks on your
public endpoint, and you will have a larger community to share
knowledge and code with.
On the reification business: we use it extensively, and in our case
about 15% of our triples are in reification quads. But that is the
surface representation; how it is actually stored on disk (or
materialised on demand) is completely different and not something you
should worry about at this time. Most of the time you don't need
reification, and it can often be avoided by better modelling. For
example, in our archive of protein sequences we used to use
reification to note that a protein sequence can no longer be found in
a database entry. Now we model such a database entry just like all
the other active entries; we simply give it a new unique URI and an
rdf:type saying it is obsolete and that it corresponds to a version
of an active database entry. When teaching SPARQL to scientists I
call this way of thinking "model the measurement result, not the
conclusion". E.g. instead of saying
example:Jerven example:length_in_cm 197 .
model
example:Jerven
  example:measured_length
    [ example:length_measurement_result_in_cm 197 ;
      example:measurement_date "2015-02-27"^^xsd:date ] ,
    [ example:length_measurement_result_in_cm 60 ;
      example:measurement_date "1983-06-21"^^xsd:date ] .
As you can see, in the second model you don't have to worry about
invalidating data, as each measurement stays correct. You would have
been forced to model like this in practice with Titan, and you should
do the same in RDF. Just because reification is available does not
mean you must use it ;)
I hope that this mail was constructive and that some of your worries
are eased knowing that there are (possible) solutions for maintaining
a reliable service.
Regards,
Jerven
PS. There are more open-source options for SPARQL, e.g. RDF4J.
--
Jerven Tjalling Bolleman
SIB | Swiss Institute of Bioinformatics
CMU - 1, rue Michel Servet - 1211 Geneva 4
t: +41 22 379 58 85 - f: +41 22 379 58 58
Jerven.Bolleman(a)isb-sib.ch -
http://www.isb-sib.ch