Hey! Thanks for the email! It's reasonably similar to our experience
working on other services. I'll respond to some of it inline.
On Thu, Mar 12, 2015 at 1:21 PM, Jerven Tjalling Bolleman <
Jerven.Bolleman(a)isb-sib.ch> wrote:
Dear WikiData Developers,
I am very pleased to read of your intention to deploy a SPARQL-based API
for WikiData. I am one of the developers behind sparql.uniprot.org, a
public, free-to-use SPARQL endpoint (with content under CC-ND, soon to be
made more liberal). UniProt can be seen as an encyclopaedia of facts about
proteins, and we have about 90 million pages of entries, e.g.
http://www.uniprot.org/uniprot/P49407 (corresponds to
http://en.wikipedia.org/wiki/Arrestin). These 90 million entries in
UniProtKB (and another 200 million or so in our supporting related
datasets) lead to a bit more than 19 billion unique triples.
We currently use Virtuoso, but I completely understand why you selected
BlazeGraph; it seems to me a better fit for your mission and deployment
scenario. I evaluated what was then BigData 3 years ago and would have
selected it if we had not needed to go for vertical scaling due to having
just 1U available in our computer room.
I also understand the worries that you have about being able to run a
service that is resilient to DOS (intentional or not). We have not yet
needed to run any mitigation strategies that are more complex than banning
an IP+user-agent string. However, we do have a few ideas about how to make
that automatic, even if we don’t have the time to work on these ideas yet.
Our query timeouts are generously above what a sparsely used HTTP
connection supports in practice. You should also not be worried about graph
traversals; in our experience simple DISTINCT queries are much more
painful. E.g. the common
SELECT (COUNT(DISTINCT ?subject) AS ?count)
WHERE { ?subject ?predicate ?object }
that some people like to send daily can be a challenge on large databases.
We could institute draconian timeouts but we don’t, because we want to get
the difficult queries: the simple ones our users can do on our main site,
but the analytical ones require a solution such as SPARQL.
Cool! I'm primarily interested in making sure timeouts work and queries
run in a way that they can't swamp the Java heap. My _hope_ is that that
is good enough. Banning particular constructions isn't something I'd like
to do but it's certainly possible - I prototyped it a week or so ago. It's
just a tool I'd rather not reach for.
Maintaining a SPARQL endpoint for public use requires you to focus on
client management, not query management. You can have one client with
relatively easy queries, but one that happily sends you 40,000 of these per
second. Others send you a query that looks complicated and expensive but is
actually very selective and ends up taking next to no time at all.
One of the solutions that we will look into long term is a query manager
in front of our endpoint. I will try to explain the idea behind that, but
if it's fuzzy just ask and I will try to explain better ;) Assume you have
a CPU budget available for hosting your SPARQL endpoint. Let's say 100
ticks per day. If you have one user, that user should be able to get all
100 ticks. If you have one hundred users, each user gets 1 tick. If you
limit your one user to 1 tick you have wasted 99 ticks. If you let the
first user use 100 ticks and the other ninety-nine users their 1 each, you
have broken the budget by 99 ticks. One solution is therefore to allow
clients to optimistically run a query for longer than their ticks, but the
moment that other clients arrive they get kicked off. E.g. complex queries
run OK on Sunday evening but not on Monday at 10am.
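The tick arithmetic above can be sketched in a few lines of Python. This is only an illustration of the budget idea, not anything UniProt or BlazeGraph actually runs; the function name may_continue and the even split of the budget are assumptions made for the example.

```python
def may_continue(ticks_used, budget_ticks, active_clients):
    """Return True while a running query is still within its fair share.

    Hypothetical helper: the fair share is the budget divided evenly
    over the clients that currently want to run queries, so it shrinks
    the moment other clients arrive.
    """
    if active_clients < 1:
        raise ValueError("need at least one active client")
    fair_share = budget_ticks / active_clients
    return ticks_used < fair_share


# A lone client may keep going well past 1 tick...
print(may_continue(ticks_used=50, budget_ticks=100, active_clients=1))
# ...but the same query is over its share once 100 clients are active.
print(may_continue(ticks_used=50, budget_ticks=100, active_clients=100))
```

The check would be re-evaluated whenever the number of active clients changes, which is what makes the optimistic long run safe.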
Nice! We've talked about some of this with our full text search. There it
is different - it's harder to construct overwhelming queries but they can't
be killed. BlazeGraph supports killing queries but it's pretty trivial to
write nasty queries. We have tools for limiting a user's concurrent
requests and total requests. This is neat because it's strictly better than
timeouts. We'll certainly give it a shot!
You can do this by having a service in front of your API that takes SPARQL
queries and queues them for execution in a priority queue. It sends back a
303 response with Retry-After and Location headers so that the HTTP
connections don’t die. The priority queue ensures everyone can get a turn
to run their queries, by reducing the priority of people who are already
running a query etc. The query management service can then see if a query
returned within its tick and if anyone else has scheduled a query. If
someone else has a query outstanding, the query manager tells the SPARQL
engine to stop working on the query and send back any results if possible.
I don’t know if BlazeGraph allows this, but I know how to do it for GraphDB
and have an idea about how to do it in Virtuoso. I am, however, sure that
Systap can add this to BlazeGraph on short notice. Such a query manager
will deal with all common forms of DOS attacks.
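A minimal sketch of such a query manager, using only the Python standard library. Everything here is hypothetical: the class QueryManager, its method names, the /queries/<id> path and the default retry interval are illustrative choices, and the actual call that asks the SPARQL engine to cancel a query is left out because it differs per engine.

```python
import heapq
import itertools
from collections import defaultdict


class QueryManager:
    """Toy scheduler: clients that have used fewer ticks are served first."""

    def __init__(self, retry_after_seconds=5):
        self._heap = []                 # entries: (ticks used by client, job id)
        self._ids = itertools.count()   # increasing ids also break ties FIFO
        self._jobs = {}                 # job id -> (client, query)
        self._used = defaultdict(int)   # client -> ticks consumed so far
        self._retry_after = retry_after_seconds

    def submit(self, client, query):
        """Queue a query and describe the 303 response to send back."""
        job_id = next(self._ids)
        self._jobs[job_id] = (client, query)
        # Priority is fixed at submission time; a real service would
        # probably re-rank waiting jobs as usage changes.
        heapq.heappush(self._heap, (self._used[client], job_id))
        return {
            "status": 303,
            "headers": {
                "Location": f"/queries/{job_id}",  # assumed URL layout
                "Retry-After": str(self._retry_after),
            },
        }

    def charge(self, client, ticks):
        """Record ticks consumed, lowering the client's future priority."""
        self._used[client] += ticks

    def next_job(self):
        """Pop the waiting job whose client has used the fewest ticks."""
        if not self._heap:
            return None
        _, job_id = heapq.heappop(self._heap)
        client, query = self._jobs.pop(job_id)
        return job_id, client, query

    def should_preempt(self, ticks_run, tick_allowance):
        """Stop a running query only if it is over its tick AND others wait."""
        return ticks_run >= tick_allowance and bool(self._heap)
```

With this shape, a client that has already consumed many ticks is not refused; its next query simply waits behind clients that have used less.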
Now to the advantages of SPARQL over the other options, Gremlin included:
it is an open standard that is deployed in the wild for use over HTTP by
complete strangers. Not just in academia but also e.g. at
http://open-data.europa.eu/en/linked-data so it's more likely that you can
combine data in primary resources with data in WikiData. Also you won’t be
the only ones worrying about attacks on your public endpoint and will have
a larger community to share knowledge and code with.
On the reification business, we use it extensively and in our case about
15% of our triples are in reification quads. But that is the surface
representation; how that is actually stored on disk (or materialised on
demand) is completely different and not something you should worry about at
this time. Most of the time you don’t need reification, and it can often be
avoided by better modelling. For example, in our archive of protein
sequences we used to use reification to note that a protein sequence can no
longer be found in a database entry. Now we model such a database entry
just like all the other active entries; we just give it a new unique URI
and an rdf:type to say it's obsolete and that it corresponds to a version
of an active database entry. When teaching SPARQL to scientists I call this
way of thinking “model the measurement result, not the conclusion”. E.g.
instead of saying
example:Jerven example:length_in_cm 197 .
model
example:Jerven
    example:measured_length
        [ example:length_measurement_result_in_cm 197 ;
          example:measurement_date "2015-02-27"^^xsd:date ] ,
        [ example:length_measurement_result_in_cm 60 ;
          example:measurement_date "1983-06-21"^^xsd:date ] .
As you can see, in the second model you don’t have to worry about
invalidating data, as each measurement stays correct. You would have been
forced to model like this in practice with Titan, and you should do the
same in RDF. Just because reification is available does not mean you must
use it ;)
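The same idea in plain Python, as a rough sketch: the measurements are kept as dated records and the simple "length" statement is derived from them rather than stored. The function name current_length_cm and the rule "latest measurement wins" are assumptions made for the illustration.

```python
from datetime import date

# Each record is (measurement date, result in cm), mirroring the two
# blank nodes in the RDF example.
measurements = [
    (date(1983, 6, 21), 60),
    (date(2015, 2, 27), 197),
]


def current_length_cm(records):
    """Derive the 'conclusion' from the most recent measurement."""
    if not records:
        return None
    latest = max(records, key=lambda record: record[0])
    return latest[1]


print(current_length_cm(measurements))
```

Adding a newer measurement changes the derived value without making any of the older records wrong.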
I have opinions around simplicity - mostly that, for the best rank subset
of data, the system should behave as though example:Jerven
example:length_in_cm 197 . was in the db. Whether that is done using
inference, query rewriting, pregenerated triples, or unicorns, I'm not sure
yet. That is what we're working on now because they seem like big
tradeoffs.
Probably incorrectly, when I say reification I mean "all of the
references, qualifiers, and parts of the value (like precision and error)
are available", not the specific construction involving blank nodes. We
eventually need all that stuff in the query service. We're quite unlikely
to actually use the blank nodes way of describing it though.
Nik