Dear WikiData Developers,
I am very pleased to read about your intention to deploy a SPARQL-based API for WikiData. I am one of the developers behind sparql.uniprot.org, a public, free-to-use SPARQL endpoint (with content under CC-ND, soon to be made more liberal). UniProt can be seen as an encyclopaedia of facts about proteins, and we have about 90 million entries, e.g. http://www.uniprot.org/uniprot/P49407 (which corresponds to http://en.wikipedia.org/wiki/Arrestin). These 90 million entries in UniProtKB (and another 200 million or so in our supporting related datasets) lead to a bit more than 19 billion unique triples.
We currently use Virtuoso, but I completely understand why you selected BlazeGraph; it seems to me a better fit for your mission and deployment scenario. I evaluated what was then BigData three years ago and would have selected it if we had not needed to go for vertical scaling, due to having just 1U available in our computer room.
I also understand the worries that you have about being able to run a service that is resilient to DoS (intentional or not). We have not yet needed to run any mitigation strategies more complex than banning an IP + user-agent string combination. However, we do have a few ideas about how to make that automatic, even if we don’t have the time to work on them yet. Our query timeouts are generously above what a sparsely used HTTP connection supports in practice. You should also not be worried about GRAPH traversals; in our practice, simple DISTINCT queries are much more painful, e.g. the common SELECT (COUNT(DISTINCT ?subject) AS ?count) WHERE { ?subject ?predicate ?object } that some people like to send daily can be a challenge on large databases. We could institute draconian timeouts, but we don’t, because we want to get the difficult queries: the simple ones our users can run on our main site, but the analytical ones require a solution such as SPARQL.
Maintaining a SPARQL endpoint for public use requires you to focus on client management, not query management. You can have one client with relatively easy queries, but one that happily sends you 40,000 of them per second. Others send you a query that looks complicated and expensive but is actually very selective and ends up taking next to no time at all.
One of the solutions that we will look into in the long term is a query manager in front of our endpoint. I will try to explain the idea behind it, but if it’s fuzzy just ask and I will try to explain better ;) Assume you have a CPU budget available for hosting your SPARQL endpoint, let’s say 100 ticks per day. If you have one user, that user should be able to get all 100 ticks. If you have one hundred users, each user gets 1 tick. If you limit your one user to 1 tick, you have wasted 99 ticks. If you let the first user use 100 ticks and the other ninety-nine users their 1 tick each, you have broken the budget by 99 ticks. One solution is therefore to allow clients to optimistically run a query for longer than their ticks, but the moment that other clients arrive they get kicked off, e.g. complex queries run fine on a Sunday evening but not on a Monday at 10 am.
You can do this by having a service in front of your API that takes SPARQL queries and queues them for execution in a priority queue, sending back a 303 response with Retry-After and Location headers so that the HTTP connections don’t die. The priority queue ensures everyone gets a turn to run their queries, by reducing the priority of people who are already running a query etc. The query management service can then see whether a query returned within its tick and whether anyone else has scheduled a query. If someone else has a query outstanding, the query manager tells the SPARQL engine to stop working on the query and to send back any results if possible. I don’t know if BlazeGraph allows this, but I know how to do it for GraphDB and have an idea about how to do it in Virtuoso; I am, however, sure that Systap can add this to BlazeGraph on short notice. Such a query manager will deal with all common forms of DoS attacks.
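To make that flow concrete, here is roughly what the exchange could look like (the paths, timings and header values are only illustrative, not a finished design):

POST /sparql HTTP/1.1
Content-Type: application/sparql-query

SELECT (COUNT(DISTINCT ?subject) AS ?count) WHERE { ?subject ?predicate ?object }

HTTP/1.1 303 See Other
Location: /sparql/queries/1234
Retry-After: 30

GET /sparql/queries/1234 HTTP/1.1

HTTP/1.1 200 OK
Content-Type: application/sparql-results+json

(the results, once the query has had its turn and finished within its ticks)

The client just keeps following the Location header and respecting Retry-After until the results are ready, so no HTTP connection has to stay open for the lifetime of the query.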
Now to the advantages of SPARQL over the other options, Gremlin included: it is an open standard that is deployed in the wild for use over HTTP by complete strangers, not just in academia but also e.g. at http://open-data.europa.eu/en/linked-data, so it is more likely that you can combine data in primary resources with data in WikiData. Also, you won’t be the only ones worrying about attacks on your public endpoint and will have a larger community to share knowledge and code with.
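As a hedged illustration of that combination (the property names are made up, and your endpoint obviously does not exist yet), a federated SPARQL 1.1 query could one day pull our data into yours:

SELECT ?item ?disease
WHERE {
  # hypothetical WikiData-side link from an item to its UniProt entry
  ?item example:uniprotEntry ?protein .
  SERVICE <http://sparql.uniprot.org/sparql> {
    # hypothetical UniProt-side annotation
    ?protein example:annotatedDisease ?disease .
  }
}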
On the reification business: we use it extensively, and in our case about 15% of our triples are in reification quads. But that is the surface representation; how it is actually stored on disk (or materialised on demand) is completely different and not something you should worry about at this time. Most of the time you don’t need reification, and it can often be avoided by better modelling. For example, in our archive of protein sequences we used to use reification to note that a protein sequence can no longer be found in a database entry. Now we model such a database entry just like all the other active entries: we just give it a new unique URI and an rdf:type to say it is obsolete and that it corresponds to a version of an active database entry. When teaching SPARQL to scientists I call this way of thinking “model the measurement result, not the conclusion”, e.g. instead of saying
example:Jerven example:length_in_cm 197 .
model
example:Jerven example:measured_length
    [ example:length_measurement_result_in_cm 197 ;
      example:measurement_date "2015-02-27"^^xsd:date ] ,
    [ example:length_measurement_result_in_cm 60 ;
      example:measurement_date "1983-06-21"^^xsd:date ] .
As you can see, in the second model you don’t have to worry about invalidating data, as the measurements stay correct. You would have been forced to model like this in practice with Titan, and you should do the same in RDF. Just because reification is available does not mean you must use it ;)
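And to show that the second model stays easy to query, something along these lines (reusing the made-up example: predicates from above, prefix declarations omitted) gives you the most recent measurement while keeping the full history:

SELECT ?cm ?date
WHERE {
  example:Jerven example:measured_length ?measurement .
  ?measurement example:length_measurement_result_in_cm ?cm ;
               example:measurement_date ?date .
}
ORDER BY DESC(?date)
LIMIT 1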
I hope that this mail was constructive and that some of your worries will be eased, knowing that there are (possible) solutions for maintaining a reliable service.
Regards, Jerven
PS. There are more open-source options for SPARQL, e.g. RDF4J.
Hey! Thanks for the email! It’s reasonably similar to our experience working on other similar services. I’ll respond to some of it inline.
On Thu, Mar 12, 2015 at 1:21 PM, Jerven Tjalling Bolleman <Jerven.Bolleman@isb-sib.ch> wrote:
Dear WikiData Developers,
I am very pleased to read about your intention to deploy a SPARQL-based API for WikiData. I am one of the developers behind sparql.uniprot.org, a public, free-to-use SPARQL endpoint (with content under CC-ND, soon to be made more liberal). UniProt can be seen as an encyclopaedia of facts about proteins, and we have about 90 million entries, e.g. http://www.uniprot.org/uniprot/P49407 (which corresponds to http://en.wikipedia.org/wiki/Arrestin). These 90 million entries in UniProtKB (and another 200 million or so in our supporting related datasets) lead to a bit more than 19 billion unique triples.
We currently use Virtuoso, but I completely understand why you selected BlazeGraph; it seems to me a better fit for your mission and deployment scenario. I evaluated what was then BigData three years ago and would have selected it if we had not needed to go for vertical scaling, due to having just 1U available in our computer room.
I also understand the worries that you have about being able to run a service that is resilient to DoS (intentional or not). We have not yet needed to run any mitigation strategies more complex than banning an IP + user-agent string combination. However, we do have a few ideas about how to make that automatic, even if we don’t have the time to work on them yet. Our query timeouts are generously above what a sparsely used HTTP connection supports in practice. You should also not be worried about GRAPH traversals; in our practice, simple DISTINCT queries are much more painful, e.g. the common SELECT (COUNT(DISTINCT ?subject) AS ?count) WHERE { ?subject ?predicate ?object } that some people like to send daily can be a challenge on large databases. We could institute draconian timeouts, but we don’t, because we want to get the difficult queries: the simple ones our users can run on our main site, but the analytical ones require a solution such as SPARQL.
Cool! I'm primarily interested in making sure timeouts work and queries run in a way that they can't swamp the Java heap. My _hope_ is that that is good enough. Banning particular constructions isn't something I'd like to do, but it's certainly possible - I prototyped it a week or so ago. It's just a tool I'd rather not reach for.
Maintaining a SPARQL endpoint for public use requires you to focus on client management, not query management. You can have one client with relatively easy queries, but one that happily sends you 40,000 of them per second. Others send you a query that looks complicated and expensive but is actually very selective and ends up taking next to no time at all.
One of the solutions that we will look into in the long term is a query manager in front of our endpoint. I will try to explain the idea behind it, but if it’s fuzzy just ask and I will try to explain better ;) Assume you have a CPU budget available for hosting your SPARQL endpoint, let’s say 100 ticks per day. If you have one user, that user should be able to get all 100 ticks. If you have one hundred users, each user gets 1 tick. If you limit your one user to 1 tick, you have wasted 99 ticks. If you let the first user use 100 ticks and the other ninety-nine users their 1 tick each, you have broken the budget by 99 ticks. One solution is therefore to allow clients to optimistically run a query for longer than their ticks, but the moment that other clients arrive they get kicked off, e.g. complex queries run fine on a Sunday evening but not on a Monday at 10 am.
Nice! We've talked about some of this with our full-text search. There it's different - it's harder to construct overwhelming queries, but they can't be killed. BlazeGraph supports killing queries, but it's pretty trivial to write nasty queries. We have tools for limiting a user's concurrent requests and total requests. This is neat because it's strictly better than timeouts. We'll certainly give it a shot!
You can do this by having a service in front of your API that takes SPARQL queries and queues them for execution in a priority queue, sending back a 303 response with Retry-After and Location headers so that the HTTP connections don’t die. The priority queue ensures everyone gets a turn to run their queries, by reducing the priority of people who are already running a query etc. The query management service can then see whether a query returned within its tick and whether anyone else has scheduled a query. If someone else has a query outstanding, the query manager tells the SPARQL engine to stop working on the query and to send back any results if possible. I don’t know if BlazeGraph allows this, but I know how to do it for GraphDB and have an idea about how to do it in Virtuoso; I am, however, sure that Systap can add this to BlazeGraph on short notice. Such a query manager will deal with all common forms of DoS attacks.
Now to the advantages of SPARQL over the other options, Gremlin included: it is an open standard that is deployed in the wild for use over HTTP by complete strangers, not just in academia but also e.g. at http://open-data.europa.eu/en/linked-data, so it is more likely that you can combine data in primary resources with data in WikiData. Also, you won’t be the only ones worrying about attacks on your public endpoint and will have a larger community to share knowledge and code with.
On the reification business: we use it extensively, and in our case about 15% of our triples are in reification quads. But that is the surface representation; how it is actually stored on disk (or materialised on demand) is completely different and not something you should worry about at this time. Most of the time you don’t need reification, and it can often be avoided by better modelling. For example, in our archive of protein sequences we used to use reification to note that a protein sequence can no longer be found in a database entry. Now we model such a database entry just like all the other active entries: we just give it a new unique URI and an rdf:type to say it is obsolete and that it corresponds to a version of an active database entry. When teaching SPARQL to scientists I call this way of thinking “model the measurement result, not the conclusion”, e.g. instead of saying
example:Jerven example:length_in_cm 197 .
model
example:Jerven example:measured_length
    [ example:length_measurement_result_in_cm 197 ;
      example:measurement_date "2015-02-27"^^xsd:date ] ,
    [ example:length_measurement_result_in_cm 60 ;
      example:measurement_date "1983-06-21"^^xsd:date ] .
As you can see, in the second model you don’t have to worry about invalidating data, as the measurements stay correct. You would have been forced to model like this in practice with Titan, and you should do the same in RDF. Just because reification is available does not mean you must use it ;)
I have opinions around simplicity - mostly that, for the best-rank subset of the data, the system should behave as though example:Jerven example:length_in_cm 197 . was in the db. Whether that is done using inference, query rewriting, pregenerated triples, or unicorns, I'm not sure yet. That is what we're working on now, because those choices seem like big tradeoffs.
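Just to illustrate the pregenerated-triples option (the property names below are made up, and this is not a commitment to any particular scheme): the store could hold both the detailed statement and a plain triple materialised from whichever statement has the best rank:

example:Jerven example:length_statement [
    example:length_in_cm 197 ;
    example:rank example:BestRank ;
    example:reference example:SomeSource ] .

# materialised from the best-ranked statement, so simple queries keep working
example:Jerven example:length_in_cm 197 .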
Probably incorrectly, when I say reification I think it means "all of the references, qualifiers, and parts of the value (like precision and error) are available", not the specific construction involving blank nodes. We, eventually, need all that stuff in the query service. We're quite unlikely to actually use the blank-nodes way of describing it, though.
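For reference, the blank-nodes construction I mean is the classic rdf:Statement style, roughly like this (example:someQualifier is a made-up property, only there to show where the extra detail would hang):

[] a rdf:Statement ;
   rdf:subject example:Jerven ;
   rdf:predicate example:length_in_cm ;
   rdf:object 197 ;
   example:someQualifier "references, precision, error and so on would go here" .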
Nik