Hello all,
The SPARQL endpoint we are running at http://query.wikidata.org has several measures in place in order to ensure it stays up and running and available for everyone, for example the 30 sec query timeout. This is necessary but also prevents some useful queries from being run. One way around this is Linked Data Fragments. It allows for some of the query computation to be done on the client-side instead of our server.
We have set this up now for testing and would appreciate your testing and feedback. You can find out more about Linked Data Fragments http://linkeddatafragments.org/concept/ and documentation for our installation https://www.mediawiki.org/wiki/Wikidata_query_service/User_Manual#Linked_Data_Fragments_endpoint. Also, you can see a demo of client-side SPARQL evaluation and LDF server usage here: http://ldfclient.wmflabs.org/
Please note - it's in no way a production service for anything, just a proof-of-concept deployment of LDF client. If you like how it works, you can get it from the source https://github.com/LinkedDataFragments/jQuery-Widget.js and deploy it on your own setup.
Feel free to ask Stas (Smalyshev (WMF)) for any further question!
Hi,
Interesting. I encountered some problems in my initial tests though:
(1) The results do not seem to be correct. The example query related to films returns 55 results, while on the official endpoint it returns 128. It seems that this is not because of missing data, but because of wrong multiplicities (the correct result has several rows repeated multiple times). Is there an implicit DISTINCT applied in this service somewhere? Are there any other changes from the normal SPARQL semantics?
(2) It is really slow. The sample query took 55s on my machine (producing only half of the results), while it takes 0.2s on WDQS. I am afraid that hard queries which would timeout on WDQS might take too long to be used at all.
However, I would still like to try it with one of our harder queries. Can I use the service from a program (my harder queries have too many results to be displayed in a browser -- this is why they are hard ;-)? Ideally, I would like to use it like a SPARQL service that I send a request to. Is this possible?
Best,
Markus
On 21.12.2016 09:23, Léa Lacroix wrote:
Hello all,
The SPARQL endpoint we are running at http://query.wikidata.org has several measures in place in order to ensure it stays up and running and available for everyone, for example the 30 sec query timeout. This is necessary but also prevents some useful queries from being run. One way around this is Linked Data Fragments. It allows for some of the query computation to be done on the client-side instead of our server.
We have set this up now for testing and would appreciate your testing and feedback. You can find out more about Linked Data Fragments http://linkeddatafragments.org/concept/ and documentation for our installation https://www.mediawiki.org/wiki/Wikidata_query_service/User_Manual#Linked_Data_Fragments_endpoint. Also, you can see a demo of client-side SPARQL evaluation and LDF server usage here: http://ldfclient.wmflabs.org/
Please note - it's in no way a production service for anything, just a proof-of-concept deployment of LDF client. If you like how it works, you can get it from the source https://github.com/LinkedDataFragments/jQuery-Widget.js and deploy it on your own setup.
Feel free to ask Stas (Smalyshev (WMF)) for any further question!
-- Léa Lacroix Community Communication Manager for Wikidata
Wikimedia Deutschland e.V. Tempelhofer Ufer 23-24 10963 Berlin www.wikimedia.de http://www.wikimedia.de
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter der Nummer 23855 Nz. Als gemeinnützig anerkannt durch das Finanzamt für Körperschaften I Berlin, Steuernummer 27/029/42207.
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Hi,
I also found other cases where the results are not correct. For example, if you try the example query "overall causes of death ranking", there is no error and a long computation (eating up most of my CPU) and, eventually, no results.
Cheers,
Markus
On 21.12.2016 11:11, Markus Kroetzsch wrote:
Hi,
Interesting. I encountered some problems in my initial tests though:
(1) The results do not seem to be correct. The example query related to films returns 55 results, while on the official endpoint it returns 128. It seems that this is not because of missing data, but because of wrong multiplicities (the correct result has several rows repeated multiple times). Is there an implicit DISTINCT applied in this service somewhere? Are there any other changes from the normal SPARQL semantics?
(2) It is really slow. The sample query took 55s on my machine (producing only half of the results), while it takes 0.2s on WDQS. I am afraid that hard queries which would timeout on WDQS might take too long to be used at all.
However, I would still like to try it with one of our harder queries. Can I use the service from a program (my harder queries have too many results to be displayed in a browser -- this is why they are hard ;-)? Ideally, I would like to use it like a SPARQL service that I send a request to. Is this possible?
Best,
Markus
On 21.12.2016 09:23, Léa Lacroix wrote:
Hello all,
The SPARQL endpoint we are running at http://query.wikidata.org has several measures in place in order to ensure it stays up and running and available for everyone, for example the 30 sec query timeout. This is necessary but also prevents some useful queries from being run. One way around this is Linked Data Fragments. It allows for some of the query computation to be done on the client-side instead of our server.
We have set this up now for testing and would appreciate your testing and feedback. You can find out more about Linked Data Fragments http://linkeddatafragments.org/concept/ and documentation for our installation https://www.mediawiki.org/wiki/Wikidata_query_service/User_Manual#Linked_Data_Fragments_endpoint.
Also, you can see a demo of client-side SPARQL evaluation and LDF server usage here: http://ldfclient.wmflabs.org/
Please note - it's in no way a production service for anything, just a proof-of-concept deployment of LDF client. If you like how it works, you can get it from the source https://github.com/LinkedDataFragments/jQuery-Widget.js and deploy it on your own setup.
Feel free to ask Stas (Smalyshev (WMF)) for any further question!
-- Léa Lacroix Community Communication Manager for Wikidata
Wikimedia Deutschland e.V. Tempelhofer Ufer 23-24 10963 Berlin www.wikimedia.de http://www.wikimedia.de
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter der Nummer 23855 Nz. Als gemeinnützig anerkannt durch das Finanzamt für Körperschaften I Berlin, Steuernummer 27/029/42207.
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Hi!
(1) The results do not seem to be correct. The example query related to films returns 55 results, while on the official endpoint it returns 128. It seems that this is not because of missing data, but because of wrong multiplicities (the correct result has several rows repeated multiple times). Is there an implicit DISTINCT applied in this service somewhere? Are there any other changes from the normal SPARQL semantics?
You mean the results on http://ldfclient.wmflabs.org/ or the results returned directly from the endpoint?
Note that http://ldfclient.wmflabs.org/ is just a demo. It's not a production service, it's just a showcase of what can be done using LDF. So it's possible SPARQL implementation there is somehow buggy or different from others. I think it relies on https://www.npmjs.com/package/sparqljs and https://www.npmjs.com/package/ldf-client - however I can't really vouch on what happens in that code. I'll try to see where the differences come from, but not sure it's worth spending too much time debugging a demo service. If however something is wrong with the patterns themselves that would be a serious issue.
This example is just a demo of how pattern fragments enable to take SPARQL work out of the server to the client. It's not intended as a production SPARQL service :)
(2) It is really slow. The sample query took 55s on my machine (producing only half of the results), while it takes 0.2s on WDQS. I am afraid that hard queries which would timeout on WDQS might take too long to be used at all.
Well, comparing it to WDQS is not really fair :) It's a JS implementation of SPARQL, running running in your browser and loading data over the network. Still 55 s it too much - it finishes in around 10s for me. Maybe your network is slower?
I think if it is run on real hardware on stronger JS or Java engine though it might be faster. Also, I'm sure there are other LDF clients, such as Java one: http://linkeddatafragments.org/software/
However, I would still like to try it with one of our harder queries. Can I use the service from a program (my harder queries have too many results to be displayed in a browser -- this is why they are hard ;-)?
Yes. Please tell me if something goes wrong. You may want to use non-browser client though.
Ideally, I would like to use it like a SPARQL service that I send a request to. Is this possible?
Not for this one. This is a triple pattern service, by design - it can be used by a client-side SPARQL implementation but it does not include one. Including full SPARQL implementation into it would be contradictory - the whole point of this is to shift the workload to the client, if we shift it back to the server, we're back to the regular SPARQL endpoint with all the limitations we must put on it. However, I'm pretty sure something like Client.js or Client.java mentioned above can do SPARQL queries - it's how the demo works.
Hi Stas,
Thanks for the info. Yes, all my comments apply to the ldf demo. I understand that it is a demo, and what the motivation is on paper, but if it returns incorrect results, then it is of little use. You can get those without any load on server or client ;-).
Also, there should be some way of doing queries that don't run on WDQS already, i.e., there must be something that times out now but can be done with ldf in a reasonable time [1]. Or are federated queries the main goal here? (that's still useful, but I hope that WDQS will also support a whitelisted set of external endpoint at some time)
Best,
Markus
[1] Upper bound for "reasonable": time it takes to download the RDF dump, install Blazegraph locally without timeout, load the dump, and get your query answered there ;-)
On 21.12.2016 18:10, Stas Malyshev wrote:
Hi!
(1) The results do not seem to be correct. The example query related to films returns 55 results, while on the official endpoint it returns 128. It seems that this is not because of missing data, but because of wrong multiplicities (the correct result has several rows repeated multiple times). Is there an implicit DISTINCT applied in this service somewhere? Are there any other changes from the normal SPARQL semantics?
You mean the results on http://ldfclient.wmflabs.org/ or the results returned directly from the endpoint?
Note that http://ldfclient.wmflabs.org/ is just a demo. It's not a production service, it's just a showcase of what can be done using LDF. So it's possible SPARQL implementation there is somehow buggy or different from others. I think it relies on https://www.npmjs.com/package/sparqljs and https://www.npmjs.com/package/ldf-client - however I can't really vouch on what happens in that code. I'll try to see where the differences come from, but not sure it's worth spending too much time debugging a demo service. If however something is wrong with the patterns themselves that would be a serious issue.
This example is just a demo of how pattern fragments enable to take SPARQL work out of the server to the client. It's not intended as a production SPARQL service :)
(2) It is really slow. The sample query took 55s on my machine (producing only half of the results), while it takes 0.2s on WDQS. I am afraid that hard queries which would timeout on WDQS might take too long to be used at all.
Well, comparing it to WDQS is not really fair :) It's a JS implementation of SPARQL, running running in your browser and loading data over the network. Still 55 s it too much - it finishes in around 10s for me. Maybe your network is slower?
I think if it is run on real hardware on stronger JS or Java engine though it might be faster. Also, I'm sure there are other LDF clients, such as Java one: http://linkeddatafragments.org/software/
However, I would still like to try it with one of our harder queries. Can I use the service from a program (my harder queries have too many results to be displayed in a browser -- this is why they are hard ;-)?
Yes. Please tell me if something goes wrong. You may want to use non-browser client though.
Ideally, I would like to use it like a SPARQL service that I send a request to. Is this possible?
Not for this one. This is a triple pattern service, by design - it can be used by a client-side SPARQL implementation but it does not include one. Including full SPARQL implementation into it would be contradictory
- the whole point of this is to shift the workload to the client, if we
shift it back to the server, we're back to the regular SPARQL endpoint with all the limitations we must put on it. However, I'm pretty sure something like Client.js or Client.java mentioned above can do SPARQL queries - it's how the demo works.
Hi!
Also, there should be some way of doing queries that don't run on WDQS already, i.e., there must be something that times out now but can be done with ldf in a reasonable time [1]. Or are federated queries the main goal here? (that's still useful, but I hope that WDQS will also support a whitelisted set of external endpoint at some time)
Yes, the plan is to have a whitelist of SPARQL endpoints for which we allow federation queries. I was going to post about it in January when everybody's back from vacations :)
The use case for LDF is both running long queries and federation. The first use case of course would benefit from good SPARQL engine running on a machine with decent connectivity. So server-side JS or Java should probably fare better.
Yes, the plan is to have a whitelist of SPARQL endpoints for which we allow federation queries. I was going to post about it in January when everybody's back from vacations :)
I'm very curious to see how this will fare. Federation is where I think public SPARQL endpoints will fail, so it will be worthwhile to see what happens.
The use case for LDF is both running long queries and federation. The first use case of course would benefit from good SPARQL engine running on a machine with decent connectivity. So server-side JS or Java should probably fare better.
Actually, probably not: – The V8 JavaScript engine is as fast in the browser as with Node.js. Furthermore, query execution in the browser happens in a background worker, and the UI has been designed for minimal impact. – The Java implementation is based on Jena, which is an engine built with very strong assumptions that do not necessarily hold for TPF.
The main impact I see is latency (California <=> Europe), which can be mitigated with caching and/or HTTP/2.
Best,
Ruben
On 12/21/16 4:13 PM, Ruben Verborgh wrote:
Federation is where I think public SPARQL endpoints will fail, so it will be worthwhile to see what happens.
Really, then you will ultimately be surprised on that front too!
Hi Stas,
Regarding your guess that network connectivity could cause my problem: I got the 55sec time while in my office, connected via an Ethernet cable to the university network. This is more or less directly wired up to the backbone of the Internet, so bandwidth cannot be the issue here.
I now got better times for that query on Chrome (my first try was on Firefox). Maybe the different Javascript speeds are the main issue here. However, most other queries I try still time out or at least don't give any results for some reason, even on Chrome.
Best,
Markus
On 21.12.2016 22:07, Stas Malyshev wrote:
Hi!
Also, there should be some way of doing queries that don't run on WDQS already, i.e., there must be something that times out now but can be done with ldf in a reasonable time [1]. Or are federated queries the main goal here? (that's still useful, but I hope that WDQS will also support a whitelisted set of external endpoint at some time)
Yes, the plan is to have a whitelist of SPARQL endpoints for which we allow federation queries. I was going to post about it in January when everybody's back from vacations :)
The use case for LDF is both running long queries and federation. The first use case of course would benefit from good SPARQL engine running on a machine with decent connectivity. So server-side JS or Java should probably fare better.
Hi Markus,
I got the 55sec time while in my office, connected via an Ethernet cable to the university network. This is more or less directly wired up to the backbone of the Internet, so bandwidth cannot be the issue here.
The latency is the main culprit, not bandwidth.
I also noted a big difference in query times when I was in California last summer (my servers are located in Europe).
I now got better times for that query on Chrome (my first try was on Firefox). Maybe the different Javascript speeds are the main issue here.
The Chrome engine is indeed faster than the Firefox engine.
However, most other queries I try still time out or at least don't give any results for some reason, even on Chrome.
Feel free to mail me examples, I can always look into them.
Best,
Ruben
Hi!
Regarding your guess that network connectivity could cause my problem: I got the 55sec time while in my office, connected via an Ethernet cable to the university network. This is more or less directly wired up to the backbone of the Internet, so bandwidth cannot be the issue here.
I now got better times for that query on Chrome (my first try was on Firefox). Maybe the different Javascript speeds are the main issue here. However, most other queries I try still time out or at least don't give any results for some reason, even on Chrome.
I plan to install it on a separate VM in more stable environment and then re-check what happens. I suspect queries giving no results may be some instability in the setup I've created (didn't expect it to be tested that thoroughly :) I'll update when I've made more robust setup (probably later today) and we can see if it improves things.
Hi Markus,
Answering this as the LDF lead developer.
(1) The results do not seem to be correct. The example query related to films returns 55 results, while on the official endpoint it returns 128. It seems that this is not because of missing data, but because of wrong multiplicities (the correct result has several rows repeated multiple times). Is there an implicit DISTINCT applied in this service somewhere?
No, there isn't. Let me investigate what goes wrong there.
Are there any other changes from the normal SPARQL semantics?
There should be none; any you find are bugs.
(2) It is really slow.
Depending on your definition of "really", yes.
For this, I'd like to point to the overall aim of the LDF project, as documented on our website and papers. Summarizing: the SemWeb community has almost exclusively cared about speed so far concerning query execution. This has resulted in super-fast, but super-expensive services, which simply don't work on the public Web. More than half of all public SPARQL endpoints are down for more than 1.5 days each month [1].
I started LDF with the idea: what if other things are important as well? What if it is acceptable to trade speed for lower server cost and higher cacheability and scalability?
The other alternatives are: – downloading a data dump and querying yourself (would take >= 20 mins) – executing traversal-based querying (incomplete and likely >= 10 mins) So in that sense, TPF is slow, but not as slow as other alternatives. Furthermore, these alternatives have other problems, such as bandwidth, remaining up-to-date, and completeness.
The sample query took 55s on my machine (producing only half of the results), while it takes 0.2s on WDQS.
Exactly, that's the trade-off we offer. And in many cases, it isn't as bad as in this example. And in many others, it's even worse. But that's something we accept; Web scalability is our main goal. Everything is possible, some things just very slowly.
WDQS is exceptional, in the sense that it has an uptime unlike any public SPARQL endpoint. But for the average public SPARQL endpoint, my answer would be: “Yes, that's 0.2s–if you're lucky. Otherwise, it can take 1h for the endpoint to come up again. The TPF interface is 55s, but consistently so.”
We document the speed/cost trade-off extensively in our research [2], especially in our ISWC paper [3] and JWS article [4].
The JWS article also shows that we perform great on federation, in some cases even better than the state-of-the-art with SPARQL endpoints.
I am afraid that hard queries which would timeout on WDQS might take too long to be used at all.
True, but they would cost less on the server. And that's what we optimize for, not speed.
Can I use the service from a program
Yes: https://github.com/LinkedDataFragments/Client.js
Ideally, I would like to use it like a SPARQL service that I send a request to. Is this possible?
Yes, the above software package also has a SPARQL endpoint.
I understand that it is a demo, and what the motivation is on paper
So for clarity: the motivation is low server cost, and easy federation.
if it returns incorrect results, then it is of little use.
Of course; I'll look into the bug.
Or are federated queries the main goal here? (that's still useful, but I hope that WDQS will also support a whitelisted set of external endpoint at some time)
Not just about the whitelist; TPF has been shown to do several cases of federation faster and with higher completeness.
Best,
Ruben
[1] https://aran.library.nuigalway.ie/handle/10379/4545 [2] http://linkeddatafragments.org/publications/ [3] http://linkeddatafragments.org/publications/iswc2014.pdf [4] http://linkeddatafragments.org/publications/jws2016.pdf
On 12/21/16 2:52 PM, Ruben Verborgh wrote:
For this, I'd like to point to the overall aim of the LDF project, as documented on our website and papers. Summarizing: the SemWeb community has almost exclusively cared about speed so far concerning query execution. This has resulted in super-fast, but super-expensive services, which simply don't work on the public Web. More than half of all public SPARQL endpoints are down for more than 1.5 days each month [1].
Ruben,
The Semantic Web community hasn't focused exclusively on query execution speed.
Anyone that encounters a service (Web or Semantic Web) expects results in acceptable timeframes (typically <= 250ms) , that's a function of user behavior on the Web or anywhere else. Thus, a less overarching characterization would be as follows: The Linked Open Data community, a sub segment of the Semantic Web community, has focused on providing solutions that work, a prominent example (that I know well) is DBpedia, and many bubbles around it in the LOD Cloud.
You will find that Wikidata, is doing the very same thing, but with much more hardware at their disposal, since they have more funding than DBpedia, at this point in time.
That basic response time expectations of users drives everything, all the time.
The key issue here is all about what method a given service providers chooses en route to addressing the expectations of users, as I've outlined above. Fundamentally, each service provider will use a variety of solution deployment techniques that boil down to:
1. Massive Server Clusters (sharded) and Proxies
2. Fast multi-threaded instances (no sharding but via replication topologies) behind proxies (functioning as cops, so to speak).
Your "Simply doesn't work on the public Web" claim is subjective, I've told you that repeatedly. I am sure others will ultimately tell you the very same thing :)
Hi Kingsley,
The Semantic Web community hasn't focused exclusively on query execution speed.
Let me clarify myself: the scientific SemWeb community mostly focused on speed, as is apparent from publications about SPARQL query execution (and, from personal experience, many researchers and reviewers still having trouble to understand why speed is not our main focus).
Anyone that encounters a service (Web or Semantic Web) expects results in acceptable timeframes (typically <= 250ms) , that's a function of user behavior on the Web or anywhere else.
Yes, and it is my opinion that public SPARQL endpoints overpromise in that regard. The whole public SPARQL endpoint discourse has made us believe that it is actually realistic to have free+fast+high availability, as is the case for any other Web service. But given that SPARQL is more expressive per request than any other Web service I know, this cannot hold.
In simple terms: SPARQL is a very expressive and hence very expensive API.
In technical terms: show me any other API that exposes a PSPACE-complete interface.
You will find that Wikidata, is doing the very same thing, but with much more hardware at their disposal, since they have more funding than DBpedia, at this point in time.
Indeed.
Your "Simply doesn't work on the public Web" claim is subjective
Let me clarify "simply doesn't work": companies/institutions that host their data in any other API on the Web will see a substantial increase in server costs when they try to host that same data as a public SPARQL HTTP service. My claim is that this increase is so substantial, that SPARQL endpoints cannot become a reality on the public Web at the same customer cost (= often free) of any other API on that same Web, and hence will not become a reality. Concretely, for most institutions that want to make their data queryable for free, the SPARQL protocol will simply be too expensive for their budgets. Alternatives, like dumps, LD documents, TPF, might be feasible, but they all come at another cost. No silver bullet.
So far, that claim has not been proven wrong.
Best,
Ruben
On 12/21/16 4:57 PM, Ruben Verborgh wrote:
Hi Kingsley,
The Semantic Web community hasn't focused exclusively on query execution speed.
Let me clarify myself: the scientific SemWeb community mostly focused on speed, as is apparent from publications about SPARQL query execution (and, from personal experience, many researchers and reviewers still having trouble to understand why speed is not our main focus).
Research papers and conference workshops have focused on these matters, for a variety of reasons.
As I said, the driver for this focus in reality, is the 250 msec response time which is a key threshold for human attentions when working with solutions (on or offline).
Anyone that encounters a service (Web or Semantic Web) expects results in acceptable timeframes (typically <= 250ms) , that's a function of user behavior on the Web or anywhere else.
Yes, and it is my opinion that public SPARQL endpoints overpromise in that regard.
They don't.
SPARQL endpoints exist, and experience varies. Ditto motivations behind the endpoints.
The whole public SPARQL endpoint discourse has made us believe that it is actually realistic to have free+fast+high availability, as is the case for any other Web service.
There is no such thing as a free+fast+high availability solution that costs the solution provider $0.00. That simply doesn't exists!!
But given that SPARQL is more expressive per request than any other Web service I know, this cannot hold.
In simple terms: SPARQL is a very expressive and hence very expensive API.
In technical terms: show me any other API that exposes a PSPACE-complete interface.
SPARQL is a Query Language that includes an HTTP API accessible via SAPRQL Endpoints.
What you are not accepting is the notion of queries that complete, in a configurable query completion timeframe.
You will find that Wikidata, is doing the very same thing, but with much more hardware at their disposal, since they have more funding than DBpedia, at this point in time.
Indeed.
Your "Simply doesn't work on the public Web" claim is subjective
Let me clarify "simply doesn't work": companies/institutions that host their data in any other API on the Web will see a substantial increase in server costs when they try to host that same data as a public SPARQL HTTP service.
Again subjective. You are implying that cost vs benefit analysis don't drive decisions to put services on the Web, of course they do.
My claim is that this increase is so substantial, that SPARQL endpoints cannot become a reality on the public Web at the same customer cost (= often free) of any other API on that same Web, and hence will not become a reality.
The costs are not prohibitive. This is where I utterly completely with you.
Concretely, for most institutions that want to make their data queryable for free, the SPARQL protocol will simply be too expensive for their budgets.
Academic institutions, maybe. The rest of the world, it basic economics, value propositions, and business models.
Alternatives, like dumps, LD documents, TPF, might be feasible, but they all come at another cost. No silver bullet.
Has anyone told you that SPARQL Endpoints are a silver bullet?
Kingsley
So far, that claim has not been proven wrong.
Best,
Ruben _______________________________________________ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Hi Kingsley,
will see a substantial increase in server costs when they try to host that same data as a public SPARQL HTTP service.
Again subjective.
No, that's not subjective, that's perfectly measurable. And that's exactly what we did in our research.
The problem with the SPARQL protocol as an API is that the per-request cost is a) higher and b) much more variable than any other API.
Everywhere else on the Web, APIs shield data consumers from the backend, limiting the per-request complexity. That's why they thrive and SPARQL endpoints don't.
Don't get me wrong, I'm happy with every highly available SPARQL endpoint out there. Wikidata and DBpedia are awesome. It's just that there are too few and I see cost as a major factor there.
You are implying that cost vs benefit analysis don't drive decisions to put services on the Web, of course they do.
Quite the contrary, I am arguing that—and this is subjective— because cost/benefit analyses drive decisions on the Web, we will never have substantially more SPARQL endpoints on the public Web than we have now. They're just too expensive.
Federation is where I think public SPARQL endpoints will fail, so it will be worthwhile to see what happens.
Really, then you will ultimately be surprised on that front too!
I really really hope so. If one day, machines can execute queries on the Web as well as we can, I'd be really happy. My way to reach that is lightweight interfaces, but if it is possible with heavyweight interfaces, all the better.
Best,
Ruben
FWIW, WiFi/DSL access from Germany...
"Brad Pitt": 128 results in 11.2s
"Antarctic rivers": Silently fails with an internal server error, time keeps running: https://query.wikidata.org/bigdata/ldf?subject=http%3A%2F%2Fwww.wikidata.org...
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Hi all,
just for the sake of "testing": from my quite old PC, running Firefox, connected to the Eduroam network via Wi-Fi, located in Czechia, it gives:
"Brad Pitt": 128 results in 18.9s
So, Markus' 55s for the same query on better hardware, better connection, in better location (in theory), sounds quite strange.
Regards, Jan
On 22.12.2016 13:32, Jan Macura wrote:
I am on a MacBook Pro with an i5 CPU and 8GB of RAM. I just retried, running the query four times from Firefox. The results I got:
"125 results in 54.7s" (timer stopped)
"18 results in 14.5s" (timer stopped)
"43 results in 614.1s" (timer wouldn't stop; I cancelled this)
"131 results in 37.9s" (timer stopped)
The same query seems to run faster in Chrome, where I get:
"128 results in 24.9s"
"131 results in 8.1s"
"131 results in 10.0s"
This part seems fairly consistent, besides the changing number of results (are there live data updates for this service that might explain this?).
Best regards,
Markus
Well, I just reran the query on WDQS and there are still 128 results. So it seems that the 131, the 18, and the 125 I was getting in my tests were not correct (in fact, only one out of 7 exact same queries returned a result that might have been correct, judging only from the number of results).
I tried "clear cache and hard reload" but I still got 131. Retrying once more gave me 128 again. After that it was again 131. Seems there is some randomness in this bug. I did those last few tests on Chrome, where at least performance is pretty consistent (about 10 sec on my machine). I am on Ethernet in the university.
Best regards,
Markus
One last test: after understanding that many SPARQL features are a problem, I am trying to find simple queries. Here is a Wikidata example query that has no features other than triple patterns:
SELECT ?street ?city ?named WHERE {
  ?street wdt:P31 wd:Q79007 .
  ?street wdt:P17 wd:Q142 .
  ?street wdt:P131 ?city .
  ?street wdt:P138 ?named .
  ?named wdt:P21 wd:Q6581072
}
("streets named after women"; I removed labels and ORDER to make it simpler)
According to WDQS, the query should have 320 results (computed in 1.3s there). It's taking a long while on my machine though. Can anyone give a runtime for this? Does it eventually find correct results?
Markus
Hi Markus,
First test of your query from Paris, on a Mac Pro on my enterprise WiFi:
- On Chrome:
  - First run: first results after 95s (3 results)
  - 2nd run: first results after 25s (3 results)
  - 3rd run: first results after 14s (3 results)
- On Firefox:
  - First run: first results after 65s (e.g., street Q3451473)
I am waiting to see if I can get all the results....
Ghislain
Hi all,
Thanks for all your feedback regarding the results and timings.
I must say I'm quite surprised by the high variability; development of the software started 3 years ago, and it has been battle-tested many times since, so such large deviations are highly unexpected. We haven't had similar bug reports with existing interfaces, such as DBpedia etc., so something must be different here.
I wonder to what extent the Java server plays a role, as our previous tests were mostly with the Node.js server. In particular, I was able to reproduce the 131 results, and found 3 duplicates in there, so I wonder whether the server-side pagination works alright.
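One way offset-based pagination can produce duplicates over a live-updated dataset is sketched below. This is a deliberately simplified toy in Python, not the actual Blazegraph/TPF server code, and whether this mechanism explains the 131-vs-128 counts is speculation: an insert before the cursor shifts all offsets, so the triple at the old page boundary is returned twice and the client-side total is inflated.

```python
# Toy illustration (invented, not the actual server code) of how paging
# over a changing dataset yields duplicates: a live insert between two
# page requests shifts offsets, so one triple appears on both pages.

def fetch_page(triples, offset, limit):
    """Offset-based pagination, as a fragment server might implement it."""
    return triples[offset:offset + limit]

triples = list(range(10))            # ten triples, ids 0..9
page1 = fetch_page(triples, 0, 5)    # [0, 1, 2, 3, 4]
triples.insert(0, 99)                # a live update lands between requests
page2 = fetch_page(triples, 5, 5)    # [4, 5, 6, 7, 8] -- triple 4 repeats

results = page1 + page2
print(len(results))                  # 10 rows fetched...
print(len(set(results)))             # ...but only 9 distinct triples
```

A client that concatenates pages without deduplicating would report the inflated count.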
Lots of things to investigate for us, so keep it coming!
Thanks,
Ruben
On 22 December 2016 at 14:52, Ghislain ATEMEZING < ghislain.atemezing@gmail.com> wrote:
In my same conditions as above:
- first run in FFx: first results after 106s (3 results)
- second run: even worse
J.
Final results in FFx in my settings: 166 results in 1950.2s. Still waiting for the results on Chrome (currently 192 results in 2120s).
Probably a relevant question for Stas given the varying measurements: is the TPF server being fronted by a cache? If not, that might partly explain some of the things we're seeing.
Here's an example NGINX config I use on fragments.dbpedia.org: https://gist.github.com/RubenVerborgh/6d4ac975f0f36b6d296295dd2160c50f
Best,
Ruben
[Just to finish with this test] Finally on Chrome: 320 results in 3496.4s in my settings. HTH Ghislain
On 22.12.2016 16:49, Ghislain ATEMEZING wrote:
[Just to finish with this test] Finally on Chrome: 320 results in 3496.4s in my settings.
I got the same result on Chrome, but it took 4600 sec here (1h 20min).
Markus
Finally on Chrome: 320 results in 3496.4s in my settings.
I got the same result on Chrome, but it took 4600 sec here (1h 20min).
Definitely a case of the query planner making a very wrong decision here. These are interesting examples we need to check when designing the new query engine.
For this particular query, I can imagine the star shape to be problematic for the query algorithm. Probably, the best choice here is to:
– first evaluate ?named wdt:P21 wd:Q6581072
– then evaluate ?street wdt:P138 ?named
– then download all data of ?street
However, the current engine will go through all 5 patterns in a chain. Other work looked at better algorithms (http://linkeddatafragments.org/publications/#query-algorithm-optimizations), but these still need to be added to the public engine.
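The ordering Ruben describes can be sketched as a greedy selectivity heuristic: pick the cheapest pattern first, then repeatedly pick the cheapest pattern that shares a variable with what is already bound. This is a minimal Python sketch, not the Client.js planner; the counts below are invented for illustration (a real TPF client reads match counts from each fragment's metadata).

```python
# Greedy triple-pattern ordering: cheapest first, preferring patterns
# connected (by a shared variable) to those already chosen, so bindings
# propagate. Counts are invented stand-ins for fragment metadata.

def order_patterns(patterns):
    remaining = sorted(patterns, key=lambda p: p["count"])
    ordered = [remaining.pop(0)]
    bound = set(ordered[0]["vars"])
    while remaining:
        connected = [p for p in remaining if bound & set(p["vars"])]
        nxt = min(connected or remaining, key=lambda p: p["count"])
        remaining.remove(nxt)
        ordered.append(nxt)
        bound |= set(nxt["vars"])
    return [p["name"] for p in ordered]

patterns = [
    {"name": "?street P31 Q79007",  "vars": ["street"],          "count": 90000},
    {"name": "?street P17 Q142",    "vars": ["street"],          "count": 50000},
    {"name": "?street P131 ?city",  "vars": ["street", "city"],  "count": 80000},
    {"name": "?street P138 ?named", "vars": ["street", "named"], "count": 7000},
    {"name": "?named P21 Q6581072", "vars": ["named"],           "count": 2000},
]
print(order_patterns(patterns))
# starts with '?named P21 Q6581072', then '?street P138 ?named'
```

With these made-up counts, the heuristic recovers exactly the order suggested above: the P21 pattern, then P138, then the remaining ?street patterns.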
Thanks for the feedback, this has been very interesting already. It's curious such examples never came up with DBpedia in the two years we had the fragments endpoint up. I wonder whether that's due to the Wikidata community's enthusiasm or due to the different shape of the data.
Cheers,
Ruben
Hi!
Probably a relevant question for Stas given the varying measurements: is the TPF server being fronted by a cache?
Yes, they are all behind varnish. It should be pretty easy to bypass it though - IIRC it's URL-based, so adding something like &dontcache=<random number> to the URL would bypass the cache.
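A minimal sketch of that URL-based cache bypass, assuming (as Stas says) the cache keys on the full URL and the server simply ignores unknown parameters; the parameter name `dontcache` follows his suggestion and is otherwise arbitrary.

```python
# Append a throwaway random query parameter so the URL never matches a
# cached entry. Assumes the backend ignores unknown parameters.
import random
from urllib.parse import urlencode

def bust_cache(base_url, params):
    params = dict(params, dontcache=random.randrange(10**9))
    return base_url + "?" + urlencode(params)

url = bust_cache("https://query.wikidata.org/bigdata/ldf",
                 {"predicate": "http://www.wikidata.org/prop/direct/P31"})
print(url)
```

Useful for benchmarking cold-cache behaviour; for normal use you would of course want the cache hits.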
Here's an example NGINX config I use on fragments.dbpedia.org: https://gist.github.com/RubenVerborgh/6d4ac975f0f36b6d296295dd2160c50f
Our nginx config is here: https://github.com/wikimedia/operations-puppet/blob/production/modules/wdqs/...
Note that this is only the backend stuff, frontend caching is handled by varnish cluster, which is kinda complex, docs are here: https://wikitech.wikimedia.org/wiki/Caching_overview
If you see anything that can be improved, please tell.
Hi Stas,
I found an important problem with the current configuration, that is likely a major factor in the performance loss we see.
I noticed that https://query.wikidata.org/bigdata/ldf is served over HTTPS, and even with HTTP/2. However, the hypermedia controls inside of the message direct the client to the HTTP version of the URL:
$ curl -s -H "Accept: text/turtle" https://query.wikidata.org/bigdata/ldf | grep '?subject'
http://query.wikidata.org/bigdata/ldf{?subject,predicate,object}"
This means that, for every request, the browser has to perform 2 requests:
1. the request to http://query.wikidata.org/bigdata/ldf?…, which will redirect to the https:// version
2. the request to https://query.wikidata.org/bigdata/ldf?…
So this gives us all of the drawbacks of HTTP 1.1 and none of the advantages of HTTP/2 :-)
Could you please set "baseURL": "https://query.wikidata.org/bigdata/" and verify that this indeed gives https://query.wikidata.org/bigdata/ldf{?subject,predicate,object}" in the resulting pages?
Best,
Ruben
Yup, I believe the ticket for that is https://phabricator.wikimedia.org/T153897
Hi!
I found an important problem with the current configuration, that is likely a major factor in the performance loss we see.
I noticed that https://query.wikidata.org/bigdata/ldf is served on HTTPS, and even with HTTP/2. However, the hypermedia controls inside of the message direcs the server to the HTTP version of the URL:
Yes, I noticed that too, filed as: https://phabricator.wikimedia.org/T153897
Unfortunately, we are now in a deployment freeze until January, so it's hard to fix immediately, as a proper fix requires editing some configs and deploying some patches, and it doesn't seem to be production-breaking. But starting January 2 we have deployments again, so it'll be fixed then.
But it should not be that big of an issue, since Wikimedia servers use HSTS, which means no actual HTTP requests are performed in a modern browser. So I don't think it slows things down that much.
Hi!
I tried "clear cache and hard reload" but I still got 131. Retrying once more gave me 128 again. After that it was again 131. Seems there is some randomness in this bug. I did those last few tests on Chrome, where at least performance is pretty consistent (about 10 sec on my machine). I am on Ethernet in the university.
I get 128 on SPARQL, so I assume it's the correct answer. I'll try to see what the difference is - I suspect it may be a slight difference in language match implementation or something like that. If it were *fewer* results, I'd assume it is some network/delivery problem, but I don't know yet what to think about 3 extra results.
I get 128 and 131 results intermittently, so I suspect some kind of bug, not sure where yet.
On 12/22/16 3:37 AM, Ruben Verborgh wrote:
Hi Kingsley,
will see a substantial increase in server costs when they try to host that same data as a public SPARQL HTTP service.
Again subjective.
No, that's not subjective, that's perfectly measurable. And that's exactly what we did in our research.
That doesn't negate the fact that your world view is subjective. You've started this whole thing from a fuzzy premise. For instance, why do you think SPARQL exists, and how have you arrived at the conclusion that it is some kind of Semantic Web frontier?
SPARQL Query Services are just one of many data definition and manipulation services available to HTTP network users (public or private) working with RDF relations.
In some cases, service providers use SPARQL to facilitate and/or complement Linked Open Data publishing efforts.
The problem with the SPARQL protocol as an API is that the per-request cost is a) higher and b) much more variable than any other API.
A Protocol isn't the same thing as an Application Programming Interface (API), in my world view. APIs provide interaction abstraction over protocols.
ODBC and JDBC are APIs for building applications against RDBMS applications that interact with relations represented as Tables, using SQL (and in the case of Virtuoso, SQL, SPARQL, and the SPASQL hybrid). Those APIs include abstractions over TCP/IP and other protocols. Jena, Sesame, Redland, and others do provide APIs that offer similar functionality to the aforementioned, with regards to RDF triple and quad stores.
The SPARQL Protocol extends HTTP with an ability to include SPARQL queries and solutions as part of its request and response payloads.
IMHO, your position is based on a claim that isn't being made by SPARQL compliant product providers. I continue to sense some confusion about how it has been used and spoken about, with regards to the early days of the LOD community i.e., there's no Linked Data without a SPARQL endpoint, or use of SPARQL etc..
SPARQL Query Language, Protocol, and Results Serialization Formats are simply tools, like many others, that can be used to solve a variety of problems. Nobody ever claimed (as far as I know) that the SPARQL composite is (or was) a "silver bullet".
Everywhere else on the Web, APIs shield data consumers from the backend, limiting the per-request complexity. That's why they thrive and SPARQL endpoints don't.
See my comment above. Your characterization is inaccurate.
Don't get me wrong, I'm happy with every highly available SPARQL endpoint out there. Wikidata and DBpedia are awesome. It's just that there are too few and I see cost as a major factor there.
It's hard to understand the statement above. Fundamentally, Wikidata & DBpedia have addressed specific challenges, and the inability of others to emulate them (in your world view) has little to do with SPARQL and everything to do with motivation, engineering capability, and general experience with RDBMS technology.
You are implying that cost-vs-benefit analyses don't drive decisions to put services on the Web; of course they do.
Quite the contrary, I am arguing that—and this is subjective—because cost/benefit analyses drive decisions on the Web, we will never have substantially more SPARQL endpoints on the public Web than we have now. They're just too expensive.
Like the statement you made prior, I am struggling to understand your point. You can't simply throw "too expensive" at something, and decide that's definitive for everyone. That simply isn't the route to a coherent pitch.
You are taking the world view of a niche and declaring it universal. What entity (in this case: Person or Organization) profile would find this endeavor expensive? A student, academic institution, commercial company, government?
Federation is where I think public SPARQL endpoints will fail, so it will be worthwhile to see what happens.
Really, then you will ultimately be surprised on that front too!
I really really hope so. If one day, machines can execute queries on the Web as well as we can, I'd be really happy.
I still don't really understand what you mean by "as well as we can". All I've seen thus far is a pitch about availability that is justifiably slow, combined with an inability to deal with complex queries. I also notice that you don't say much about:
1. change sensitivity; and
2. actual data loading and deployment time, in a rapidly changing world increasingly driven by data.
My way to reach that is lightweight interfaces, but if it is possible with heavyweight interfaces, all the better.
Again, heavyweight and lightweight are totally subjective characterizations :)
Best,
Ruben
Hi!
You will find that Wikidata is doing the very same thing, but with much more hardware at their disposal, since they have more funding than DBpedia at this point in time.
Well, we are now running on two servers (and hopefully getting another one next Q), with a mirror set on standby in case of disaster. It's not *much* more hardware, I think (no idea what HW the DBpedia one runs on, but it must be at least one server? ;). But IIRC DBpedia has only a static snapshot - please correct me here if I'm wrong - while we do live updates. Which, yes, makes performance an ongoing concern. That's why we have the 30s timeout and connection limits :)
Still, expecting sub-second responses for SPARQL on any random query on billion-sized database sounds unrealistic to me.
The key issue here is all about what method a given service providers chooses en route to addressing the expectations of users, as I've outlined above. Fundamentally, each service provider will use a variety of solution deployment techniques that boil down to:
- Massive Server Clusters (sharded) and Proxies
Sharding makes querying much harder IIUC. Though would like to see some data on how big DBs behave under sharding vs. just distributing requests across servers.
- Fast multi-threaded instances (no sharding, but via replication topologies) behind proxies (functioning as cops, so to speak).
That's basically what we're doing now.
Dear Kingsley,
TL;DR: +1
On 21.12.2016 22:45, Kingsley Idehen wrote:
On 12/21/16 2:52 PM, Ruben Verborgh wrote:
For this, I'd like to point to the overall aim of the LDF project, as documented on our website and papers. Summarizing: the SemWeb community has almost exclusively cared about speed so far concerning query execution. This has resulted in super-fast, but super-expensive services, which simply don't work on the public Web. More than half of all public SPARQL endpoints are down for more than 1.5 days each month [1].
Ruben,
The Semantic Web community hasn't focused exclusively on query execution speed.
Anyone that encounters a service (Web or Semantic Web) expects results in acceptable timeframes (typically <= 250ms); that's a function of user behavior on the Web or anywhere else. Thus, a less overarching characterization would be as follows: the Linked Open Data community, a subsegment of the Semantic Web community, has focused on providing solutions that work; a prominent example (that I know well) is DBpedia, and the many bubbles around it in the LOD Cloud.
You will find that Wikidata is doing the very same thing, but with much more hardware at their disposal, since they have more funding than DBpedia at this point in time.
(Actually, I don't think that WDQS uses that much hardware. Three machines right now, if I am not mistaken. Surely not an amount that one could not get at other organisations. Stas could comment.)
That basic response time expectations of users drives everything, all the time.
The key issue here is all about what method a given service providers chooses en route to addressing the expectations of users, as I've outlined above. Fundamentally, each service provider will use a variety of solution deployment techniques that boil down to:
Massive Server Clusters (sharded) and Proxies
Fast multi-threaded instances (no sharding, but via replication topologies) behind proxies (functioning as cops, so to speak).
Your "Simply doesn't work on the public Web" claim is subjective, I've told you that repeatedly. I am sure others will ultimately tell you the very same thing :)
Yes, I was about to say that in my last post. There are a lot of well-maintained, stable endpoints out there (not just at Wikimedia). LDF might have its uses, but it is probably too early to say if it could one day be a real alternative to the production-grade solutions that the community has produced already.
As a consumer of SPARQL services and a developer of SPARQL-based applications, I find it very hard so far to see a use for LDF in its current state. It was clearly not built for interactive operation, but analytical queries do not seem to return results at all. Maybe the main usage could be federation, which indeed suffers from many of the same issues that LDF has to struggle with. Or maybe some ideas of LDF could feed into the federated query implementations of SPARQL stores, so as to combine the best of both worlds.
Best,
Markus
Hi Markus,
It was clearly not built for interactive operation
On the contrary, it is: imagine applications in the browser that react when each result comes in. Don't focus on the total time, focus on the results streaming in.
Web querying takes time, especially in a federated setting. The whole idea of TPF is that it's going to take time, so better to do something interactively while waiting.
The old query paradigm "ask, wait, do" is just not fit for the Web. For querying on the open Web, it is "ask, do as the results come in".
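The "ask, do as the results come in" model can be sketched as a generator: the query engine yields each solution mapping the moment it is found, so a consumer can act per result instead of blocking on the full set. This is plain Python for illustration, not the actual Client.js code; the two extra street ids are invented (Q3451473 is the one from the thread).

```python
# "Ask, do as the results come in": consume solutions incrementally
# from a generator instead of waiting for the complete result set.

def evaluate_query():
    """Stand-in for a TPF client; in reality each yield would follow
    one or more fragment requests, possibly seconds apart."""
    for street in ["Q3451473", "Q100", "Q200"]:  # last two ids invented
        yield {"street": street}

rendered = []
for solution in evaluate_query():
    rendered.append(solution)  # e.g. append a row to the page here
    print("rendered", len(rendered), "results so far")
```

The design point is that the consumer's loop body runs after each yield, which is what makes incremental UI updates possible at all.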
but analytical queries do not seem to return results at all.
TPF is not built for analysis at all. For heavy analyses of anything, I would suggest anyone to just download the data and host your own server. TPF is for live querying of multiple sources on the Web.
it is probably too early to say if it could one day be a real alternative to the production-grade solutions that the community has produced already.
Point taken. I just wished the community produced Web-grade solutions ;-)
Cheers,
Ruben
Hi Ruben,
On 21.12.2016 23:20, Ruben Verborgh wrote:
Hi Markus,
It was clearly not built for interactive operation
On the contrary, it is: imagine applications in the browser that react when each result comes in. Don't focus on the total time, focus on the results streaming in.
In the queries I tried, no results were streaming in whatsoever. I have developed one Web UI that uses SPARQL, and where results come in asynchronously after the initial page display (SQID). This behaviour makes sense, but only for the first 10 sec or so. After this, it is increasingly annoying to get late results that shift your page layout and to be unsure if more things will come later. It's maybe ok for a single-query UI with a Google-like interface, but as soon as you have a rich application with a more complex page layout (think BBC music), dynamic updates of the view (in arbitrary places) are definitely not appealing.
Web querying takes time, especially in a federated setting. The whole idea of TPF is that it's going to take time, so better to do something interactively while waiting.
The old query paradigm "ask, wait, do" is just not fit for the Web. For querying on the open Web, it is "ask, do as the results come in".
I don't think so. I believe users want to see a stable result, basically instantaneously. They don't want to witness the computation process in slow motion.
but analytical queries do not seem to return results at all.
TPF is not built for analysis at all. For heavy analyses of anything, I would suggest anyone to just download the data and host your own server. TPF is for live querying of multiple sources on the Web.
I see.
it is probably too early to say if it could one day be a real alternative to the production-grade solutions that the community has produced already.
Point taken. I just wished the community produced Web-grade solutions ;-)
I don't get what you mean by this. It sounds like a marketing term to me that has no serious meaning. What exactly is your problem with the current Wikidata or DBpedia SPARQL endpoints? Why are they not "Web grade" in your view?
Kind regards,
Markus
Hi!
In the queries I tried, no results were streaming in whatsoever. I have
I've tried the "antarctic rivers" query:
SELECT ?river ?riverLabel ?location WHERE {
  ?river wdt:P31/wdt:P279* wd:Q355304;
         wdt:P30 wd:Q51.
  OPTIONAL { ?river wdt:P625 ?location. }
}
and after waiting for a very long time I got only about 2 results. Also, I suspect this engine does not implement path queries right; e.g., I see requests like this:
https://query.wikidata.org/bigdata/ldf?subject=http%3A%2F%2Fwww.wikidata.org...
i.e. it tries to match Q7994940-*-Q355304. But instead of that, it should be trying Q7994940-P31-* and seeing whether Q355304 is there, and if not, keeping those for matching via P279.
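That evaluation order can be sketched as a breadth-first walk: fetch the item's direct P31 classes, then follow P279 edges upward until the target class is (or isn't) reached, instead of enumerating every ?x-P279*-Q355304 pair. A minimal Python sketch over an invented toy graph (the class ids below are made up, only the item/target ids come from the query above):

```python
# Evaluate item wdt:P31/wdt:P279* target by BFS over a toy in-memory
# graph. P31 maps an item to its direct classes; P279 maps a class to
# its superclasses. Both dicts are invented for illustration.
from collections import deque

P31 = {"Q7994940": ["QclassA"]}                          # item -> classes
P279 = {"QclassA": ["QclassB"], "QclassB": ["Q355304"]}  # subclass edges

def matches_path(item, target):
    """True iff `item` reaches `target` via P31 then zero+ P279 hops."""
    seen = set()
    queue = deque(P31.get(item, []))
    while queue:
        cls = queue.popleft()
        if cls == target:
            return True
        if cls not in seen:
            seen.add(cls)
            queue.extend(P279.get(cls, []))
    return False

print(matches_path("Q7994940", "Q355304"))  # True for this toy graph
```

The `seen` set guards against cycles in the subclass hierarchy, which do occur in real data.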
Also, if I rewrite query like this:
SELECT ?river ?riverLabel ?location WHERE { ?river wdt:P30 wd:Q51; wdt:P31/wdt:P279* wd:Q355304 . OPTIONAL { ?river wdt:P625 ?location. } }
It takes about 1000(!) seconds to find the first river, and it only found two that are directly P31, so I suspect path queries are not going to work that well. I wonder whether it does breadth-first search somehow?
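For what it's worth, the strategy described above — one P31 hop followed by breadth-first expansion over P279 — can be sketched in a few lines (Python for brevity; `fetch_objects` is a hypothetical stand-in for a TPF request on a `subject predicate ?o` pattern, not an actual client API):

```python
from collections import deque

def matches_path(fetch_objects, start, target):
    """Check whether `start` reaches `target` via wdt:P31/wdt:P279*,
    i.e. one P31 edge followed by zero or more P279 edges.
    `fetch_objects(s, p)` stands in for a TPF request returning
    all objects matching the pattern `s p ?o`."""
    frontier = deque(fetch_objects(start, "wdt:P31"))   # first hop: P31
    seen = set(frontier)
    while frontier:
        node = frontier.popleft()
        if node == target:
            return True
        for nxt in fetch_objects(node, "wdt:P279"):     # then P279*
            if nxt not in seen:                         # avoid cycles
                seen.add(nxt)
                frontier.append(nxt)
    return False
```

Each `fetch_objects` call is one (paged) TPF request, which is exactly why a wrong expansion order gets so expensive over the network.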
Hi Stas,
Also, I suspect this engine does not implement path queries right
It doesn't implement them at all; it reads the first predicate and ignores the rest.
Right now, the engine implements:
– BGPs
– UNION
– OPTIONAL
– some FILTERs (https://github.com/LinkedDataFragments/Client.js/blob/master/lib/util/Sparql...)
Or, very simply put: everything needed to run the common SemWeb benchmarks such as BSBM, WatDiv, FedBench.
That's all we can afford as researchers, unfortunately :-) I hope to build more (and lots was built in my spare time), but my currency as a researcher is publishable results, and sadly, not many SPARQL features are.
Nonetheless, plenty of things to have fun with, and I'm happy to put feature requests on the list.
Best,
Ruben
PS There are plans for a new open-source SPARQL/JS engine powered by a larger development team, but that will take its time.
Ruben,
Looks like Jena is doing most of your heavy lifting in the Java client? https://github.com/LinkedDataFragments/Client.Java/search?p=2&q=sparql&a...
Or is there other code that I am missing somewhere in your repository?
-Thad
Hi Thad,
Looks like Jena is doing most of your heavy lifting in the Java client ?
Absolutely, Jena is doing almost everything.
However, Jena is built with certain assumptions that don't hold for querying over the public Web as with TPF, so it doesn't work as optimally as for other backends.
Ruben
Hi!
That's all we can afford as researchers, unfortunately :-) I hope to build more (and lots was built in my spare time), but my currency as a researcher is publishable results, and sadly, not many SPARQL features are.
Understandable. And now that we have a working TPF server and the prototype client, I imagine people can build on that. We probably should not expect to use this as a generic SPARQL engine (we already have one of those), but I wonder which use cases we can improve with it.
PS There are plans for a new open-source SPARQL/JS engine powered by a larger development team, but that will take its time.
Interesting, glad to hear this.
Hi Ruben,
I understand the theory and motivation behind LDF (I did read your paper ;-) but appealing ideas do not always turn out to work well in practice. Of course all the issues I observe might be due to implementation. Maybe the general idea could still be made to work somehow. All I am doing is running queries and seeing what I get. Unfortunately, so far, I have not found many queries that return results at all (caveat: I cannot test very quickly since each query takes very long to run). What I was trying so far:
* "Brad Pitt" from the first page: 55 sec in Firefox, but only 6 sec in Chrome (!). Results are not really correct, but as you said, this should be fixed soon.
* "Rivers in Antarctica" (with all label fetching triples removed) This is really a rather direct query, with merely one OPTIONAL and three triple patterns in total. It ran for about 5 min before stopping and clearing the timer (no results shown).
* "Overall causes of death ranking" No luck there; it terminated with no results. The timer resets sometimes for some reason.
* "Largest cities of the world" (with labels fetched by the plain SPARQL method as in the example query) This one got me a *server-side* timeout message after 5min of waiting; I would paste it here but the UI has reset itself before I could copy it.
* "Most eponymous mathematicians" (with all labels and the unsupported SAMPLE clause removed) I left that run for over an hour now. Still no result. The timer resets once in a while (it is at over 1900sec again as I write this). The query (with labels) took 3sec on Blazegraph.
Maybe I am just trying the wrong queries? Do you have a sample query from the Wikidata query samples that you would recommend as a good showcase?
Best regards,
Markus
Hi Markus,
(I did read your paper ;-)
Awesome :-)
Of course all the issues I observe might be due to implementation.
Or due to the kind of queries. Some queries will always be hard; with SPARQL endpoints, you pay the price in server cost; with TPF, you pay the price in query time and bandwidth. No silver bullet.
Maybe the general idea could still be made to work somehow.
Certainly, especially if "general idea" also includes looking for other alternatives beyond TPF, as recent research has done.
TPF is by no means a final solution. It's the start of a dialogue: look what happens with scalability if we make the server more lightweight.
- "Brad Pitt" from the first page: 55sec in Firefox, but only 6sec in Chrome (!).
That's due to caching. Clearing the cache will yield more similar results.
Results are not really correct, but as you said that this should be fixed soon.
I've published v2.0.4 of the client, that should fix things when deployed.
- "Rivers in Antarctica" (with all label fetching triples removed) This is really a rather direct query, with merely one OPTIONAL and three triple patterns in total. It ran for about 5 min before stopping and clearing the timer (no results shown).
Would need to look into that; unfortunately, my job mostly allows me to create prototypes. We could really use devs (and budget) to develop more :-)
- "Largest cities of the world" (with labels fetched by the plain SPARQL method as in the example query) This one got me a *server-side* timeout message after 5min of waiting; I would paste it here but the UI has reset itself before I could copy it.
This query contains a blocking operator; those are indeed hard with TPF and will always be. To make those run fast, more powerful interfaces are useful (but they come of course at a cost).
Maybe I am just trying the wrong queries?
I think so, yes.
This also shows our departure from typical SemWeb ideas: with TPF, we accept that some things are slow. Any query is possible (given a completely implemented engine), but some of them just take a lot of time. It has never been my goal to be able to evaluate any query fast; I'm particularly interested in making those queries fast that humans could also easily execute over Wikipedia/Wikidata. I.e., I see the query engine mostly as an improvement over manual Web searches, not as a trimmed-down version of a centralized SPARQL engine. We just use the same language. I aim to improve the Web, not SPARQL ;-)
Do you have a sample query from the Wikidata query samples that you would recommend as a good showcase?
I don't have too much experience with Wikidata, unfortunately, but I can give you some inspiration with DBpedia: http://client.linkeddatafragments.org/ Would be a good idea to port these queries.
Best,
Ruben
Hi Ruben.
On 21.12.2016 23:15, Ruben Verborgh wrote:
Hi Markus,
(I did read your paper ;-)
Awesome :-)
As editors, we have of course read all papers that have been published in that issue (and some others).
...
This also shows our departure from typical SemWeb ideas: with TPF, we accept that some things are slow. Any query is possible (given a completely implemented engine), but some of them just take a lot of time.
Sorry, but this is really not what I am seeing. The queries I have tried all failed entirely. They were not merely slow: they either completed with no results, timed out, or looped without any success.
...
I don't have too much experience with Wikidata, unfortunately, but I can give you some inspiration with DBpedia: http://client.linkeddatafragments.org/ Would be a good idea to port these queries.
No need to translate DBpedia queries. We have a large number (over 300) of user-written example queries for Wikidata:
https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples
Maybe you can find a good example there. I am afraid that blocking operators are rather essential in practice though, unless you use neither aggregates nor put a limit on the number of results (the latter makes sense only with ordered results, at least if you want paging).
Best,
Markus
Hi Markus,
Any query is possible (given a completely implemented engine), but some of them just take a lot of time.
Sorry, but this is really not what I am seeing. The queries I have tried all failed entirely. They were not merely slow: they either completed with no results, timed out, or looped without any success.
"Given a completely implemented engine" is crucial there. I expect errors in the console or something.
It is trivial to prove that a TPF client can evaluate any SPARQL query with full completeness with respect to a finite number of sources, since one can obtain a full data dump with TPF. (It is equally trivial to prove that some will be very slow.)
No need to translate DBpedia queries.
Sure, they were just an example of the kind of queries that are interesting.
We have a large number (over 300) of user-written example queries for Wikidata:
I'll try to autorun them after the holidays and get timings for each of them.
I am afraid that blocking operators are rather essential in practice though
It all depends on the use cases and the constraints. Running things on the Web, as opposed to on a closed database, completely changes the setting.
The kind of queries that work great with TPF are the kind of queries that also work great on the open Web.
An example I've given in the past (http://www.slideshare.net/RubenVerborgh/the-lonesome-lod-cloud/47): a query with ORDER BY doesn't make sense on the Open Web. You'll just wait forever, because the first answer can't be given since we can never be sure there won't be another one before it.
So instead, don't have ORDER BY in the query; rather have an interface that dynamically reorders results as they stream in.
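A minimal sketch of that idea — keeping a client-side view sorted as unordered results stream in (Python for brevity; the actual client is JavaScript, and the class name here is made up for illustration):

```python
import bisect

class StreamingView:
    """Maintains an always-sorted client-side view of results,
    instead of asking the server for ORDER BY."""
    def __init__(self, key=lambda row: row):
        self.key = key
        self._rows = []  # list of (sort_key, row), kept sorted

    def add(self, row):
        # Insert each new result at its sorted position as it streams in.
        bisect.insort(self._rows, (self.key(row), row))

    def current(self):
        # The current (possibly still incomplete) ordered result set.
        return [row for _, row in self._rows]
```

Every `add` yields an already-ordered view, so the interface can show something immediately instead of waiting for the complete result.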
In the queries I tried, no results were streaming in whatsoever.
The default query streams (http://ldfclient.wmflabs.org/).
it is increasingly annoying to get late results that shift your page layout and to be unsure if more things will come later.
Or exciting, depending on how you look at it ;-)
I believe users want to see a stable result, basically instantaneously. They don't want to witness the computation process in slow motion.
Very well, but then they'll have to pay for it I'm afraid. I'm very happy with Wikimedia, Wikidata, Wikipedia, DBpedia, but these are unfortunately the exception.
We cannot expect to obtain multi-source information from the Web without paying for this in some way. It might be with privacy and diversity (e.g., Facebook), it might be with money (e.g., paying APIs), it might be with time (e.g., TPF).
So in absence of a business model for federated querying on the Web, I focus on reducing publishers' costs, such that we pay for information only with time. Which is what we did on the Web in the old days.
Point taken. I just wished the community produced Web-grade solutions ;-)
I don't get what you mean by this. It sounds like a marketing term to me that has no serious meaning. What exactly is your problem with the current Wikidata or DBpedia SPARQL endpoints? Why are they not "Web grade" in your view?
As I said above, Wikidata and DBpedia are the exceptions. There is some philanthropy behind these (and I'm happy for that). They are only Web-grade because somebody is willing to pay the bill without reaping immediate benefits from that. As we know, that's not how the rest of the Web/world works.
Think about all the publishers that make their data freely available. They have been doing so for years and years, through websites and, later, Web APIs. Can we really expect them to provide an API that is more expressive/expensive than any other API out there? Can we really expect publishers who already give data away for free, to also pay for the entire cost of executing queries over that data? I don't think so.
I think the only feasible option for free/public data is to provide lightweight APIs. For some, "lightweight" might mean just a download; for others, like Wikipedia, it means one page per subject. But for none of them (except the philanthropy-driven ones), it will mean "let me execute a custom query". It's just not feasible on the Web.
So a Web-grade solution to me is a solution that publishers can offer at the same cost for which they're currently on the Web.
And of course, everything changes if data consumers carry the cost, either through privacy, money, or other means. But that's not primarily what I'm talking about.
Eventually, I want machines to access open information on the Web just like I can, from different sources, nicely combined. As long as they're faster than me, and I don't have to do it, I'm a happy man. The fast solutions come at a price, always.
Best,
Ruben
Hi Markus,
The example query related to films returns 55 results, while on the official endpoint it returns 128.
This turned out to be due to an incomplete implementation of LANGMATCHES, which I have now fixed in the query engine (https://github.com/LinkedDataFragments/Client.js/commit/09c829d07).
Thanks for reporting!
The sample query took 55s on my machine (producing only half of the results)
I blame latency here; the server seems to be located in San Francisco. Fortunately, and in contrast to SPARQL, this can be easily solved for TPF by using any of the many available HTTP cloud caching options. Alternatively, HTTP/2 can solve this.
Best,
Ruben
Hi!
I blame latency here; the server seems to be located in San Francisco.
WDQS servers are in Virginia, AFAIK: https://wikitech.wikimedia.org/wiki/Eqiad_cluster
Also, looks like I've underestimated the interest the demo endpoint has, so I'll do the following:
1. Create a separate VM for it so I won't break it unintentionally
2. Make it so it'd be easy to update it to the latest code (right now it involves several manual steps and downtime) and actually update it.
That means also sometime today I'll have to take it down briefly to transfer the name to the new setup, but should be a very brief interruption. Will update once it's done.
Hi!
Also, looks like I've underestimated the interest the demo endpoint has, so I'll do the following:
- Create a separate VM for it so I won't break it unintentionally
- Make it so it'd be easy to update it to latest code (right now it
involves several manual steps and downtime) and actually update it.
That means also sometime today I'll have to take it down briefly to transfer the name to the new setup, but should be a very brief interruption. Will update once it's done.
OK, I deployed the latest version on http://ldfclient.wmflabs.org/ which should be a little more stable.
Hello,
I tried playing with it a bit and noticed an oddity in the JSON format: if the predicate and object are both left unspecified, "P_" keys will sometimes refer to full statement nodes and sometimes to truthy values. An example item with not too many statements where you can witness this is Q26536085:
curl -s -H 'Accept: application/ld+json' 'https://query.wikidata.org/bigdata/ldf?subject=http%3A//www.wikidata.org/ent...' | jq '.["@graph"] | .[] | select(.["@id"]=="wd:Q26536085")'
Right now, I get the following results:
"P1216": "wds:Q26536085-FCED904F-7F06-444A-84CE-0AFCE089C92C",
"p:P131": { "@id": "wds:Q26536085-44302E43-6F33-4F4A-9783-BD631171BF43" },
"p:P1435": { "@id": "wds:Q26536085-01720F88-7A41-47C6-84FA-74F5E7538CDC" },
"P17": "wds:Q26536085-8D04E875-1CD8-4A53-BCC9-1B8591A4AE78",
"p:P31": { "@id": "wds:Q26536085-6CBDDC3D-632A-41C3-8E3B-D9E1D0C103F7" },
"P625": "wds:Q26536085-49F66E15-7AE7-44D6-95CE-A4955734EA07",
"wdt:P1216": "1243406",
"P131": "wd:Q635457",
"P1435": "wd:Q15700834",
"wdt:P17": { "@id": "wd:Q145" },
"P31": "wd:Q3947",
"wdt:P625": { "@type": "geo:wktLiteral", "@value": "Point(-2.027844 51.36333)" }
As you can see, the "P_" key sometimes refers to the full statement node ("wds:_") and sometimes to the direct, truthy value ("wd:Q_"). Where "P_" points to the statement node, there’s also a "wdt:P_" entry (sometimes pointing directly at a string containing the ID, sometimes pointing to an { "@id": _ } object); conversely, where "P_" points to the truthy value, there’s a "p:P_" entry to an { "@id": _ } object.
Is there any reason why different representations are chosen? Is this predictable? Is this a bug? Or is this just something you have to work around using the @context information if you want to use the JSON format? (The other data formats don’t seem to have this problem, since they don’t use unprefixed "P_" keys.)
Cheers, Lucas
On 21.12.2016 09:23, Léa Lacroix wrote:
Hello all,
The SPARQL endpoint we are running at http://query.wikidata.org has several measures in place in order to ensure it stays up and running and available for everyone, for example the 30 sec query timeout. This is necessary but also prevents some useful queries from being run. One way around this is Linked Data Fragments. It allows for some of the query computation to be done on the client-side instead of our server.
We have set this up now for testing and would appreciate your testing and feedback. You can find out more about Linked Data Fragments http://linkeddatafragments.org/concept/ and documentation for our installation https://www.mediawiki.org/wiki/Wikidata_query_service/User_Manual#Linked_Data_Fragments_endpoint. Also, you can see a demo of client-side SPARQL evaluation and LDF server usage here: http://ldfclient.wmflabs.org/
Please note - it's in no way a production service for anything, just a proof-of-concept deployment of LDF client. If you like how it works, you can get it from the source https://github.com/LinkedDataFragments/jQuery-Widget.js and deploy it on your own setup.
Feel free to ask Stas (Smalyshev (WMF)) for any further question!
-- Léa Lacroix Community Communication Manager for Wikidata
Wikimedia Deutschland e.V. Tempelhofer Ufer 23-24 10963 Berlin www.wikimedia.de http://www.wikimedia.de
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter der Nummer 23855 Nz. Als gemeinnützig anerkannt durch das Finanzamt für Körperschaften I Berlin, Steuernummer 27/029/42207.
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Hi!
I tried playing with it a bit and noticed an oddity in the JSON format: if the predicate and object are both left unspecified, "P_" keys will sometimes refer to full statement nodes and sometimes to truthy values. An example item with not too many statements where you can witness this is Q26536085:
I think those are resolved in @context part to full URIs. The output is not just JSON, it's JSON-LD: https://en.wikipedia.org/wiki/JSON-LD
I'm not exactly sure why different ones are encoded in a different way though, I'll check into that.
Hi Lucas,
I tried playing with it a bit and noticed an oddity in the JSON format: if the predicate and object are both left unspecified, "P_" keys will sometimes refer to full statement nodes and sometimes to truthy values. An example item with not too many statements where you can witness this is Q26536085:
That might be a bug in the server.
Wikidata seems to use the experimental Java server, not the default and more tested Node.js server.
I reported it here: https://github.com/LinkedDataFragments/Server.Java/issues/41
Thanks,
Ruben
Hi!
Wikidata seems to use the experimental Java server, not the default and more tested Node.js server.
I reported it here: https://github.com/LinkedDataFragments/Server.Java/issues/41
After thinking a bit about it, I *think* I've found the logic behind it: when it encounters the first P-statement, it uses only the suffix and puts the full URL in @context. However, when it encounters a second URI with the same suffix, it uses the prefix, and then there's no need for @context. Since the statements can be encountered in any order in the DB, it's not consistent which of the two gets @context and which gets a prefix.
The algorithm for parsing it would be: check whether the key contains a ":"; if yes, resolve it against the prefix table, if not, resolve it against @context.
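Assuming that heuristic is right, the consumer-side logic might look like this (a sketch, not the server's documented contract; the `prefixes` and `context` tables are whatever the JSON-LD document supplies):

```python
def resolve_key(key, context, prefixes):
    """Resolve a predicate key from the JSON-LD output:
    prefixed keys ("p:P131") expand via the prefix table,
    bare keys ("P1216") are looked up in @context."""
    if ":" in key:
        prefix, local = key.split(":", 1)
        return prefixes[prefix] + local
    return context[key]
```

A proper JSON-LD processor would handle this automatically via expansion, but this shows why both key shapes can denote the same predicate.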
Hi!
Wikidata seems to use the experimental Java server, not the default and more tested Node.js server.
Yes, it is built on top of the Java LDF server, since it integrates with Blazegraph, which is of course Java.
I reported it here: https://github.com/LinkedDataFragments/Server.Java/issues/41
Thank you, and thank you for responding so quickly! I'll keep an eye on it and will report any oddities.
Hi again,
A thing I was wondering about while testing LDF is how this type of service might behave under load. In the tests I am doing, my single browser issues several hundred thousand requests for a single query, at an average rate close to 100 requests per second. This is one user.
It seems one might need a sizeable caching/replication/sharding infrastructure to cope with this load as soon as more than a few users issue manual queries. The current Wikidata SPARQL service handles about 20-30 queries per second on average. If you have this rate, and you expect that an LDF query is taking 30sec to answer on average (being optimistic here compared to my experience so far), you will have about 600-900 active queries at each moment, for a rate of 60,000 to 90,000 requests per second.
This seems to be a lot. It is actually approaching the order of magnitude we are seeing for Wikipedia (it's hard to compare these services; Wikipedia has mostly cache-served content too, but the average result size is larger). Wouldn't this load somehow lead to problems?
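Spelled out, the estimate above is essentially Little's law (concurrent queries = arrival rate × average duration), with the 30 s duration and 100 requests/s per query being the assumptions stated in this thread:

```python
# Back-of-envelope load estimate using the figures from this thread.
sparql_qps = (20, 30)          # current WDQS query arrival rate (queries/s)
avg_duration_s = 30            # assumed average LDF query duration
requests_per_query_s = 100     # TPF requests one running query issues per second

# Little's law: concurrently active queries = arrival rate * duration
active_queries = [qps * avg_duration_s for qps in sparql_qps]
server_rps = [a * requests_per_query_s for a in active_queries]

print(active_queries)  # [600, 900]
print(server_rps)      # [60000, 90000]
```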
By the way, the query I had tried (streets named after women) has now finished after 1h and 20min (with the correct number of 320 results). If you have such "harder" [1] queries in the mix, the average time I estimated above might be too small. Such long runtimes also seem to increase the likelihood of connection errors and data inconsistencies (e.g., what if the database is updated during this time?). I got some failed requests during this query, too, but apparently they did not affect my result.
Cheers,
Markus
[1] Of course, this "hard" query takes a mere 1.3 sec on the SPARQL endpoint, so it is still very far from the 30sec timeout that LDF is aiming to go beyond.
Hi Markus,
A thing I was wondering about while testing LDF is how this type of service might behave under load. In the tests I am doing, my single browser issues several hundred thousand requests for a single query, at an average rate close to 100 requests per second. This is one user.
That's indeed a very interesting question for all of us.
TPF trades server-side cost for speed and bandwidth. I usually say "still less bandwidth than YouTube HD", but that doesn't mean it's not a lot for some queries.
However, the _number_ of requests shouldn't be an issue with HTTP/2, but rather the combined _volume_ of these requests. (With HTTP/1.1, this is different due to latency.)
It seems one might need a sizeable caching/replication/sharding infrastructure to cope with this load as soon as more than a few users issue manual queries.
Most certainly. The whole idea of TPF is based on caching. We earlier had a discussion on “Web-grade”, and designing for cache efficiency is an important part of that. Also see work on decentralized caching in this area (http://linkeddatafragments.org/publications/#caching).
Your initial question is answered for BSBM in the papers I pointed to earlier, where we see the very strong influence of caching (http://linkeddatafragments.org/publications/jws2016.pdf#page=21).
However, as the current tests of people here empirically seem to indicate, perhaps it's time for experiments with Wikidata also.
It's interesting to point out that the number of non-empty fragments per dataset is finite, so they can all be pre-cached; in contrast to an API such as SPARQL, where there is an infinite number of non-empty resources.
Wouldn't this load somehow lead to problems?
I don't think so, if caching is enabled. E.g., we can just put a CloudFlare cache in front of this, and it will work*.
By the way, the query I had tried (streets named after women) has now finished after 1h and 20min (with the correct number of 320 results). If you have such "harder" [1] queries in the mix, the average time I estimated above might be too small.
It all depends on the kind of queries indeed. I can think of thousands that will execute within 10s, and of thousands more that will take over an hour. Which queries are of the kind people would want to execute regularly, I don't know. We urgently need more apps using SPARQL to find out what practical use cases would look like.
For some things, TPF will always be too slow. That's fine, I suppose; we never intended TPF to be a final answer. It was rather an eye-opener to show what _is_ possible with lightweight interfaces, and to understand their limitations.
In that sense, we should be very careful with terminology. The answer to the question "can LDF scale?" is “yes”, because “Linked Data Fragments” is a conceptual framework for _all_ possible HTTP interfaces to RDF data, which includes data dumps, LD documents, TPF, and SPARQL endpoints. So the scaling is in the choice of interface here.
However, the answer to the question “can TPF scale?” is “it depends”, as we have to define precisely what scaling means. If “scaling” means that we can serve increasing numbers of clients just by putting more regular HTTP caches in the network, then “yes”. If “scaling” means that all queries have to be fast, the answer is definitely no.
The more interesting question, IMHO, is: what interface should we use for which use case? These LDF interfaces existed before TPF:
– data dump
– LD documents
– SPARQL endpoint
– LDP API
– …
Since TPF, other LDF interfaces have been proposed (http://linkeddatafragments.org/publications/#interfaces-beyond-triple-patter...):
– TPF with membership filters
– TPF with substring filtering
– brTPF
– …
None of these interfaces "scales better” (in general) than any other interface. Rather, it's a question of trade-offs. A SPARQL endpoint will scale better in query complexity, but will not scale with increasing numbers of clients by simply putting extra cache servers. brTPF will scale better in query complexity than TPF, but will have a strong influence on server cost.
Where I see my job as a researcher, is to find and document new trade-off balances, so that data publishers can decide, given their constraints, what the best is for them.
In the case of Wikidata, where there's already a SPARQL endpoint that is paid for and has reasonable uptime, I don't see much benefit for a TPF interface (unless it comes to federation, where TPF excels). However, several publishers don't have the budget for a SPARQL endpoint, so an interface like TPF might be more affordable, but then querying will be slower and bandwidth higher.
My vision for a future is one where publishers decide between a couple of different APIs/features they plan to offer, with clients dynamically discovering what each source supports. Then federated queries can be executed over SPARQL endpoints, LD documents, TPF, brTPF, etc. alike. It's really a question of diversity on the publisher side (no one size fits all) and clients being able to deal with that.
Such long runtimes also seem to increase the likeliness of connection errors and data inconsistencies (e.g., what if the database is updated during this time?).
With regard to consistency, I'd point at the TPF spec that demands some well-defined order (http://www.hydra-cg.com/spec/latest/triple-pattern-fragments/), and also at the combination with Memento, which allows querying a specific (constant) version (https://ruben.verborgh.org/blog/2016/06/22/querying-history-with-linked-data...).
[1] Of course, this "hard" query takes a mere 1.3 sec on the SPARQL endpoint, so it is still very far from the 30sec timeout that LDF is aiming to go beyond.
I don't see overcoming the 30s limit as a prime use case of TPF. Sure, a fully implemented engine would be able to do it, but the query execution time will likely be unacceptable. I do see interesting use cases here for more-than-TPF interfaces.
Best,
Ruben
* A practical issue here is that CloudFlare doesn't support content-type-based-caching, so we would have to drop the content negotiation, but that is not an inherent technical limitation.
Ruben,
When you and I talked 2 years ago about my bandwidth concerns (moving the data out, instead of keeping it siloed in and costing providers), I did some research into technologies that help with that. BitTorrent was one of those, and specifically some of the cool DHT proposals that help with data discovery in the first place (such as the talk here of caching).
Caching could be done instead in a different way, perhaps even through DHT.
2 things immediately come to my mind that might help later, using different architecture through DHT instead.
1. Metadata files (could be small indexed Wikidata sets of Things) http://www.bittorrent.org/beps/bep_0009.html
2. Merkle Trees (pieces of pieces of Things or indexed sets) http://www.bittorrent.org/beps/bep_0030.html http://www.bittorrent.org/beps/bep_0000.html
This might be a good research investment for your teams in 2017, letting the world share small index sets of Wikidata's data, e.g. All Software Companies, or All Men, etc.
In fact there's several good BitTorrent enhancement proposals that could benefit TPF and LDF client/servers potentially: http://www.bittorrent.org/beps/bep_0000.html
I wish you and your research teams the best in 2017. -Thad
Hi!
Most certainly. The whole idea of TPF is based on caching. We earlier had a discussion on “Web-grade”, and designing for cache efficiency is an important part of that. Also see work on decentralized caching in this area (http://linkeddatafragments.org/publications/#caching).
Thanks, the CyCLaDEs paper appears to be paywalled, but I certainly will read the other ones.
I don't think so, if caching is enabled. E.g., we can just put a CloudFlare cache in front of this, and it will work*.
I'm not sure how effective a "naive" cache like CloudFlare would be when we're talking about billions of triples in various combinations. TPF requests are pretty elementary AFAIK, and wouldn't that mean that CF would have to load pretty much every view of the graph engine index and store it in every format (assuming the content negotiation issue is solved)?
Hi!
A thing I was wondering about while testing LDF is how this type of service might behave under load. In the tests I am doing, my single browser issues several hundred thousand requests for a single query, at an average rate close to 100 requests per second. This is one user.
So I did some light benchmarking, and it looks like a single server can do 700 to 800 rps for TPF queries without significant rise in the load (which is understandable since it's almost all IO). Single request median time seems to be around 150ms and 99% time around 500ms. This quick test was done on 150 parallel threads.
Hi!
So I did some light benchmarking, and it looks like a single server can do 700 to 800 rps for TPF queries without significant rise in the load (which is understandable since it's almost all IO). Single request median time seems to be around 150ms and 99% time around 500ms. This quick test was done on 150 parallel threads.
I've re-run the benchmark with best-practices setting on 150 threads while randomizing the patterns I look up and it gave me over 1000 rps with average response time around 150 ms. The load was slightly higher but nowhere near the max.
So these are the parameters so far (remember that's for one server, so 3 servers ideally are supposed to do 3x of that).
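A quick sanity check on these numbers, spelled out (one server at ~1000 rps from the benchmark above, one active client query generating ~100 rps per Markus's test earlier in the thread):

```python
# Back-of-the-envelope capacity estimate from the figures in this thread.
per_server_rps = 1000   # benchmarked single-server throughput
servers = 3             # assuming roughly linear scaling across the cluster
client_rps = 100        # TPF requests one running client query generates

cluster_rps = per_server_rps * servers
concurrent_clients = cluster_rps // client_rps
print(cluster_rps, concurrent_clients)  # 3000 rps -> ~30 concurrent client queries
```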
On 23.12.2016 01:16, Stas Malyshev wrote: ...
Maybe I am slightly confused here. The number of 1000 requests per second seems to be too low if a single query leads to 100 rps, no? Or do you mean 1000K rps?
Of course adding more servers will help, like it also does with full-fledged SPARQL. But then there is no advantage compared to SPARQL. We know that we can do 20-30 SPARQL queries per second with two servers. If query execution times would be the same for TPF (!), then this would be 2000-3000 rps already. If this requires two servers as well, then there is no real advantage.
Best,
Markus
Hi!
Maybe I am slightly confused here. The number of 1000 requests per second seems to be too low if a single query leads to 100 rps, no? Or do you mean 1000K rps?
No, it can't be 1000K rps - that would require nanosecond-grade roundtrip times over the network, which I don't think is possible. Anyway, this was more a test of how the server handles the requests, and what we are getting from it is that a single request is processed in around 100ms. Which, given you're allowed 5 connections, gives you about 50 rps tops per client, as I understand it (at least for requests that hit the server; caching is different). I'm not sure we can improve server roundtrip time by much, given that the network is involved. Even if we somehow improved it by an order of magnitude, you still wouldn't get to even 1K rps without raising parallelism dramatically. Which right now we can't do - and even in the future there are other limitations, since handling massively parallel connections on a single server has its limits too; I'm not sure the Java servlet model is good at such things. So we shouldn't expect super-high throughput from it, I think.
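Spelling out that per-client ceiling:

```python
# Per-client throughput ceiling from the figures above:
# 5 allowed parallel connections, ~100 ms server roundtrip each.
connections = 5
roundtrip_s = 0.100

max_client_rps = connections / roundtrip_s
print(max_client_rps)  # 50.0 requests per second, tops, per client
```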
Of course adding more servers will help, like it also does with full-fledged SPARQL. But then there is no advantage compared to SPARQL. We know that we can do 20-30 SPARQL queries per second with two servers.
That depends a lot on the kind of queries. Simple queries, yes - they are essentially no different from TPF requests; the only difference is the SPARQL parser, which doesn't take much in the overall scheme of things. With TPF, all queries are simple; with SPARQL, decidedly not so :)
Hi!
A thing I was wondering about while testing LDF is how this type of service might behave under load. In the tests I am doing, my single browser issues several hundred thousand requests for a single query, at an average rate close to 100 requests per second. This is one user.
I was wondering this too. Now, pattern fragment requests are much easier than SPARQL - as far as I can see, they go directly to the index, no query parsing, no plan building, no complex calculations, joins, etc. I haven't seen any noticeable change in the load when the tests were run yesterday (I've run several, and Markus did too).
That said, the answer to the question whether the server can handle the load required for practical LDF usage is a resounding "I don't know". So far I haven't seen any signs of it being problematic (with SPARQL it's pretty apparent when some queries are a problem; I haven't seen anything like that here). But we haven't seen any serious usage yet, and I'm not sure what usage patterns to look for, since it's entirely client-driven.
We do have caching in front of it, though I am not sure how effective it would be - after all, we're talking about 1.5 billion triples, each member of which can appear in patterns, in various combinations and with various output formats, pagination etc. I am not entirely sure whether a naive URL-based cache would do a lot here.
It is possible to have more horizontal-scale replication - i.e. adding servers - of course, at the cost of hardware which inevitably raises the question of budget -
It seems one might need a sizeable caching/replication/sharding infrastructure to cope with this load as soon as more than a few users issue manual queries. The current Wikidata SPARQL service handles about 20-30 queries per second on average. If you have this rate, and you expect an LDF query to take 30 sec to answer on average (being optimistic here compared to my experience so far), you will have about 600-900 active queries at each moment, for a rate of 60,000 to 90,000 requests per second.
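The arithmetic behind this estimate, for the record (active queries = arrival rate x duration, i.e. Little's law; the per-query request rate is the ~100 rps observed earlier in the thread):

```python
# Load estimate: if every WDQS query were run as an LDF client query instead.
arrival_rates_qps = (20, 30)  # SPARQL queries per second on the current service
query_duration_s = 30         # optimistic average LDF query runtime
tpf_rps_per_query = 100       # TPF requests one running query generates

for qps in arrival_rates_qps:
    active_queries = qps * query_duration_s
    total_rps = active_queries * tpf_rps_per_query
    print(active_queries, total_rps)
# 600 active queries -> 60000 rps; 900 active queries -> 90000 rps
```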
Note again that LDF queries would typically be very short in duration (since they produce only 100 items per page) and we still do have parallel connection limits :) But again, I'm not sure how it would behave under typical load, one of the reasons being I don't even know what typical load for such API is. I guess I'll have to monitor it carefully and see if there are signs of trouble and deal with it then.
I plan to do some light load testing just to have at least baseline measures, but until we know what the real usage looks like, it will all be guesswork I think.
(e.g., what if the database is updated during this time?). I got some failed requests during this query, too, but apparently they did not affect my result.
Some of the failures may be because of parallel connection limits; I'm not sure how many parallel requests the JS client produces - it uses web workers, but I haven't found how the parallelism is controlled.
Hi Stas,
It is possible to have more horizontal-scale replication - i.e. adding servers - of course, at the cost of hardware which inevitably raises the question of budget -
Since the number of non-empty TPFs per dataset is finite, just more caching should do, depending of course on the change frequency of the data.
Some of the failures may be because of parallel connection limits; I'm not sure how many parallel requests the JS client produces - it uses web workers, but I haven't found how the parallelism is controlled.
I believe browsers have a politeness limit; the client code itself is limited by default to maximum 10. Might be different with HTTP/2.
Ruben
Hi Stas, Ruben, Thad,
Collected answers:
* Bandwidth: I think this is not the main issue. The slowness is a bonus here, since it stretches the time during which the actual data is loaded. I am not sure but would guess that my 1h20min query has not received much more than 100MB of data.
* Caching: Ruben's argument about this being a "finite set of URLs to cache" is misleading and, I believe, not true. Here is why:
(1) It is not true since we need to cache triple patterns taken from queries, not patterns from the data. For example, there are infinitely many different strings and for each you could look for all Wikidata entities with this label (in fact, this is a real use case that is quite common for WDQS, since people check for labels that are not in the data when doing database integration). Even if there are zero results, this must either be computed or returned from cache.
(2) It is misleading since the number of triple patterns is astronomic even when restricting to URLs and literals that exist in Wikidata:
-> Number of Item-Property-Item triple patterns with one variable:
   23M x 829 x 2 + 23M x 23M = 529 * 10^12 (0.5 quadrillion)
   (829 is the current number of item properties in Wikidata)
Each pattern may lead to many URLs, since the results are split into pages if there are many. The total number is therefore slightly bigger (not much, since most of the patterns have no results -- but this still needs to be cached!).
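Recomputing that figure (the two fixed-constant forms with a variable subject or object, plus the variable-property form, which dominates):

```python
# Count of Item-Property-Item triple patterns with exactly one variable.
items = 23_000_000     # approximate number of Wikidata items
item_properties = 829  # item-valued properties at the time

one_var_prop_fixed = items * item_properties * 2  # (?s P O) and (S P ?o): ~3.8e10
var_property = items * items                      # (S ?p O): 5.29e14, dominates
total = one_var_prop_fixed + var_property
print(f"{total:.3g}")  # 5.29e+14, i.e. about half a quadrillion patterns
```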
So the situation of "what can I cache in practice" vs. "what would I need to cache in theory" is the same as in SPARQL: there is far too much to cache to ever get anywhere near completeness.
* Can caching still help? In a real-world setting, you would like to update your data once in a while. The real question for caching is what percentage of your requests are served from cache *during this time*. In the calculation I made, there were around 100K requests per minute. If we take this as the interval between data updates (as in WDQS now), then chances seem pretty slim that a request repeats. One could hope that there are only few requests and that they therefore repeat a lot. My test suggests that this is not so likely, since my single query generated hundreds of thousands of requests, which I assume were all different (why fetch the same data twice in one query execution?).
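A back-of-the-envelope version of this argument: if requests were drawn uniformly at random from the pattern space estimated above, the expected number of repeated pairs among n requests is roughly n^2 / (2N), a birthday-problem bound. (Real workloads are skewed toward popular patterns, so this is only a lower bound on cache hits, not a prediction.)

```python
# Expected repeats among n requests drawn uniformly from N possible fragments.
n = 100_000   # requests per update interval (one minute, per the text above)
N = 5.29e14   # distinct one-variable triple patterns (earlier estimate)

expected_repeats = n * n / (2 * N)
print(f"{expected_repeats:.2e}")  # ~9.5e-06: essentially no repeats if uniform
```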
In any case, the relative benefit of caching will surely be much smaller for TPF (where you cache an easy-to-compute but large answer) than for SPARQL endpoints (where you cache a hard-to-compute but often small answer). Also note that the total number of SPARQL queries one needs to cache is much (much!) smaller than the number of TPF replies one would cache for answering them.
* Are request numbers no problem because requests are simple? This really depends on the database implementation behind the service. Surely it helps if the database has less work. But there is also the notion of "thrashing" a database, where you destroy performance not by difficult queries but by large numbers of simple ones. Moreover, other components of the server could also be challenged, such as logging or other types of filtering that happen before you even get to the DB.
I realise that I am being rather critical here, but I think we must cross-check if it is realistic to assume that this approach could become useful for Wikidata at some point before investing much more staff time into it. I have supported this work as a research work, and I still do, but the impression we got so far in the "wild" is really not very encouraging.
Best regards,
Markus
On 22.12.2016 18:30, Stas Malyshev wrote: ...
Hi!
common for WDQS, since people check for labels that are not in the data when doing database integration). Even if there are zero results, this must either be computed or returned from cache.
Exactly, people do searches like "does this random string look like a movie name" against both Wikipedia and Wikidata.
(2) It is misleading since the number of triple patterns is astronomic even when restricting to URLs and literals that exist in Wikidata: -> Number of Item-Property-Item triple patterns with one variable: 23M x 829 x 2 + 23M x 23M = 529 * 10^12 (0.5 Quadrillion)
Note that we also have non-item nodes - like statements and value nodes - that can come into play if we want to deal with non-current data (such as list of US presidents), units, coordinates, etc. So the above is undercounting the real dataset size. That's why I am somewhat skeptical about how useful naive request caching (one that is not specialized for this use case) would be.
- Are request numbers no problem because requests are simple? This really depends on the database implementation behind the service. Surely it helps if the database has less work. But there is also the notion of "thrashing" a database, where you destroy performance not by difficult queries but by large numbers of simple ones. Moreover, other components of the server could also be challenged, such as logging or other types of filtering that happen before you even get to the DB.
This is definitely a factor, so we'll be watching the usage/load numbers and see if we notice anything dangerous.
I realise that I am being rather critical here, but I think we must cross-check if it is realistic to assume that this approach could become useful for Wikidata at some point before investing much more staff time into it. I have supported this work as a research work, and I still do,
I don't think we plan to invest more time than we already have. The endpoint is up now; we don't really plan to do anything additional with it - it's for the users now to see if it's useful. We'll be watching to make sure it is not overtaxing resources or dragging the SPARQL part down, but otherwise that's all the investment for now. If we come up with some use case helpful for us we would then implement it, but nothing is planned yet.
On 23.12.2016 01:29, Stas Malyshev wrote: ...
I don't think we plan to invest more time than we already have. The endpoint is up now; we don't really plan to do anything additional with it - it's for the users now to see if it's useful. We'll be watching to make sure it is not overtaxing resources or dragging the SPARQL part down, but otherwise that's all the investment for now. If we come up with some use case helpful for us we would then implement it, but nothing is planned yet.
Sounds good. I support this as a community project, and I remain curious about the results, just as long as it does not affect the production usage of Wikidata.
Markus
Hi Markus,
I am not sure but would guess that my 1h20min query has not received much more than 100MB of data.
That might be possible (we'd need to verify), but then this means the other time was spent computing, which shows the query plan or execution was highly inefficient. So not an inherent limitation of TPF then, but rather an implementation problem that is fixable.
- Caching: Ruben's arguments about this being a "finite set of URLs to cache" is misleading, and, I believe, not true.
Note that my argument was about a finite number of _non-empty_ fragments.
For example, there are infinitely many different strings and for each you could look for all Wikidata entities with this label
True, but empty is a very easy case, as there are many shortcuts to take here.
(2) It is misleading since the number of triple patterns is astronomic even when restricting to URLs and literals that exist in Wikidata: -> Number of Item-Property-Item triple patterns with one variable: 23M x 829 x 2 + 23M x 23M = 529 * 10^12 (0.5 Quadrillion)
Another calculation is that each triple belongs to 2^3 = 8 fragments (variable or constant for each component), so that is just 11 billion non-empty fragments. Taking the gzipped version of https://query.wikidata.org/bigdata/ldf (2 KB) as an average, it would be a total of 22 terabytes to cache the first page of every fragment on disk. Big, but not insurmountable; $70 buys 2 TB nowadays.
So the situation of "what can I cache in practice" vs. "what would I need to cache in theory" is the same as in SPARQL: there is far too much to cache to ever get anywhere near completeness.
But we don't need to cache everything; just the most frequently asked fragments. Caching is about lowering load for the server, not about completely removing that load.
A 9-month study of DBpedia's TPFs showed that 60% of fragments were cached [1] - and that's with the expiration set to only 1 hour.
I don't have the statistics for a SPARQL endpoint, but I would imagine the hit rate to be much lower, given that a hit only occurs when the exact same query is asked with the exact same syntax.
- Can caching still help? In a real-world setting, you would like to update your data once in a while. The real question for caching is which percentage of your requests are served from cache *during this time*.
One could: – be relaxed about caching, and allow an hour of delay; – do active cache invalidation.
In any case, the relative benefit of caching will surely be much smaller for TPF (where you cache an easy-to-compute but large answer) than for SPARQL endpoints (where you cache a hard-to-compute but often small answer).
Disagree, because the benefit strongly depends on the probability that a given request will be asked again, which is much higher for TPF due to lower granularity.
Also note that the total number of SPARQL queries one needs to cache is much (much!) smaller than the number of TPF replies one would cache for answering them.
We were talking about real-world settings; what is the chance that two independent parties ask the exact same SPARQL query (presuming they don't just try the examples)?
- Are request numbers no problem because requests are simple? This really depends on the database implementation behind the service.
Very true. Since TPF requests are so simple, you don't even need an actual DBMS to serve them; most of our servers use simple HDT files.
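A toy illustration of why no full DBMS is needed: a TPF server over HDT essentially answers offset/limit page requests against a sorted triple index. This sketch mimics only the lookup, with an in-memory sorted list and invented example triples (real HDT is a compressed binary format with dictionaries, which this does not attempt to reproduce):

```python
import bisect

# SPO-sorted triple "index", standing in for an HDT file.
triples = sorted([
    ("Q42", "P31", "Q5"),
    ("Q42", "P106", "Q36180"),
    ("Q64", "P31", "Q515"),
    ("Q64", "P17", "Q183"),
])

def fragment(s=None, p=None, o=None, offset=0, limit=100):
    """Return one page of matches plus the total count, like a TPF response."""
    if s is not None:
        # Binary search the SPO-sorted list for the subject's range.
        lo = bisect.bisect_left(triples, (s,))
        hi = bisect.bisect_right(triples, (s, chr(0x10FFFF)))
        candidates = triples[lo:hi]
    else:
        candidates = triples
    matches = [t for t in candidates
               if (p is None or t[1] == p) and (o is None or t[2] == o)]
    return matches[offset:offset + limit], len(matches)

page, count = fragment(p="P31")
print(count)  # 2 triples match the pattern (?s P31 ?o)
```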
But there is also the notion of "thrashing" a database, where you destroy performance not by difficult queries but by large numbers of simple ones.
For this, I'm happy with Stas' measurements; and caching is also going to help out.
Moreover, other components of the server could also be challenged, such as logging or other types of filtering that happen before you even get to the DB.
True, you'd want to keep those as light as possible.
I realise that I am being rather critical here
That's great and rest assured that it is challenging for me to find replies, so the criticism is certainly fair!
we must cross-check if it is realistic to assume that this approach could become useful for Wikidata
The Wikidata case might be peculiar, in that there seem to be no budget issues for hosting a public SPARQL endpoint. So if that's the case, TPF might have low added value, except for federated queries.
I have supported this work as a research work, and I still do
Thanks, and I believe we have some valuable lessons already for further research on this. It might be that Wikidata is better served by more-than-TPF approaches.
but the impression we got so far in the "wild" is really not very encouraging.
True, I'm disappointed myself by the hiccups mentioned in this thread. However, we must distinguish between inherent TPF limitations and limitations of the current client implementation. I hope to revisit this with an updated client after some months, to better assess what is really a TPF limitation and what is not.
Best,
Ruben
[1] http://linkeddatafragments.org/publications/eswc2015-workshops.pdf
Hoi, Indeed $70 buys 2 TB nowadays, but who is the public? Have you done a breakdown of how many people have that much spare capacity for this?
Really, is Wikidata for institutional users, or is it for a community? My biggest criticism of Wikidata is that it is severely underfunded and that its users get the raw end of the stick. Wikidata would be much more active and relevant if its users and their needs got proper attention, and not only student and community projects to serve its needs. Thanks, GerardM
On 23 December 2016 at 01:40, Ruben Verborgh Ruben.Verborgh@ugent.be wrote: ...
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata