Hi Markus,
A thing I was wondering about while testing LDF is how this type of service might behave under load. In the tests I am doing, my single browser issues several hundred thousand requests for a single query, at an average rate close to 100 requests per second. This is one user.
That's indeed a very interesting question for all of us.
TPF trades query speed and bandwidth for lower server-side cost. I usually say "still less bandwidth than YouTube HD", but that doesn't mean it's not a lot for some queries.
However, with HTTP/2, the _number_ of requests shouldn't be an issue, but rather the combined _volume_ of those requests. (With HTTP/1.1, this is different due to latency.)
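To make that request pattern concrete, here is a minimal Python sketch (using the requests library) of what a TPF client produces: many small paginated requests for a single query. The endpoint URL and the subject/predicate/object/page parameters below are placeholders of my own; a real TPF client follows the hypermedia controls in each fragment response rather than constructing URLs itself. A persistent connection amortizes the connection latency, and HTTP/2 multiplexing goes further by interleaving the requests.

    # Sketch of a TPF client's request pattern: many small HTTP requests.
    # (Hypothetical endpoint and parameters; real clients follow the
    # hypermedia controls in each fragment response.)
    import requests

    FRAGMENT = "https://example.org/dataset"  # placeholder TPF endpoint

    session = requests.Session()  # keep-alive: reuse one TCP connection,
                                  # so connection latency is paid only once

    def fetch_pattern(subject=None, predicate=None, obj=None, max_pages=5):
        """Fetch the first pages of one triple-pattern fragment."""
        params = {"subject": subject, "predicate": predicate, "object": obj}
        params = {k: v for k, v in params.items() if v is not None}
        for page in range(1, max_pages + 1):
            response = session.get(FRAGMENT,
                                   params={**params, "page": page},
                                   headers={"Accept": "text/turtle"})
            if response.status_code != 200:
                break
            yield response.text  # one small page of matching triples

    # e.g. pages = list(fetch_pattern(predicate="ex:knows"))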
It seems one might need a sizeable caching/replication/sharding infrastructure to cope with this load as soon as more than a few users issue manual queries.
Most certainly. The whole idea of TPF is based on caching. We had an earlier discussion on “Web-grade”, and designing for cache efficiency is an important part of that. Also see the work on decentralized caching in this area (http://linkeddatafragments.org/publications/#caching).
Your initial question is answered for BSBM in the papers I pointed to earlier, where we see the very strong influence of caching (http://linkeddatafragments.org/publications/jws2016.pdf#page=21).
However, as the tests people here are currently running seem to indicate, perhaps it's time for experiments with Wikidata as well.
It's interesting to point out that the number of non-empty fragments per dataset is finite, so they can all be pre-cached; in contrast to an API such as SPARQL, where there is an infinite number of non-empty resources.
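As a toy illustration of that finiteness (plain Python, with a hypothetical two-triple dataset): every non-empty selector must match at least one triple, and each triple makes at most 8 selectors non-empty, so the full set of fragments can be enumerated from the data itself, e.g. for pre-warming a cache.

    # Enumerate all non-empty triple-pattern selectors of a (toy) dataset.
    # Each triple (s, p, o) contributes at most 8 selectors: each position
    # is either bound to its value or left as a variable (None).
    from itertools import product

    triples = {  # hypothetical toy dataset
        ("ex:Anna", "ex:knows", "ex:Ben"),
        ("ex:Ben", "ex:livesIn", "ex:Gent"),
    }

    selectors = set()
    for s, p, o in triples:
        for mask in product((True, False), repeat=3):
            selectors.add((s if mask[0] else None,
                           p if mask[1] else None,
                           o if mask[2] else None))

    # A finite, enumerable set: every fragment could in principle be pre-cached.
    print(len(selectors), "non-empty fragments for", len(triples), "triples")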
Wouldn't this load somehow lead to problems?
I don't think so, if caching is enabled. E.g., we can just put a CloudFlare cache in front of this, and it will work*.
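The reason an off-the-shelf cache suffices is that fragments are plain HTTP resources with standard caching metadata. A small sketch (hypothetical URL; the concrete header values depend on how the server is configured):

    # Fragments are ordinary HTTP resources, so standard caching applies.
    import requests

    url = "https://example.org/dataset?predicate=ex:knows"  # hypothetical
    r = requests.get(url)
    print(r.headers.get("Cache-Control"))  # e.g. "public, max-age=604800"
    print(r.headers.get("ETag"))

    # A cache (or client) can revalidate cheaply instead of re-downloading:
    r2 = requests.get(url, headers={"If-None-Match": r.headers.get("ETag", "")})
    print(r2.status_code)  # 304 Not Modified if the fragment is unchanged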
By the way, the query I had tried (streets named after women) has now finished after 1h and 20min (with the correct number of 320 results). If you have such "harder" [1] queries in the mix, the average time I estimated above might be an underestimate.
It all depends on the kind of queries indeed. I can think of thousands that will execute within 10s, and of thousands more that will take over an hour. Which queries are of the kind people would want to execute regularly, I don't know. We urgently need more apps using SPARQL to find out what practical use cases would look like.
For some things, TPF will always be too slow. That's fine, I suppose; we never intended TPF to be a final answer. It was rather an eye-opener to show what _is_ possible with lightweight interfaces, and to understand their limitations.
In that sense, we should be very careful with terminology. The answer to the question “can LDF scale?” is “yes”, because “Linked Data Fragments” is a conceptual framework for _all_ possible HTTP interfaces to RDF data, which includes data dumps, LD documents, TPF, and SPARQL endpoints. So with LDF, the scaling lies in the choice of interface.
However, the answer to the question “can TPF scale?” is “it depends”, as we have to define precisely what scaling means. If “scaling” means that we can serve increasing numbers of clients just by putting more regular HTTP caches in the network, then “yes”. If “scaling” means that all queries have to be fast, the answer is definitely “no”.
The more interesting question, IMHO, is what interface we should use for which use case. These LDF interfaces existed before TPF:
– data dumps
– LD documents
– SPARQL endpoints
– LDP APIs
– …
Since TPF, other LDF interfaces have been proposed (http://linkeddatafragments.org/publications/#interfaces-beyond-triple-patter...):
– TPF with membership filters
– TPF with substring filtering
– brTPF
– …
None of these interfaces “scales better” (in general) than any other interface. Rather, it's a question of trade-offs. A SPARQL endpoint will scale better in query complexity, but will not scale with increasing numbers of clients by simply adding extra cache servers. brTPF will scale better in query complexity than TPF, but has a stronger influence on server cost.
Where I see my job as a researcher is to find and document new trade-off balances, so that data publishers can decide, given their constraints, what is best for them.
In the case of Wikidata, where there's already a SPARQL endpoint that is paid for and has reasonable uptime, I don't see much benefit for a TPF interface (unless it comes to federation, where TPF excels). However, several publishers don't have the budget for a SPARQL endpoint, so an interface like TPF might be more affordable, but then querying will be slower and bandwidth higher.
My vision for the future is one where publishers decide between a couple of different APIs/features they plan to offer, with clients dynamically discovering what each source supports. Then federated queries can be executed over SPARQL endpoints, LD documents, TPF, brTPF, etc. alike. It's really a question of diversity on the publisher side (no one size fits all) and of clients being able to deal with that.
Such long runtimes also seem to increase the likelihood of connection errors and data inconsistencies (e.g., what if the database is updated during this time?).
With regard to consistency, I'd point to the TPF spec, which demands a well-defined order (http://www.hydra-cg.com/spec/latest/triple-pattern-fragments/), and also to the combination with Memento, which allows querying a specific (constant) version (https://ruben.verborgh.org/blog/2016/06/22/querying-history-with-linked-data...).
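For illustration, Memento's datetime negotiation (RFC 7089) is just one extra request header; the sketch below (hypothetical URL and date) pins a request to a fixed snapshot, so that all requests of one query execution can see the same version.

    # Sketch of Memento datetime negotiation: ask for the fragment as it
    # was at a fixed moment, keeping long-running queries consistent.
    import requests

    FIXED_VERSION = "Tue, 15 Mar 2016 00:00:00 GMT"  # hypothetical snapshot

    response = requests.get(
        "https://example.org/dataset?subject=ex:Anna",  # hypothetical URL
        headers={"Accept-Datetime": FIXED_VERSION},
    )
    # The Memento-Datetime response header reports the version we received:
    print(response.headers.get("Memento-Datetime"))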
[1] Of course, this "hard" query takes a mere 1.3 sec on the SPARQL endpoint, so it is still very far from the 30 sec timeout that LDF is aiming to go beyond.
I don't see overcoming the 30s limit as a prime use case of TPF. Sure, a fully implemented engine would be able to do it, but the query execution time will likely be unacceptable. I do see interesting use cases here for more-than-TPF interfaces.
Best,
Ruben
* A practical issue here is that CloudFlare doesn't support content-type-based caching, so we would have to drop content negotiation, but that is not an inherent technical limitation.
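To illustrate that footnote: with content negotiation, one URL serves several representations, so a correct cache has to key on the Accept header as well (which servers signal with "Vary: Accept"); a cache that only keys on the URL would mix up formats. A sketch with a hypothetical URL:

    # Why content negotiation complicates caching: the same URL yields
    # different representations depending on the Accept request header.
    import requests

    url = "https://example.org/dataset"  # hypothetical fragment URL
    turtle = requests.get(url, headers={"Accept": "text/turtle"})
    jsonld = requests.get(url, headers={"Accept": "application/ld+json"})

    # Servers signal the extra cache key with "Vary: Accept"; a cache that
    # ignores it would serve Turtle to a JSON-LD client or vice versa.
    print(turtle.headers.get("Vary"), turtle.headers.get("Content-Type"))
    print(jsonld.headers.get("Content-Type"))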