Hi Stas, Ruben, Thad,
Collected answers:
* Bandwidth: I think this is not the main issue. The slowness is actually a bonus here, since it stretches out the time over which the actual data is loaded. I am not sure, but I would guess that my 1h20min query did not receive much more than 100MB of data.
* Caching: Ruben's argument that this is a "finite set of URLs to cache" is misleading and, I believe, not true. Here is why:
(1) It is not true, since we need to cache the triple patterns taken from queries, not patterns occurring in the data. For example, there are infinitely many different strings, and for each of them one could ask for all Wikidata entities with that label (in fact, this is a real and quite common use case for WDQS, since people check for labels that are not in the data when doing database integration; see the request sketch below). Even if there are zero results, each such request must either be computed or answered from cache.
(2) It is misleading, since the number of triple patterns is astronomical even when restricting to URLs and literals that actually occur in Wikidata. The number of Item-Property-Item triple patterns with exactly one variable alone is 23M x 829 x 2 + 23M x 23M ≈ 529 * 10^12, i.e. about 0.5 quadrillion (829 being the current number of item properties in Wikidata). Each pattern may moreover correspond to several URLs, since large results are split into pages, so the total number of URLs is slightly bigger still (not by much, since most patterns have no results -- but these empty results also need to be cached!).
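For reference, this count can be reproduced in a few lines of Python, using the figures from this mail (about 23 million items and 829 item properties; both numbers will keep changing, of course):

  # Back-of-envelope count of Item-Property-Item triple patterns with exactly
  # one variable, using the figures quoted above (late-2016 values).
  items = 23_000_000        # approximate number of Wikidata items
  item_properties = 829     # properties whose values are items

  # variable in subject or object position: fix one item and one property
  one_var_item_property = items * item_properties * 2
  # variable in property position: fix two items
  one_var_two_items = items * items

  total = one_var_item_property + one_var_two_items
  print(f"{total:.3e} triple patterns")   # ~5.29e14, i.e. about 0.5 quadrillion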
So the situation of "what can I cache in practice" vs. "what would I need to cache in theory" is the same as in SPARQL: there is far too much to cache to ever get anywhere near completeness.
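To make point (1) above concrete, here is a small sketch of how a client might ask a Triple Pattern Fragments endpoint for all entities carrying an arbitrary label. The endpoint URL and the predicate/object query parameters follow the usual TPF interface, but treat them as assumptions rather than a description of the actual Wikidata service:

  from urllib.parse import urlencode

  # Hypothetical TPF endpoint URL -- adjust to the actual Wikidata LDF service.
  TPF_ENDPOINT = "https://query.wikidata.org/bigdata/ldf"

  def label_fragment_url(label: str, lang: str = "en") -> str:
      """Fragment URL asking for all entities with the given rdfs:label.

      Any string whatsoever can appear here, including strings that occur
      nowhere in the data, so the space of cacheable request URLs is unbounded.
      """
      params = {
          "predicate": "http://www.w3.org/2000/01/rdf-schema#label",
          "object": f'"{label}"@{lang}',   # literal object; subject stays variable
      }
      return TPF_ENDPOINT + "?" + urlencode(params)

  # Three of infinitely many distinct URLs a label-checking client could request;
  # each must be answered (and cached) even if the result is empty.
  for label in ["Douglas Adams", "Duglas Adams", "Douglas Adams (novelist)"]:
      print(label_fragment_url(label))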
* Can caching still help? In a real-world setting, you would like to update your data once in a while, and the real question for caching is what percentage of your requests is served from cache during the interval between two updates. In the calculation I made, there were around 100K requests per minute. If we take one minute as the interval between data updates (as in WDQS now), then the chances that a request repeats within that window seem pretty slim. One could hope that there are only a few distinct requests overall and that they therefore repeat a lot, but my test suggests that this is not likely: my single query alone generated hundreds of thousands of requests, which I assume were all different (why fetch the same data twice in one query execution?).
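A minimal sketch of this repeat-probability argument; the assumption of uniformly distributed requests is mine (real traffic will be skewed), but the orders of magnitude are the ones discussed above:

  # How likely is it that a TPF request repeats before the cache is invalidated
  # by a data update? Figures from this thread; uniform traffic is a model
  # assumption, not an observation.
  requests_per_interval = 100_000   # ~100K requests between two updates (see above)
  distinct_patterns = 5.29e14       # triple-pattern space from point (2)

  # Birthday-problem approximation: the expected number of colliding request
  # pairs among N requests drawn uniformly from M possibilities is ~ N^2 / (2*M).
  expected_repeats = requests_per_interval**2 / (2 * distinct_patterns)
  print(f"expected repeated requests per update interval: {expected_repeats:.2e}")
  # -> on the order of 1e-5, i.e. essentially no cache hits under this model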
In any case, the relative benefit of caching will surely be much smaller for TPF (where you cache an easy-to-compute but large answer) than for SPARQL endpoints (where you cache a hard-to-compute but often small answer). Also note that the total number of SPARQL queries one needs to cache is much (much!) smaller than the number of TPF replies one would have to cache to answer them.
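A toy illustration of that last comparison; the number of queries is an arbitrary assumption, the requests-per-query figure is the order of magnitude seen in my test above, and overlap between queries is ignored (which the repeat estimate above suggests is a fair simplification):

  # Toy comparison of cache footprints: one entry per cached SPARQL result
  # versus one entry per TPF page fetched to answer the same queries.
  distinct_queries = 1_000            # assumed batch of distinct SPARQL queries
  tpf_requests_per_query = 100_000    # order of magnitude from the 1h20min test

  sparql_cache_entries = distinct_queries
  tpf_cache_entries = distinct_queries * tpf_requests_per_query  # ignores overlap

  print(f"SPARQL result cache entries: {sparql_cache_entries:,}")
  print(f"TPF fragment cache entries:  {tpf_cache_entries:,}")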
* Are the request numbers no problem because the requests are simple? This really depends on the database implementation behind the service. It certainly helps that each individual request means less work for the database. But there is also the notion of "thrashing" a database, where you destroy performance not with difficult queries but with large numbers of simple ones. Moreover, other components of the server can also be challenged, such as logging or other request filtering that happens before you even reach the DB.
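To put rough numbers on this, here is a back-of-envelope model that combines the request-rate estimate from the quoted mail below with an assumed fixed cost per request; the 1 ms overhead figure is purely illustrative, not a measurement:

  # "Thrashing" estimate: even if each TPF request is individually cheap,
  # fixed per-request costs (HTTP handling, logging, access filtering) add up.
  queries_per_second = 25           # ~20-30 SPARQL-level queries/s arriving
  seconds_per_query = 30            # assumed average duration of a TPF-based query
  requests_per_running_query = 100  # TPF page requests per second per running query

  concurrent_queries = queries_per_second * seconds_per_query            # ~750
  requests_per_second = concurrent_queries * requests_per_running_query  # ~75,000

  overhead_ms_per_request = 1.0     # assumed per-request cost outside the DB itself

  busy_cores = requests_per_second * overhead_ms_per_request / 1000
  print(f"~{concurrent_queries} concurrent queries, ~{requests_per_second} requests/s")
  print(f"~{busy_cores:.0f} CPU cores occupied by per-request overhead alone")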
I realise that I am being rather critical here, but before investing much more staff time into this approach, I think we must check whether it is realistic to expect that it could become useful for Wikidata at some point. I have supported this work as research, and I still do, but the impression gained so far in the "wild" is really not very encouraging.
Best regards,
Markus
On 22.12.2016 18:30, Stas Malyshev wrote:
> Hi!
>
>> A thing I was wondering about while testing LDF is how this type of service might behave under load. In the tests I am doing, my single browser issues several hundred thousand requests for a single query, at an average rate close to 100 requests per second. This is one user.
>
> I was wondering this too. Now, pattern fragment requests are much easier than SPARQL - as far as I can see, they go directly to the index: no query parsing, no plan building, no complex calculations, joins, etc. I haven't seen any noticeable change in the load when the tests were run yesterday (I've run several, and Markus did too).
>
> That said, the answer to the question of whether the server can handle the load required for practical LDF usage is a resounding "I don't know". So far I haven't seen any signs of it being problematic (with SPARQL it's pretty apparent when some queries are a problem, and I haven't seen anything like that). But we haven't seen any serious usage yet, and I'm not sure what usage patterns to look for, since it's entirely client-driven.
>
> We do have caching in front of it, though I am not sure how effective it would be - after all, we're talking about 1.5 billion triples, each member of which can appear in patterns, in various combinations and with various output formats, pagination etc. I am not entirely sure whether a naive URL-based cache would do a lot here.
>
> It is possible to have more horizontal-scale replication - i.e. adding servers - of course, at the cost of hardware, which inevitably raises the question of budget -
>> It seems one might need a sizeable caching/replication/sharding infrastructure to cope with this load as soon as more than a few users issue manual queries. The current Wikidata SPARQL service handles about 20-30 queries per second on average. If you have this rate, and you expect that an LDF query takes 30 sec to answer on average (being optimistic here compared to my experience so far), you will have about 600-900 active queries at each moment, for a rate of 60,000 to 90,000 requests per second.
>
> Note again that LDF requests would typically be very short in duration (since they produce only 100 items per page), and we still do have parallel connection limits :) But again, I'm not sure how it would behave under typical load, one of the reasons being that I don't even know what typical load for such an API is. I guess I'll have to monitor it carefully, see if there are signs of trouble, and deal with it then.
>
> I plan to do some light load testing just to have at least baseline measures, but until we know what the real usage looks like it will all be guesswork, I think.
>
>> (e.g., what if the database is updated during this time?). I got some failed requests during this query, too, but apparently they did not affect my result.
>
> Some of the failures may be because of parallel connection limits; I'm not sure how many parallel requests the JS client produces - it uses web workers, but I haven't found how the parallelism is controlled.