Hi Stas, Ruben, Thad,
Collected answers:
* Bandwidth: I think this is not the main issue. The slowness is a bonus
here, since it stretches the time during which the actual data is
loaded. I am not sure but would guess that my 1h20min query has not
received much more than 100MB of data.
* Caching: Ruben's argument that this is a "finite set of URLs to
cache" is misleading and, I believe, not true. Here is why:
(1) It is not true since we need to cache triple patterns taken from
queries, not patterns from the data. For example, there are infinitely
many different strings and for each you could look for all Wikidata
entities with this label (in fact, this is a real use case that is quite
common for WDQS, since people check for labels that are not in the data
when doing database integration). Even if there are zero results, this
must either be computed or returned from cache.
(2) It is misleading since the number of triple patterns is astronomical
even when restricting to URLs and literals that exist in Wikidata:
-> Number of Item-Property-Item triple patterns with one variable:
23M x 829 x 2 + 23M x 23M = 529 * 10^12
(0.5 Quadrillion)
(829 is the current number of item properties in Wikidata)
Each pattern may lead to several URLs, since large result sets are
split into pages. The total number of URLs is therefore slightly bigger
(not by much, since most patterns have no results -- but even empty
results still need to be cached!).
So the situation of "what can I cache in practice" vs. "what would I
need to cache in theory" is the same as in SPARQL: there is far too much
to cache to ever get anywhere near completeness.
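The count above can be reproduced in a few lines of Python (the 23M
items and 829 properties are the rounded figures quoted in the text):

```python
# Back-of-the-envelope count of Item-Property-Item triple patterns
# with exactly one variable, using the rounded figures from the text.
items = 23_000_000    # ~23M Wikidata items
properties = 829      # current number of item properties

# Variable in subject or object position: (?s P O) and (S P ?o)
subject_or_object = 2 * items * properties
# Variable in predicate position: (S ?p O)
predicate = items * items

total = subject_or_object + predicate
print(f"{total:,}")   # 529,038,134,000,000 -- about 0.5 quadrillion
```

The 23M x 23M predicate-variable term dominates; the first term adds
less than 0.01%.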
* Can caching still help? In a real-world setting, you would like to
update your data once in a while. The real question for caching is what
percentage of your requests is served from cache *during this time*. In
the calculation I made, there were around 100K requests per minute. If
we take one minute as the interval between data updates (as in WDQS
now), then the chances that a request repeats seem pretty slim. One
could hope that there are only a few distinct requests and that they
therefore repeat a lot. My test suggests that this is not so likely,
since my single query generated hundreds of thousands of requests, which
I assume were all different (why fetch the same data twice in one query
execution?).
In any case, the relative benefit of caching will surely be much smaller
for TPF (where you cache an easy-to-compute but large answer) than for
SPARQL endpoints (where you cache a hard-to-compute but often small
answer). Also note that the total number of SPARQL queries one needs to
cache is much (much!) smaller than the number of TPF replies one would
cache for answering them.
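To put a rough number on those "slim chances", here is an illustrative
sketch. It assumes, purely for the sake of argument, that requests are
drawn uniformly at random from the pool of distinct patterns estimated
above -- real traffic is skewed toward popular patterns, so treat this
as an order-of-magnitude illustration, not a prediction:

```python
import math

def expected_hit_fraction(R: int, N: float) -> float:
    """Expected fraction of cache hits when R requests are drawn
    uniformly at random from N equally likely distinct requests."""
    # Numerically stable form of 1 - N * (1 - (1 - 1/N)**R) / R:
    # the expected number of distinct requests among the R draws is
    # N * (1 - (1 - 1/N)**R); every draw beyond that is a cache hit.
    distinct = -N * math.expm1(R * math.log1p(-1.0 / N))
    return 1.0 - distinct / R

R = 100_000   # requests arriving within one update interval (from the text)
N = 5.29e14   # distinct triple patterns estimated above (assumption)
print(expected_hit_fraction(R, N))  # roughly R/(2N), on the order of 1e-10
```

Even if the effective pool were a million times smaller than the
theoretical count, the hit fraction would still be well under 0.1%.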
* Are request numbers no problem because requests are simple? This
really depends on the database implementation behind the service. Surely
it helps if the database has less work. But there is also the notion of
"thrashing" a database, where you destroy performance not by difficult
queries but by large numbers of simple ones. Moreover, other components
of the server could also be challenged, such as logging or other types
of filtering that happen before you even get to the DB.
I realise that I am being rather critical here, but before investing
much more staff time into this, I think we must check whether it is
realistic to assume that this approach could become useful for Wikidata
at some point. I have supported this work as research, and I still do,
but the impression we have got so far in the "wild" is really not very
encouraging.
Best regards,
Markus
On 22.12.2016 18:30, Stas Malyshev wrote:
Hi!
A thing I was wondering about while testing LDF is how this type of
service might behave under load. In the tests I am doing, my single
browser issues several 100,000 requests for a single query, at an
average rate close to 100 requests per second. This is one user.
I was wondering this too. Now, pattern fragment requests are much easier
than SPARQL - as far as I can see, they go directly to the index, no
query parsing, no plan building, no complex calculations, joins, etc. I
haven't seen any noticeable change in the load when the tests were run
yesterday (I've run several, and Markus did too).
That said, the answer to the question whether the server can handle the
load required for practical LDF usage is a resounding "I don't know". So
far I haven't seen any signs of it being problematic (with SPARQL it's
pretty apparent when some queries are a problem, haven't seen anything
like that). But we haven't seen any serious usage yet, and I'm not sure
what usage patterns to look for, since it's entirely client-driven.
We do have caching in front of it, though I am not sure how effective it
would be - after all, we're talking about 1.5 billion triples, each
member of which can appear in patterns, in various combinations and
with various output formats, pagination, etc. I am not entirely sure
whether a naive URL-based cache would do a lot here.
It is possible to have more horizontal-scale replication - i.e. adding
servers - of course, at the cost of hardware, which inevitably raises
the question of budget.
It seems one might need a sizeable caching/replication/sharding
infrastructure to cope with this load as soon as more than a few users
issue manual queries. The current Wikidata SPARQL service handles about
20-30 queries per second on average. If you have this rate, and you
expect that an LDF query is taking 30sec to answer on average (being
optimistic here compared to my experience so far), you will have about
600-900 active queries at each moment, for a rate of 60,000 to 90,000
requests per second.
Note again that LDF queries would typically be very short in duration
(since they produce only 100 items per page) and we still do have
parallel connection limits :) But again, I'm not sure how it would
behave under typical load, one of the reasons being I don't even know
what typical load for such an API is. I guess I'll have to monitor it
carefully and see if there are signs of trouble and deal with it then.
I plan to do some light load testing just to have at least baseline
measures, but until we know what the real usage looks like, it will all
be guesswork, I think.
(e.g., what if the database is updated during this time?). I got some
failed requests during this query, too, but apparently they did not
affect my result.
Some of the failures may be because of parallel connection limits, I'm
not sure how many parallel requests the JS client produces - it uses web
workers but I haven't found how the parallelism is controlled.