Hi Stas, Ruben, Thad,
Collected answers:
* Bandwidth: I think this is not the main issue. The slowness is a bonus
here, since it stretches the time during which the actual data is
loaded. I am not sure but would guess that my 1h20min query has not
received much more than 100MB of data.
* Caching: Ruben's argument that this is a "finite set of URLs to
cache" is misleading and, I believe, not true. Here is why:
(1) It is not true since we need to cache triple patterns taken from
queries, not patterns from the data. For example, there are infinitely
many different strings and for each you could look for all Wikidata
entities with this label (in fact, this is a real use case that is quite
common for WDQS, since people check for labels that are not in the data
when doing database integration). Even if there are zero results, this
must either be computed or returned from cache.
(2) It is misleading since the number of triple patterns is astronomical
even when restricting to URLs and literals that exist in Wikidata:
-> Number of Item-Property-Item triple patterns with one variable:
23M x 829 x 2 + 23M x 23M = 529 * 10^12
(0.5 Quadrillion)
(829 is the current number of item properties in Wikidata)
Each pattern may lead to several URLs, since large result sets are
split into pages. The total number of URLs is therefore slightly bigger
(not by much, since most patterns have no results -- but even empty
results still need to be cached!).
So the situation of "what can I cache in practice" vs. "what would I
need to cache in theory" is the same as in SPARQL: there is far too much
to cache to ever get anywhere near completeness.
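The count above can be reproduced in a few lines of Python (the 23M
items and 829 properties are the rounded figures quoted in the text):

```python
# Back-of-the-envelope count of Item-Property-Item triple patterns
# with exactly one variable, using the rounded figures from the text.
items = 23_000_000    # ~23M Wikidata items
properties = 829      # current number of item properties

# Variable in subject or object position: (?s P O) and (S P ?o)
subject_or_object = 2 * items * properties
# Variable in predicate position: (S ?p O)
predicate = items * items

total = subject_or_object + predicate
print(f"{total:,}")   # 529,038,134,000,000 -- about 0.5 quadrillion
```

The 23M x 23M predicate-variable term dominates; the first term adds
less than 0.01%.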
* Can caching still help? In a real-world setting, you would like to
update your data once in a while. The real question for caching is what
percentage of your requests is served from cache *during this time*. In
the calculation I made, there were around 100K requests per minute. If
we take one minute as the interval between data updates (as in WDQS
now), then the chances that a request repeats seem pretty slim. One
could hope that there are only a few distinct requests and that they
therefore repeat a lot. My test suggests that this is not so likely,
since my single query generated hundreds of thousands of requests, which
I assume were all different (why fetch the same data twice in one query
execution?).
In any case, the relative benefit of caching will surely be much smaller
for TPF (where you cache an easy-to-compute but large answer) than for
SPARQL endpoints (where you cache a hard-to-compute but often small
answer). Also note that the total number of SPARQL queries one needs to
cache is much (much!) smaller than the number of TPF replies one would
cache for answering them.
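To put a rough number on those "slim chances", here is an illustrative
sketch. It assumes, purely for the sake of argument, that requests are
drawn uniformly at random from the pool of distinct patterns estimated
above -- real traffic is skewed toward popular patterns, so treat this
as an order-of-magnitude illustration, not a prediction:

```python
import math

def expected_hit_fraction(R: int, N: float) -> float:
    """Expected fraction of cache hits when R requests are drawn
    uniformly at random from N equally likely distinct requests."""
    # Numerically stable form of 1 - N * (1 - (1 - 1/N)**R) / R:
    # the expected number of distinct requests among the R draws is
    # N * (1 - (1 - 1/N)**R); every draw beyond that is a cache hit.
    distinct = -N * math.expm1(R * math.log1p(-1.0 / N))
    return 1.0 - distinct / R

R = 100_000   # requests arriving within one update interval (from the text)
N = 5.29e14   # distinct triple patterns estimated above (assumption)
print(expected_hit_fraction(R, N))  # roughly R/(2N), on the order of 1e-10
```

Even if the effective pool were a million times smaller than the
theoretical count, the hit fraction would still be well under 0.1%.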
* Are request numbers no problem because requests are simple? This
really depends on the database implementation behind the service. Surely
it helps if the database has less work. But there is also the notion of
"thrashing" a database, where you destroy performance not by difficult
queries but by large numbers of simple ones. Moreover, other components
of the server could also be challenged, such as logging or other types
of filtering that happen before you even get to the DB.
I realise that I am being rather critical here, but before investing
much more staff time into this, I think we must check whether it is
realistic to assume that this approach could become useful for Wikidata
at some point. I have supported this work as research, and I still do,
but the impression we have got so far in the "wild" is really not very
encouraging.
Best regards,
Markus
On 22.12.2016 18:30, Stas Malyshev wrote:
Hi!
A thing I was wondering about while testing LDF is how this type of
service might behave under load. In the tests I am doing, my single
browser issues several 100,000 requests for a single query, at an
average rate close to 100 requests per second. This is one user.
I was wondering this too. Now, pattern fragment requests are much easier
than SPARQL - as far as I can see, they go directly to the index, no
query parsing, no plan building, no complex calculations, joins, etc. I
haven't seen any noticeable change in the load when the tests were run
yesterday (I've run several, and Markus did too).
That said, the answer to the question whether the server can handle the
load required for practical LDF usage is a resounding "I don't know". So
far I haven't seen any signs of it being problematic (with SPARQL it's
pretty apparent when some queries are a problem, haven't seen anything
like that). But we haven't seen any serious usage yet, and I'm not sure
what usage patterns to look for, since it's entirely client-driven.
We do have caching in front of it, though I am not sure how effective it
would be - after all, we're talking about 1.5 billion triples, each
member of which can appear in patterns, in various combinations and
with various output formats, pagination, etc. I am not entirely sure
whether a naive URL-based cache would do a lot here.
It is possible to have more horizontal-scale replication - i.e. adding
servers - of course, at the cost of hardware, which inevitably raises
the question of budget.
It seems one might need a sizeable caching/replication/sharding
infrastructure to cope with this load as soon as more than a few users
issue manual queries. The current Wikidata SPARQL service handles about
20-30 queries per second on average. If you have this rate, and you
expect that an LDF query is taking 30sec to answer on average (being
optimistic here compared to my experience so far), you will have about
600-900 active queries at each moment, for a rate of 60,000 to 90,000
requests per second.
Note again that LDF queries would typically be very short in duration
(since they produce only 100 items per page) and we still do have
parallel connection limits :) But again, I'm not sure how it would
behave under typical load, one of the reasons being I don't even know
what typical load for such an API is. I guess I'll have to monitor it
carefully and see if there are signs of trouble and deal with it then.
I plan to do some light load testing just to have at least baseline
measures, but until we know what the real usage looks like, it will all
be guesswork, I think.
(e.g., what if the database is updated during this time?). I got some
failed requests during this query, too, but apparently they did not
affect my result.
Some of the failures may be because of parallel connection limits, I'm
not sure how many parallel requests the JS client produces - it uses web
workers but I haven't found how the parallelism is controlled.