Hi,
some random comments:
(1) Are there any concrete cases of applications that need
"super-up-to-date" results (where 120 sec is too old)? I do not
currently run, or foresee running, any such application. Moreover, I
think you have to allow at least 60 sec for an update to make it into
the RDF database, so 120 sec already seems very close to the best
freshness you could get at all. My applications would be fine with
getting updates every 10 min.
(2) Shouldn't BlazeGraph do the caching (too)? It knows how much a query
costs to re-run, and it could even know whether a query is affected by a
data update (a cached result may still equal the current result even
after many data changes). Having several caching layers is useful, but
the more elaborate (query-structure-dependent) caching strategies should
perhaps be left to the database.
(3) I suspect queries follow a long-tailish distribution (probably with
some impurities), where a few queries are very frequent but most queries
are rather rare. If this is true, then the caching should cut off the
peak at the high end: the queries that run >100 or >1000 times per hour.
This will already work well with a relatively short caching time. For
example, with a 120 sec caching time, a query can run at most 30 times
per hour. You could go to 300 sec as well for at most 12 times per hour.
Any query that you cannot afford to run 12 times per hour might have
problems with or without a cache.
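The arithmetic behind (3) can be sketched in a few lines; the helper
name below is hypothetical, just to make the upper bound explicit:

```python
def max_runs_per_hour(ttl_seconds: int) -> int:
    """With a cache TTL of ttl_seconds, an identical query can reach
    the backend at most once per TTL window, i.e. 3600 / TTL times
    per hour (a hypothetical helper, not part of any real API)."""
    return 3600 // ttl_seconds

print(max_runs_per_hour(120))  # -> 30 backend runs per hour at most
print(max_runs_per_hour(300))  # -> 12 backend runs per hour at most
```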
(4) In addition to balancing regular use as in (3), caching can also be
vital to absorb sudden bursts of activity (a trending new Web
application, a crawler that goes wild on another site, a developer who
tries a new tool). Again, short caching intervals will be effective for
this.
(5) I don't think you can get much benefit from caching costly,
low-frequency queries. You would need a much longer caching interval to
catch them, and would still only use the cache once or twice per query.
Points (3)-(5) are based on guessing. As Magnus said, some analysis
could help to confirm or refute them. On the other hand, caching should
not focus only on current usage patterns, but also consider a bit what
could happen in the future.
Cheers,
Markus
On 16.02.2016 23:47, Stas Malyshev wrote:
Hi!
With Wikidata Query Service usage rising and more use cases being
found, it is time to consider caching infrastructure for results, since
queries are expensive. One of the questions I would like to solicit
feedback on is the following:
Should the default SPARQL endpoint be cached or uncached? If cached,
which default cache duration would be good for most users? The cache, of
course, applies only to the results of the same (identical) query.
Please also note that the following is not an implementation plan but
rather an opinion poll; whatever we end up deciding, we will announce
the actual plan before we do anything.
Also, whichever default we choose, there should be a possibility to get
both cached and uncached results. The question is which one you get when
you access the endpoint with no options. So the possible variants are:
1.
query.wikidata.org/sparql is uncached; to get a cached result you use
something like
query.wikidata.org/sparql?cached=120 to get a result no
older than 120 seconds.
PRO: least surprise for default users.
CON: relies on the goodwill of tool writers; if somebody doesn't know
about the cache option and uses the same query heavily, we would have to
ask them to use the parameter.
2.
query.wikidata.org/sparql is cached for a short duration (e.g. 1
minute) by default; if you'd like a fresh result, you do something like
query.wikidata.org/sparql?cached=0. If you're fine with an older result,
you can use
query.wikidata.org/sparql?cached=3600 and get a cached result
if it's still in the cache, but by default you never get a result older
than 1 minute. This of course assumes Varnish magic can do this; if not,
the scheme has to be amended.
PRO: performance improvement while keeping default results reasonably
fresh.
CON: it is not obvious that the result may be stale rather than the
freshest data, so if you update something in Wikidata and query again
within a minute, you can be surprised.
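The intended "result no older than N seconds" semantics can be sketched
as follows. This is only an illustration; the class and function names
(ResultCache, run_query) are made up, and in reality the caching layer
would be Varnish, not application code:

```python
import time

class ResultCache:
    """Minimal sketch of per-query result caching with a caller-chosen
    maximum age (hypothetical names; not the real implementation)."""

    def __init__(self):
        self._store = {}  # query text -> (timestamp, result)

    def get(self, query, max_age, run_query):
        now = time.time()
        entry = self._store.get(query)
        if entry is not None and max_age > 0:
            ts, result = entry
            if now - ts <= max_age:
                return result  # cached copy is fresh enough
        # Cache miss, entry too old, or caller asked for max_age=0:
        # run the query against the backend and refresh the cache.
        result = run_query(query)
        self._store[query] = (now, result)
        return result
```

Under this semantics, cached=0 always reaches the backend, cached=120
tolerates results up to two minutes old, and a larger value like 3600
simply widens the tolerance without ever serving anything older than
what the caller asked for.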
3.
query.wikidata.org/sparql is cached for a long duration (e.g. hours) by
default; if you'd like a fresher result you do something like
query.wikidata.org/sparql?cached=120 to get a result no older than 2
minutes, or cached=0 if you want an uncached one.
PRO: best performance improvement for most queries; works well with
queries that display data that rarely changes, such as lists, etc.
CON: for people who don't know about the cache option, it may be rather
confusing to not be able to get up-to-date results.
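In all three variants, a client would just add one extra query-string
parameter. A sketch of how a tool might build such a request URL
(assuming the "cached" parameter name from the proposal, which is not
final, and the standard SPARQL protocol "query" parameter):

```python
from urllib.parse import urlencode

ENDPOINT = "https://query.wikidata.org/sparql"

def build_url(sparql, max_age=None):
    """Build a request URL; max_age=None omits the (proposed, not
    final) "cached" parameter and takes whatever the default is."""
    params = {"query": sparql}
    if max_age is not None:
        params["cached"] = max_age
    return ENDPOINT + "?" + urlencode(params)

# Accept results up to two minutes old:
print(build_url("SELECT * WHERE { ?s ?p ?o } LIMIT 1", max_age=120))
```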
So we'd like to hear - especially from current SPARQL endpoint users -
what you think about these and which would work for you.
Also, for the users of the WDQS GUI - provided we have cached and
uncached options, which one should the GUI return by default? Should it
always be uncached? Performance there is not a major question - the
traffic to the GUI is pretty low - but rather convenience. Of course, if
you run a cached query from the GUI and the data is in the cache, you
can get results much faster for some queries. OTOH, it may be important
in many cases to be able to access the actual up-to-date content, not
the cached version.
I also created a poll:
https://phabricator.wikimedia.org/V8
so please feel free to vote for your favorite option.
OK, this letter is long enough already, so I'll stop here and wait to
hear what everybody thinks.
Thanks in advance,