I agree, we should look at some actual traffic to see how many queries
/could/ be cached in a 2/5/10/60 min window. Maybe remove the example
queries from those numbers, to separate the "production" and testing usage.
Also, look at query runtime; if only "cheap" queries would be cached, there
is no point in caching.
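A first pass over the logs could be as simple as counting, for each window size, how many requests repeat an identical query seen shortly before. A minimal sketch (the log format and example queries below are made up for illustration):

```python
# Hypothetical access log: (unix timestamp, exact query text), sorted by time.
log = [
    (0,   "SELECT ?cat WHERE { ?cat wdt:P31 wd:Q146 }"),
    (90,  "SELECT ?cat WHERE { ?cat wdt:P31 wd:Q146 }"),
    (400, "SELECT ?cat WHERE { ?cat wdt:P31 wd:Q146 }"),
    (500, "SELECT ?dog WHERE { ?dog wdt:P31 wd:Q144 }"),
]

def cache_hit_rate(log, window_seconds):
    """Fraction of requests that repeat an identical query last seen
    no more than window_seconds ago -- i.e. potential cache hits."""
    last_seen = {}
    hits = 0
    for ts, query in log:
        prev = last_seen.get(query)
        if prev is not None and ts - prev <= window_seconds:
            hits += 1
        last_seen[query] = ts
    return hits / len(log)

for window in (120, 300, 600, 3600):
    print(f"{window}s window: {cache_hit_rate(log, window):.0%} cacheable")
```

Running the same function over real traffic, with the example queries filtered out first, would give the 2/5/10/60 min numbers directly.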
If caching would lead to significant savings, option 2 sounds sensible.
Some people will get upset if their results aren't up-to-the-second, and
being able to shift the blame to "server defaults" would be convenient ;-)
Option 3 sounds bad, because everyone and their cousin will just add an
override to their tools to prevent hours-old data from being served to
surprised users. WDQ has a ~10-15 min lag; that's about as much as people
seem to tolerate.
Once you run a query, you know both the runtime and the result size. Maybe
expensive queries with a huge result set could be cached longer by default,
and cheap/small queries not at all? If you expect your recent Wikidata edit
to change the results from 3 to 4, you should see that ASAP; if the change
would be 50,000 to 50,001, it seems less critical somehow.
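The cost-based default suggested above could be sketched as a small policy function; all thresholds below are illustrative guesses, not measured values:

```python
def default_ttl(runtime_seconds, result_rows):
    """Pick a default cache TTL from the observed cost of a query:
    expensive queries with huge result sets are cached longer, cheap
    and small ones not at all. Thresholds are invented for illustration."""
    if runtime_seconds < 0.5 and result_rows < 1000:
        return 0          # cheap and small: always serve fresh results
    if runtime_seconds > 10 or result_rows > 50_000:
        return 3600       # expensive or huge: cache for an hour
    return 120            # middle ground: two minutes

print(default_ttl(0.1, 4))       # the 3-to-4 case: uncached
print(default_ttl(30, 50_001))   # the 50,000-to-50,001 case: long TTL
```

Since both runtime and result size are only known after the query has run, this would set the TTL when storing the result, not when deciding whether to look it up.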
On Tue, Feb 16, 2016 at 11:19 PM James Heald <j.heald(a)ucl.ac.uk> wrote:
I have to say that I am dubious.
How often does *exactly* the same query get run within 2 minutes?
Does the same query ever get re-run at all?
The first thing to do, surely, is to create a hash for each query (or
better, perhaps, something like a tinyurl, so that the lookup is
reversible), record a timestamp for that hash each time the query is run,
and then see, even over a period of a month, how many (if any) queries are
being re-run, and if so how often.
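The hash-and-timestamp log proposed above might look something like this sketch (the helper names and the in-memory storage are hypothetical; a real version would persist the log somewhere durable):

```python
import hashlib
import time
from collections import defaultdict

run_log = defaultdict(list)  # query hash -> timestamps of runs

def query_key(query):
    """Stable short key for an exact query text; whitespace is collapsed
    so trivially reformatted copies hash alike. A tinyurl-style id could
    be stored alongside it to make the lookup reversible."""
    normalized = " ".join(query.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]

def record_run(query, ts=None):
    """Append a timestamp for this query's hash (call on every request)."""
    run_log[query_key(query)].append(time.time() if ts is None else ts)

def rerun_stats(min_runs=2):
    """Map of query key -> gaps in seconds between consecutive runs,
    for queries run at least min_runs times."""
    stats = {}
    for key, times in run_log.items():
        if len(times) >= min_runs:
            ordered = sorted(times)
            stats[key] = [b - a for a, b in zip(ordered, ordered[1:])]
    return stats
```

After a month, the distribution of gaps in `rerun_stats()` would show directly whether any realistic cache window would ever get a hit.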
I can imagine it's possible that particular tracking queries might be
re-run (but probably (a) not every two minutes; and (b) not wanting the
same result as last time).
Also perhaps queries with a published link might get re-run -- eg if
somebody posts the link for a query-generated graph on Twitter that gets
a lot of re-tweets. (Or even just if Lydia posts it in the news of the
week.)
For queries like that, caching might well make sense (and save the
server a potential slashdotting).
I'd guess there's probably only a very few queries like that though.
Possibly it's only worth caching a set of results if the same query has
*already* been requested within the last n minutes ?
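That "cache only once the same query has already been requested recently" policy could be sketched like this (the class name and parameters are invented for illustration; a real deployment would also bound memory and evict old entries):

```python
import time

class SecondHitCache:
    """Only start caching a result once the same query has already been
    requested within the last `window` seconds; one-off queries never
    occupy cache space."""

    def __init__(self, window=600, ttl=120):
        self.window = window
        self.ttl = ttl
        self.last_request = {}  # query -> timestamp of previous request
        self.cache = {}         # query -> (timestamp, result)

    def get(self, query, compute, now=None):
        now = time.time() if now is None else now
        entry = self.cache.get(query)
        if entry and now - entry[0] <= self.ttl:
            return entry[1]                    # served from cache
        result = compute()                     # run the query
        prev = self.last_request.get(query)
        if prev is not None and now - prev <= self.window:
            self.cache[query] = (now, result)  # seen recently: cache it
        self.last_request[query] = now
        return result
```

The first request always computes; only a repeat within the window makes the result worth storing, which matches the intuition that a handful of popular (e.g. widely shared) queries dominate any cache benefit.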
On 16/02/2016 22:47, Stas Malyshev wrote:
With Wikidata Query Service usage rising and more use cases being
found, it is time to consider caching infrastructure for results, since
queries are expensive. One of the questions I would like to solicit
feedback on is the following:
Should we have default SPARQL endpoint cached or uncached? If cached,
which default cache duration would be good for most users? The cache, of
course, applies to the results of the same (identical) query only.
Please also note the following is not an implementation plan, but rather
an opinion poll; whatever we end up deciding, we will make an
announcement with the actual plan before we do it.
Also, whichever default we choose, there should be a possibility to get
both cached and uncached results. The question is: when you access the
endpoint with no options, which one do you get? So the possible variants are:

1. Default is uncached; to get a cached result, you use
something like query.wikidata.org/sparql?cached=120 to get a result no
older than 120 seconds.
PRO: least surprise for default users.
CON: relies on the goodwill of tool writers; if somebody doesn't know about
the cache option and uses the same query heavily, we would have to ask them
to use the parameter.
2. Default is cached for a short duration (e.g. 1
minute); if you'd like a fresh result, you do something like
query.wikidata.org/sparql?cached=0. If you're fine with an older result,
you can use query.wikidata.org/sparql?cached=3600 and get a cached result
if it's still in the cache, but by default you never get a result older than 1
minute. This of course assumes Varnish magic can do this; if not, the
scheme has to be amended.
PRO: performance improvement while keeping default results reasonably
fresh.
CON: it is not obvious that the result may not be the freshest data but
stale, so if you update something in Wikidata and query again within a
minute, you can be surprised.
3. Default is cached for a long duration (e.g. hours);
if you'd like a fresher result, you do something like
query.wikidata.org/sparql?cached=120 to get a result no older than 2
minutes, or cached=0 if you want an uncached one.
PRO: best performance improvement for most queries, works well with
queries that display data that rarely changes, such as lists, etc.
CON: for people not knowing about the cache option, it may be rather
confusing not to be able to get up-to-date results.
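The common mechanic behind all three options is a max-age check: serve a stored result only if it is no older than the age the client will accept. A minimal in-memory sketch, assuming the hypothetical cached=N parameter maps to a max_age argument (the class and parameter names are invented; the three options then differ only in the server default):

```python
import time

class MaxAgeCache:
    """Serve a stored result only if it is no older than the max age the
    client asked for; the server default applies when the client sends
    nothing. A max age of 0 always bypasses the cache."""

    def __init__(self, default_max_age=60):
        self.default_max_age = default_max_age
        self.store = {}  # query -> (timestamp, result)

    def get(self, query, compute, max_age=None, now=None):
        now = time.time() if now is None else now
        max_age = self.default_max_age if max_age is None else max_age
        entry = self.store.get(query)
        if entry and max_age > 0 and now - entry[0] <= max_age:
            return entry[1]           # fresh enough for this client
        result = compute()            # run the query and refresh the store
        self.store[query] = (now, result)
        return result
```

In these terms, option 1 is `default_max_age=0`, option 2 is something like `default_max_age=60`, and option 3 is hours; the cached=N parameter overrides the default per request either way.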
So we'd like to hear - especially from current SPARQL endpoint users -
what do you think about these and which would work for you?
Also, for the users of the WDQS GUI - provided we have cached and
uncached options, which one should the GUI return by default? Should it
always be uncached? Performance there is not a major question - the
traffic to the GUI is pretty low - but rather convenience. Of course, if
you run a cached query from the GUI and the data is in the cache, you can
get results much faster for some queries. OTOH, it may be important in
many cases to be able to access actual up-to-date content, not the cached
version.
I also created a poll: https://phabricator.wikimedia.org/V8
so please feel free to vote for your favorite option.
OK, this letter is long enough already so I'll stop here and wait to
hear what everybody's thinking.
Thanks in advance,
Wikidata mailing list