I agree, we should look at some actual traffic to see how many queries /could/ be cached in a 2/5/10/60 min window. Maybe remove the example queries from those numbers, to separate the "production" and testing usage. Also, look at query runtime; if only "cheap" queries would be cached, there is no point in caching.
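
Something like the following could give a first estimate. It is only a sketch: the log format (timestamp, query text, runtime), the `min_runtime` cutoff, and the "entry expires `ttl` seconds after it was fetched" rule are all assumptions on my part, not anything the service is known to provide.

```python
from typing import Iterable, NamedTuple


class LogEntry(NamedTuple):
    timestamp: float   # seconds since epoch (assumed log field)
    query: str         # SPARQL text exactly as sent
    runtime: float     # seconds the query took to run (assumed log field)


def cacheable_fraction(entries: Iterable[LogEntry], ttl: float,
                       min_runtime: float = 0.0) -> float:
    """Fraction of requests that a cache with the given TTL would have served.

    A request counts as a hit if the identical query was fetched (as a miss)
    at most `ttl` seconds earlier. Queries cheaper than `min_runtime` are
    left out entirely, since caching them saves little.
    """
    cached_at = {}     # query text -> time of the fetch currently in cache
    hits = total = 0
    for e in sorted(entries, key=lambda e: e.timestamp):
        if e.runtime < min_runtime:
            continue
        total += 1
        fetched = cached_at.get(e.query)
        if fetched is not None and e.timestamp - fetched <= ttl:
            hits += 1
        else:
            cached_at[e.query] = e.timestamp   # miss: (re)fill the cache
    return hits / total if total else 0.0
```

Calling it with `ttl` set to 120, 300, 600 and 3600 gives one number per window; the example queries can be filtered out of `entries` beforehand to keep testing traffic separate.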

If caching would lead to significant savings, option 2 sounds sensible. Some people will get upset if their results aren't up-to-the-second, and being able to shift the blame to "server defaults" would be convenient ;-)

Option 3 sounds bad, because everyone and their cousin will just add an override to their tools to prevent hours-old data from being served to surprised users. WDQ has a ~10-15 min lag; that's about as much as people can stomach.

Once you run a query, you know both the runtime and the result size. Maybe expensive queries with a huge result set could be cached longer by default, and cheap/small queries not at all? If you expect your recent Wikidata edit to change the results from 3 to 4, you should see that ASAP; if the change would be from 50,000 to 50,001, it seems less critical somehow.
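
As a sketch of what such a rule might look like (the function name and every threshold below are made-up placeholders that would need tuning against real traffic):

```python
def default_ttl(runtime_s: float, result_rows: int) -> int:
    """Return a default cache lifetime in seconds; 0 means "do not cache".

    All thresholds are hypothetical placeholders, not measured values.
    """
    if runtime_s < 1.0 and result_rows < 100:
        return 0        # cheap and small: always serve fresh results
    if runtime_s >= 10.0 and result_rows >= 10_000:
        return 600      # expensive and huge: one more row matters little
    return 60           # everything in between: short default cache
```

With these placeholder numbers, a 0.2 s query returning 3 rows is never cached, while a 25 s query returning 50,000 rows is kept for ten minutes.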

On Tue, Feb 16, 2016 at 11:19 PM James Heald <j.heald@ucl.ac.uk> wrote:

I have to say that I am dubious.

How often does *exactly* the same query get run within 2 minutes ?

Does the same query ever get run ?

The first thing to do, surely, is to create a hash for each query (or
better, perhaps, something like a tinyurl, so that the lookup is
reversible), record a timestamp for that hash each time the query is run,
and then see, even over a period of a month, how many (if any) queries are
being re-run, and if so how often.
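
A minimal sketch of that bookkeeping, assuming SHA-1 over the exact query text as the hash and an in-memory table to keep the lookup reversible (the class and method names are hypothetical):

```python
import hashlib
from collections import defaultdict


def query_key(query: str) -> str:
    """Hash of the exact query text (no normalisation is attempted)."""
    return hashlib.sha1(query.encode("utf-8")).hexdigest()


class QueryLog:
    def __init__(self):
        self.text = {}                   # key -> query text, so keys are reversible
        self.seen = defaultdict(list)    # key -> timestamps of every run

    def record(self, query, timestamp):
        """Note one run of `query` at `timestamp` (seconds) and return its key."""
        key = query_key(query)
        self.text.setdefault(key, query)
        self.seen[key].append(timestamp)
        return key

    def rerun_gaps(self):
        """Map each re-run query key to the gaps (seconds) between its runs."""
        gaps = {}
        for key, stamps in self.seen.items():
            if len(stamps) > 1:
                ordered = sorted(stamps)
                gaps[key] = [b - a for a, b in zip(ordered, ordered[1:])]
        return gaps
```

After a month of `record()` calls, `rerun_gaps()` lists only the queries that were run more than once, together with how far apart the runs were.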

I can imagine it's possible that particular tracking queries might be
re-run (but probably (a) not every two minutes; and (b) not wanting the
same result as last time).

Also perhaps queries with a published link might get re-run -- eg if
somebody posts the link for a query-generated graph on twitter that gets
a lot of re-tweets. (Or even just if Lydia posts it in the news of the
week).

For queries like that, caching might well make sense (and save the
server a potential slashdotting).

I'd guess there's probably only a very few queries like that though.

Possibly it's only worth caching a set of results if the same query has
*already* been requested within the last n minutes ?
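
Roughly, that could work as sketched below. The class name, the separate `ttl` for stored results, and the caller-supplied clock are assumptions made for illustration; nothing here evicts old entries, which a real cache would need to do.

```python
class SecondRequestCache:
    """Store a result only if the same query was already requested
    within the last `window` seconds."""

    def __init__(self, window, ttl):
        self.window = window    # the "n minutes" above, in seconds
        self.ttl = ttl          # how long a stored result stays valid (assumed)
        self.last_seen = {}     # query -> time of the most recent request
        self.store = {}         # query -> (stored_at, result)

    def get(self, query, now, run):
        """Return (result, from_cache); `run` is a callable that executes the query."""
        entry = self.store.get(query)
        previous = self.last_seen.get(query)
        self.last_seen[query] = now
        if entry is not None and now - entry[0] <= self.ttl:
            return entry[1], True
        result = run(query)
        if previous is not None and now - previous <= self.window:
            self.store[query] = (now, result)   # seen recently: worth keeping
        else:
            self.store.pop(query, None)         # one-off query: do not keep
        return result, False
```

A query that is only ever run once never occupies cache space; the second request within the window pays the full cost but fills the cache for the third.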

-- James

On 16/02/2016 22:47, Stas Malyshev wrote:
> Hi!
>
> With Wikidata Query Service usage rising and more use cases being
> found, it is time to consider caching infrastructure for results, since
> queries are expensive. One of the questions I would like to solicit
> feedback on is the following:
>
> Should we have default SPARQL endpoint cached or uncached? If cached,
> which default cache duration would be good for most users? The cache, of
> course, applies to the results of the same (identical) query only.
> Please also note the following is not an implementation plan, but rather
> an opinion poll, whatever we end up deciding we will have an
> announcement with actual plan before we do it.
>
> Also, whichever default we choose, there should be a possibility to get
> both cached and uncached results. The question is when you access the
> endpoint with no options, which one would it be. So possible variants are:
>
> 1. query.wikidata.org/sparql is uncached, to get cached result you use
> something like query.wikidata.org/sparql?cached=120 to get result no
> older than 120 seconds ago.
> PRO: least surprise for default users.
> CON: relies on goodwill of tool writers, if somebody doesn't know about
> cache option and uses the same query heavily, we would have to ask them
> to use the parameter.
>
> 2. query.wikidata.org/sparql is cached for short duration (e.g. 1
> minute) by default, if you'd like fresh result, you do something like
> query.wikidata.org/sparql?cached=0. If you're fine with older result,
> you can use query.wikidata.org/sparql?cached=3600 and get cached result
> if it's still in cache but by default you never get result older than 1
> minute. This of course assuming Varnish magic can do this, if not, the
> scheme has to be amended.
> PRO: performance improvement while keeping default results reasonably fresh
> CON: it is not obvious that result is not the freshest data but can be
> stale, so if you update something in Wikidata and query again within
> a minute, you can be surprised
>
> 3. query.wikidata.org/sparql is cached for long duration (e.g. hours) by
> default, if you'd like fresher result you do something like
> query.wikidata.org/sparql?cache=120 to get result no older than 2
> minutes, or cache=0 if you want uncached one.
> PRO: best performance improvement for most queries, works well with
> queries that display data that rarely changes, such as lists, etc.
> CON: for people not knowing about cache option, it may be rather
> confusing to not be able to get up-to-date results.
>
> So we'd like to hear - especially from current SPARQL endpoint users -
> what do you think about these and which would work for you?
>
> Also, for the users of the WDQS GUI - provided we have cached and
> uncached options, which one should the GUI return by default? Should it
> be always uncached? Performance there is not a major question - the
> traffic to the GUI is pretty low - but rather convenience. Of course, if
> you run a cached query from the GUI and the data is in cache, you can get results
> much faster for some queries. OTOH, it may be important in many cases to
> be able to access actual content up-to-date, not the cached version.
>
> I also created a poll: https://phabricator.wikimedia.org/V8
> so please feel free to vote for your favorite option.
>
> OK, this letter is long enough already so I'll stop here and wait to
> hear what everybody's thinking.
>
> Thanks in advance,
>
