I have to say that I am dubious.
How often does *exactly* the same query get run within 2 minutes?
Does the same query ever get run ?
The first thing to do, surely, is to create a hash for each query (or
better, perhaps, something like a tinyurl, so that the lookup is
reversible), record a timestamp for that hash each time the query is run,
and then see, even over a period of a month, how many queries (if any)
are being re-run, and if so how often.
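As a rough illustration of that measurement, here is a minimal Python sketch. The names QueryLog, query_key, record, rerun_gaps and reruns_within are hypothetical, and it assumes the query text is available wherever requests are logged. It hashes the exact query text, keeps the text alongside the hash so the lookup stays reversible, and reports the gaps between repeated runs.

```python
import hashlib
import time
from collections import defaultdict


def query_key(sparql):
    """Hash the exact query text; any difference yields a different key."""
    return hashlib.sha256(sparql.encode("utf-8")).hexdigest()


class QueryLog:
    """Hypothetical in-memory log of when each distinct query was run."""

    def __init__(self):
        self.runs = defaultdict(list)  # key -> list of timestamps (seconds)
        self.text = {}                 # key -> original query (reverse lookup)

    def record(self, sparql, when=None):
        """Record one run of a query, defaulting to the current time."""
        key = query_key(sparql)
        self.text.setdefault(key, sparql)
        self.runs[key].append(time.time() if when is None else when)
        return key

    def rerun_gaps(self):
        """For each query run more than once, the seconds between runs."""
        gaps = {}
        for key, stamps in self.runs.items():
            if len(stamps) > 1:
                ordered = sorted(stamps)
                gaps[key] = [b - a for a, b in zip(ordered, ordered[1:])]
        return gaps

    def reruns_within(self, window_seconds):
        """Count re-runs that arrived within window_seconds of the prior run."""
        return sum(
            1
            for gaps in self.rerun_gaps().values()
            for gap in gaps
            if gap <= window_seconds
        )
```

After a month of logging, reruns_within(120) compared with the total number of recorded runs would give an upper bound on the hit rate a 2-minute cache could have achieved.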
I can imagine it's possible that particular tracking queries might be
re-run (but probably (a) not every two minutes; and (b) not wanting the
same result as last time).
Also, perhaps queries with a published link might get re-run, e.g. if
somebody posts the link for a query-generated graph on Twitter and it
gets a lot of re-tweets. (Or even just if Lydia posts it in the news of
the week.)
For queries like that, caching might well make sense (and save the
server a potential slashdotting).
I'd guess there's probably only a very few queries like that though.
Possibly it's only worth caching a set of results if the same query has
*already* been requested within the last n minutes?
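One way to picture that policy is the Python sketch below. It is only an illustration under stated assumptions: RepeatOnlyCache, run_query and clock are hypothetical names, the store is a plain in-memory dict with no eviction, and nothing here reflects how Varnish or the actual service would do it. A result is stored only when the same query was already requested within the window, so one-off queries never occupy the cache.

```python
import time


class RepeatOnlyCache:
    """Cache a result only once a query repeats within window_seconds."""

    def __init__(self, window_seconds, run_query, clock=time.time):
        self.window = window_seconds
        self.run_query = run_query  # hypothetical callable: query -> result
        self.clock = clock          # injectable so the policy can be tested
        self.last_seen = {}         # query -> time of most recent request
        self.cached = {}            # query -> (time stored, result)

    def get(self, query):
        now = self.clock()
        entry = self.cached.get(query)
        if entry is not None and now - entry[0] <= self.window:
            # Fresh cached result: serve it without touching the backend.
            self.last_seen[query] = now
            return entry[1]
        # Missing or stale: drop any old entry and run the query for real.
        self.cached.pop(query, None)
        result = self.run_query(query)
        previous = self.last_seen.get(query)
        if previous is not None and now - previous <= self.window:
            # Second request inside the window: now it is worth caching.
            self.cached[query] = (now, result)
        self.last_seen[query] = now
        return result
```

With this policy the first two requests for a popular query both reach the backend, and only the third and later ones inside the window are served from the cache.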
-- James
On 16/02/2016 22:47, Stas Malyshev wrote:
> Hi!
>
> With Wikidata Query Service usage rising and more use cases being
> found, it is time to consider caching infrastructure for results, since
> queries are expensive. One of the questions I would like to solicit
> feedback on is the following:
>
> Should we have default SPARQL endpoint cached or uncached? If cached,
> which default cache duration would be good for most users? The cache, of
> course, applies to the results of the same (identical) query only.
> Please also note the following is not an implementation plan, but rather
> an opinion poll; whatever we end up deciding, we will make an
> announcement with the actual plan before we do it.
>
> Also, whichever default we choose, there should be a possibility to get
> both cached and uncached results. The question is when you access the
> endpoint with no options, which one would it be. So possible variants are:
>
> 1. query.wikidata.org/sparql is uncached, to get cached result you use
> something like query.wikidata.org/sparql?cached=120 to get result no
> older than 120 seconds ago.
> PRO: least surprise for default users.
> CON: relies on goodwill of tool writers, if somebody doesn't know about
> cache option and uses the same query heavily, we would have to ask them
> to use the parameter.
>
> 2. query.wikidata.org/sparql is cached for short duration (e.g. 1
> minute) by default, if you'd like fresh result, you do something like
> query.wikidata.org/sparql?cached=0. If you're fine with older result,
> you can use query.wikidata.org/sparql?cached=3600 and get cached result
> if it's still in cache but by default you never get result older than 1
> minute. This of course assuming Varnish magic can do this, if not, the
> scheme has to be amended.
> PRO: performance improvement while keeping default results reasonably fresh
> CON: it is not obvious that result is not the freshest data but can be
> stale, so if you update something in Wikidata and query again within a
> minute, you can be surprised
>
> 3. query.wikidata.org/sparql is cached for long duration (e.g. hours) by
> default, if you'd like fresher result you do something like
> query.wikidata.org/sparql?cached=120 to get result no older than 2
> minutes, or cached=0 if you want uncached one.
> PRO: best performance improvement for most queries, works well with
> queries that display data that rarely changes, such as lists, etc.
> CON: for people not knowing about cache option, it may be rather
> confusing to not be able to get up-to-date results.
>
> So we'd like to hear - especially from current SPARQL endpoint users -
> what do you think about these and which would work for you?
>
> Also, for the users of the WDQS GUI - provided we have cached and
> uncached options, which one should the GUI return by default? Should it
> be always uncached? Performance there is not a major question - the
> traffic to the GUI is pretty low - but rather convenience. Of course, if
> you run cached query from GUI and the data is in cache, you can get results
> much faster for some queries. OTOH, it may be important in many cases to
> be able to access actual content up-to-date, not the cached version.
>
> I also created a poll: https://phabricator.wikimedia.org/V8
> so please feel free to vote for your favorite option.
>
> OK, this letter is long enough already so I'll stop here and wait to
> hear what everybody's thinking.
>
> Thanks in advance,
>
_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata