Re: [Wikidata] SPARQL endpoint caching

17 Feb 2016

On 17.02.2016 08:16, Stas Malyshev wrote:
...
  Hi!

  (2) Shouldn't BlazeGraph do the caching (too)? It knows how much a query
 costs to re-run and it could even know if a query is affected by a data

 BlazeGraph does a lot of caching, but it's limited by memory and, AFAIK,
 it does not do whole-query caching (like MySQL does, for example),
 which means that if you run two big queries one after another, the latter
 could evict from the cache what the former put there. Its caching, AFAIK,
 is on a much lower level. That is helpful too, since different queries
 share a lot of underlying data, but it is not exactly our case here.

  update (a cache might still be the same as a current result even after
 many data changes). Having several caching layers is useful, but the
 more elaborate (query-structure dependent) caching strategies should
 maybe be left to the database.

 I don't think Blazegraph does anything like resolving changes to see if
 query results changed; that sounds like a pretty hard thing to do in a
 triple store. You can manually store a specific query result, AFAIK, but
 that's just a form of writing data, as I understand it, and may not be
 very scalable.

Yes, in general this would be extremely hard. There are some easy cases 
one could catch, but it is not clear how effective this would be for our 
load. I am just saying we should not try to build a query-aware caching 
strategy that would better be done on a lower level.
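
To make "not query-aware" concrete, here is a minimal Python sketch of a
whole-result cache that treats the query text as an opaque string. It is
only an illustration under my own assumptions: the class name, the
fake_endpoint stand-in and the example queries are hypothetical, and this
is not how WDQS or Blazegraph actually cache anything.

```python
import hashlib


class OpaqueQueryCache:
    """Whole-result cache that never parses or interprets the query."""

    def __init__(self):
        self._results = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(query_text):
        # The key depends only on the exact text of the query.
        return hashlib.sha256(query_text.encode("utf-8")).hexdigest()

    def get_or_run(self, query_text, run_query):
        k = self.key(query_text)
        if k in self._results:
            self.hits += 1
            return self._results[k]
        self.misses += 1
        result = run_query(query_text)
        self._results[k] = result
        return result


def fake_endpoint(query_text):
    # Hypothetical stand-in for the real SPARQL endpoint.
    return "result for a %d-character query" % len(query_text)


cache = OpaqueQueryCache()
q1 = "SELECT ?item WHERE { ?item wdt:P31 wd:Q5 } LIMIT 10"
q2 = "SELECT ?item WHERE { ?item  wdt:P31  wd:Q5 } LIMIT 10"  # extra spaces only

cache.get_or_run(q1, fake_endpoint)  # miss: first time this text is seen
cache.get_or_run(q1, fake_endpoint)  # hit: identical text
cache.get_or_run(q2, fake_endpoint)  # miss: same meaning, different text
print(cache.hits, cache.misses)
```

Such a layer cannot tell that q1 and q2 are equivalent, nor whether a data
update affects either of them; that kind of knowledge would have to come
from the database itself.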

...

  The points (3)-(5) are based on guessing. As Magnus said, some analysis
 could help to confirm or refute this. On the other hand, caching should
 not just focus on current usage patterns only, but consider a bit what
 could happen in the future.

 Well, again the problem is that the one use case that I think absolutely
 needs caching - namely, exporting data to graphs, maps, etc. deployed on
 wiki pages - is also the one not implemented yet, because we don't have a
 cache (not the only thing, but one of the things we need), so we've got a
 chicken-and-egg problem here :) Of course, we can just choose something
 now based on an educated guess and change it later if it works badly.
 That's probably what we'll do.

Yes, it is hard to predict what load this will create. The caching 
levels around Wikipedia prevent re-computation of the page on most page 
views, so maybe there would not actually be very many repeated requests 
for the same query coming from this side. One option could be a dedicated 
caching layer just for such wiki uses. On the one hand, the set of all 
embedded queries is known upfront (so, in contrast to other uses, you 
already know which queries will be asked). On the other hand, users may 
wish to do a forced refresh. The main danger again seems to be bursts 
of activity (a page getting a lot of edits in a short time, and each 
edit invalidates the ParserCache and requires refetching query results). 
On the positive side, this specific usage of WDQS can pass its own 
caching parameters (which we can control), so if there is a caching 
layer in place, one could react to issues on short notice by being more 
conservative there than for other queries.
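
As a rough Python sketch of what "caching parameters which we can control"
could look like: the caller asks for a cache lifetime, and the caching
layer clamps it to per-source bounds that an operator can change at short
notice. The source names, the numbers and the reading of "more
conservative" as "raise the minimum lifetime during a burst" are all my
assumptions for illustration, not an existing WDQS feature.

```python
from dataclasses import dataclass


@dataclass
class Bounds:
    floor: int    # minimum cache lifetime in seconds
    ceiling: int  # maximum cache lifetime in seconds


# Hypothetical operator-controlled settings, one entry per kind of caller.
bounds = {
    "wiki-embed": Bounds(floor=60, ceiling=600),
    "default": Bounds(floor=0, ceiling=3600),
}


def effective_max_age(source, requested):
    """Clamp the lifetime a caller asks for to the bounds for its source."""
    b = bounds.get(source, bounds["default"])
    return max(b.floor, min(requested, b.ceiling))


normal = effective_max_age("wiki-embed", 30)

# During a burst of page edits, the operator raises the floor for wiki
# embedding only; other callers are unaffected.
bounds["wiki-embed"].floor = 300
during_burst = effective_max_age("wiki-embed", 30)
other = effective_max_age("some-api-client", 30)

print(normal, during_burst, other)
```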

The interesting thing about the wiki-embedding usage is that it requires 
quick propagation of changes. Scenario: a user visits a Wikipedia page 
with a map created from a query; the user finds an outdated item on the 
map; she goes to Wikidata to fix it, and refreshes (edits) the page to 
see the change. Now if she is too quick, the change will not have made 
it into the query result yet -- she could try in a minute or so. 
However, if we have a long caching period, her first query will have 
populated the cache and prevent the update from showing for the maximal 
amount of time (the whole cache period). This seems like a case where 
long caching would be rather bad for user experience.
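
The scenario can be played through with a small Python simulation. It
assumes a plain time-to-live cache, a user who reloads the page every 10
seconds, and an edit that reaches the query service 60 seconds after her
first reload; all of these numbers and names are made up for illustration.

```python
class TTLCache:
    """Keeps a result for ttl seconds after it was stored."""

    def __init__(self, ttl):
        self.ttl = ttl
        self._entries = {}  # key -> (stored_at, value)

    def get(self, key, now, compute):
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]  # still fresh according to the cache
        value = compute(now)
        self._entries[key] = (now, value)
        return value


def seconds_until_fix_visible(ttl, update_visible_at, step=10):
    """Reload every `step` seconds from t=0; return when the fix shows up."""

    def backend(now):
        # The query service only returns the fixed item once the edit
        # has propagated to it.
        return "fixed" if now >= update_visible_at else "outdated"

    cache = TTLCache(ttl)
    now = 0
    while cache.get("map-query", now, backend) != "fixed":
        now += step
    return now


no_cache = seconds_until_fix_visible(ttl=0, update_visible_at=60)
short_cache = seconds_until_fix_visible(ttl=300, update_visible_at=60)
long_cache = seconds_until_fix_visible(ttl=3600, update_visible_at=60)
print(no_cache, short_cache, long_cache)
```

In this toy model the too-early first reload stores the outdated result,
so the wait grows from the propagation delay to the full cache period.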

Markus
