Re: [Wikidata] SPARQL endpoint caching

17 Feb 2016

On 17.02.2016 at 09:54, Katie Filbert wrote:
 On Wed, Feb 17, 2016 at 9:39 AM, Markus Krötzsch
 <markus(a)semantic-mediawiki.org> wrote:

 On 17.02.2016 08:16, Stas Malyshev wrote:

 Hi!

 (2) Shouldn't BlazeGraph do the caching (too)? It knows how much a
 query costs to re-run and it could even know if a query is affected
 by a data

 BlazeGraph does a lot of caching, but it's limited by memory, and
 AFAIK it does not do whole-query caching (like MySQL does, for
 example) - which means if you run two big queries one after
 another, the latter could remove from the cache what the former put
 there. Its caching, AFAIK, is on a much lower level. Which is
 helpful too, since different queries share a lot of underlying
 data, but not exactly our case here.

 update (a cache might still be the same as a current result even
 after many data changes). Having several caching layers is useful, 
 but the more elaborate (query-structure dependent) caching 
 strategies should maybe be left to the database.

 I don't think Blazegraph does anything like resolving changes to
 see if query results changed; that sounds like a pretty hard thing
 to do in a triple store. You can manually store a specific query
 result, AFAIK, but that's just a form of writing data as I
 understand it, and may not be very scalable.

 Yes, in general this would be extremely hard. There are some easy 
 cases one could catch, but it is not clear how effective this
 would be for our load. I am just saying we should not try to build
 a query-aware caching strategy that would better be done on a lower
 level.

 The points (3)-(5) are based on guessing. As Magnus said, some
 analysis could help to confirm or refute this. On the other hand,
 caching should not focus only on current usage patterns, but also
 consider a bit what could happen in the future.

 Well, again the problem is that one use case that I think
 absolutely needs caching - namely, exporting data to graphs, maps,
 etc. deployed on wiki pages - is also the one not implemented yet,
 because we don't have a cache (not the only thing we need, but one
 of them), so we've got a chicken-and-egg problem here :) Of course,
 we can just choose something now based on an educated guess and
 change it later if it works badly. That's probably what we'll do.

 Yes, it is hard to predict what load this will create. The caching
 levels around Wikipedia prevent re-computation of the page on most
 page views, so maybe there would not actually be very many
 repeated requests for the same query coming from this side. The
 main danger again seems to be bursts of activity (a page getting a
 lot of edits in a short time, and each edit invalidates the
 ParserCache and requires refetching query results). On the positive
 side, this specific usage of WDQS can pass its own caching
 parameters (which we can control), so if there is a caching layer
 in place, one could react to issues on short notice by being more
 conservative there than for other queries.

 One option could be a dedicated caching layer just for such wiki
 uses. On the one hand, the set of all embedded queries is known
 upfront (so, in contrast to other uses, you already know which
 queries will be asked). On the other hand, users may wish to do a
 forced refresh.

 The interesting thing about the wiki-embedding usage is that it 
 requires quick propagation of changes. Scenario: a user visits a 
 Wikipedia page with a map created from a query; the user finds an 
 outdated item on the map; she goes to Wikidata to fix it, and 
 refreshes (edits) the page to see the change. Now if she is too 
 quick, the change will not have made it into the query result yet
 -- she could try in a minute or so. However, if we have a long
 caching period, her first query will have populated the cache and
 prevent the update from showing for the maximal amount of time (the
 whole cache period). This seems like a case where long caching
 would be rather bad for user experience.

 I think it would be nice if having a graph with a query on a page
 did not adversely affect the time it takes to save the page too
 much (e.g. if running the query takes 20 seconds...), and cached
 query results were reused instead. Not having such usage kill /
 overwhelm the query service is also important.

 If we incorporate entity usage or something like that, then maybe
 that could be used to handle cache invalidation in cases where
 something used in a query changed.
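
To make the entity-usage idea concrete, here is a minimal sketch. It
assumes each cached query result is stored together with the set of
entity IDs that appeared in it; the class and method names are
hypothetical and not part of Wikibase or WDQS. Note the limitation: it
only notices changes to entities already in a result, not edits that
would make a new entity match the query.

```python
from collections import defaultdict


class EntityUsageCache:
    """Hypothetical cache that invalidates query results by entity usage."""

    def __init__(self):
        # query text -> cached result rows
        self._results = {}
        # entity ID -> set of queries whose cached result used that entity
        self._queries_by_entity = defaultdict(set)

    def put(self, query, rows, used_entities):
        """Store a result and remember which entities it depends on."""
        self._results[query] = rows
        for entity_id in used_entities:
            self._queries_by_entity[entity_id].add(query)

    def get(self, query):
        """Return the cached rows, or None if nothing is cached."""
        return self._results.get(query)

    def entity_changed(self, entity_id):
        """Drop every cached result that used the changed entity.

        Returns the set of queries that were invalidated.
        """
        stale = self._queries_by_entity.pop(entity_id, set())
        for query in stale:
            self._results.pop(query, None)
        return stale
```

An edit to an item would then call `entity_changed` with that item's
ID, and only the affected queries would need to be re-run.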

 Cheers, Katie 

I believe that this could be solved most easily by not letting queries
be entered directly on wiki pages, but instead having separate pages
for them where one can examine the result, see when the query was last
run, and trigger a re-run. The page on the wiki embedding the query
would then be independent of the query service and would only use the
query result stored somewhere in the wiki.

This seems to be a very transparent way for the user to see the status
of the query because it provides a separate page to "manage" the
query. One could maybe also specify automatic run-intervals etc.
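
A rough sketch of what I mean, under the assumption that a "query
page" simply holds the query text, the last stored result, and the
time of the last run. All names here (StoredQuery, rerun, embed,
run_interval) are made up for illustration and do not correspond to
any existing MediaWiki or WDQS API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StoredQuery:
    """Hypothetical 'query page': query text plus its last stored result."""
    sparql: str
    # Seconds between automatic runs; None means manual re-runs only.
    run_interval: Optional[float] = None
    result: Optional[list] = None
    last_run: Optional[float] = None

    def rerun(self, run_query: Callable[[str], list], now: float) -> None:
        """Run the query against the query service and store the result."""
        self.result = run_query(self.sparql)
        self.last_run = now

    def is_due(self, now: float) -> bool:
        """True if the query has never run or its interval has elapsed."""
        if self.last_run is None:
            return True
        if self.run_interval is None:
            return False
        return now - self.last_run >= self.run_interval

    def embed(self) -> list:
        """What an embedding wiki page reads: only the stored result."""
        return [] if self.result is None else self.result


def fake_service(sparql: str) -> list:
    """Stand-in for the real query service, for demonstration only."""
    return [("Q42",)]


stored = StoredQuery(sparql="SELECT ?item WHERE { ?item wdt:P31 wd:Q5 } LIMIT 1",
                     run_interval=3600.0)
if stored.is_due(now=0.0):
    stored.rerun(fake_service, now=0.0)
rows_for_page = stored.embed()  # saving the wiki page never hits the service
```

The point of the design is that embed() never contacts the query
service, so saving or viewing the embedding page costs nothing there;
only an explicit or scheduled rerun() does.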

Best regards
Bene
