Hi!
With Wikidata Query Service usage rising and more use cases being found, it is time to consider caching infrastructure for results, since queries are expensive. One of the questions I would like to solicit feedback on is the following:
Should the default SPARQL endpoint be cached or uncached? If cached, which default cache duration would be good for most users? The cache, of course, applies only to the results of the same (identical) query. Please also note that the following is not an implementation plan but rather an opinion poll; whatever we end up deciding, we will make an announcement with the actual plan before we do it.
Also, whichever default we choose, there should be a possibility to get both cached and uncached results. The question is which one you get when you access the endpoint with no options. So possible variants are:
1. query.wikidata.org/sparql is uncached; to get a cached result you use something like query.wikidata.org/sparql?cached=120 to get a result no older than 120 seconds. PRO: least surprise for default users. CON: relies on the goodwill of tool writers; if somebody doesn't know about the cache option and uses the same query heavily, we would have to ask them to use the parameter.
2. query.wikidata.org/sparql is cached for a short duration (e.g. 1 minute) by default; if you'd like a fresh result, you do something like query.wikidata.org/sparql?cached=0. If you're fine with an older result, you can use query.wikidata.org/sparql?cached=3600 and get a cached result if it's still in the cache, but by default you never get a result older than 1 minute. This of course assumes Varnish magic can do this; if not, the scheme has to be amended. PRO: performance improvement while keeping default results reasonably fresh. CON: it is not obvious that the result is not the freshest data but can be stale, so if you update something in Wikidata and query again within a minute, you can be surprised.
3. query.wikidata.org/sparql is cached for a long duration (e.g. hours) by default; if you'd like a fresher result you do something like query.wikidata.org/sparql?cache=120 to get a result no older than 2 minutes, or cache=0 if you want an uncached one. PRO: best performance improvement for most queries; works well with queries that display data that rarely changes, such as lists, etc. CON: for people not knowing about the cache option, it may be rather confusing to not be able to get up-to-date results. (A sketch of what such a client call could look like follows below.)
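To make the options above concrete, here is a rough sketch of what a client call could look like. Keep in mind the "cached" parameter is only a proposal from this mail, nothing like it exists yet; the rest is just a normal WDQS request.

import requests

QUERY = "SELECT ?cat WHERE { ?cat wdt:P31 wd:Q146 } LIMIT 5"

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={
        "query": QUERY,
        "format": "json",
        # hypothetical parameter from the options above:
        # accept a result up to 120s old; "0" would force a fresh run
        "cached": "120",
    },
    timeout=60,
)
resp.raise_for_status()
for row in resp.json()["results"]["bindings"]:
    print(row["cat"]["value"])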
So we'd like to hear - especially from current SPARQL endpoint users - what you think about these and which would work for you.
Also, for the users of the WDQS GUI - provided we have cached and uncached options, which one should the GUI return by default? Should it always be uncached? Performance there is not a major question - the traffic to the GUI is pretty low - but rather convenience. Of course, if you run a cached query from the GUI and the data is in the cache, you can get results much faster for some queries. OTOH, it may be important in many cases to be able to access the actual up-to-date content, not the cached version.
I also created a poll: https://phabricator.wikimedia.org/V8 so please feel free to vote for your favorite option.
OK, this letter is long enough already so I'll stop here and wait to hear what everybody's thinking.
Thanks in advance,
I have to say that I am dubious.
How often does *exactly* the same query get run within 2 minutes ?
Does the same query ever get run ?
The first thing to do, surely, is to create a hash for each query (or better, perhaps, something like a tinyurl, so that the lookup is reversible), record a timestamp for that hash each time the query is run, and then see, even over a period of a month, how many (if any) queries are being re-run, and if so how often.
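Something like this quick sketch would be enough to do the measurement, assuming access to the raw query log (the names here are made up for illustration):

import hashlib
import time
from collections import defaultdict

seen = defaultdict(list)   # query hash -> timestamps of runs

def log_query(sparql_text):
    h = hashlib.sha256(sparql_text.strip().encode("utf-8")).hexdigest()[:16]
    seen[h].append(time.time())
    return h

def rerun_count(window_seconds=120):
    """Number of runs that repeat an identical query within the window."""
    hits = 0
    for timestamps in seen.values():
        ts = sorted(timestamps)
        hits += sum(1 for a, b in zip(ts, ts[1:]) if b - a <= window_seconds)
    return hits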
I can imagine it's possible that particular tracking queries might be re-run (but probably (a) not every two minutes; and (b) not wanting the same result as last time).
Also perhaps queries with a published link might get re-run -- e.g. if somebody posts the link for a query-generated graph on Twitter and it gets a lot of re-tweets. (Or even just if Lydia posts it in the news of the week).
For queries like that, caching might well make sense (and save the server a potential slashdotting).
I'd guess there's probably only a very few queries like that though.
Possibly it's only worth caching a set of results if the same query has *already* been requested within the last n minutes ?
-- James
I agree, we should look at some actual traffic to see how many queries /could/ be cached in a 2/5/10/60 min window. Maybe remove the example queries from those numbers, to separate the "production" and testing usage. Also, look at query runtime; if only "cheap" queries would be cached, there is no point in caching.
If caching would lead to significant savings, option 2 sounds sensible. Some people will get upset if their results aren't up-to-the-second, and being able to shift the blame at "server defaults" would be convenient ;-)
Option 3 sounds bad, because everyone and their cousin will just add an override to their tools, to prevent hours-old data from being served to surprised users. WDQ has a ~10-15 min lag; that's about as much as people can stomach.
Once you run a query, you know both the runtime and the result size. Maybe expensive queries with a huge result set could be cached longer by default, and cheap/small queries not at all? If you expect your recent Wikidata edit to change the results from 3 to 4, you should see that ASAP; if the change would be 50,000 to 50,001, it seems less critical somehow.
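A rough sketch of the shape of that heuristic (the thresholds are invented, just to illustrate the idea):

def cache_ttl_seconds(runtime_seconds, result_rows):
    """Invented thresholds: cheap/small queries stay uncached,
    expensive or huge ones get a longer default TTL."""
    if runtime_seconds < 1 and result_rows < 1000:
        return 0        # cheap and small: always fresh
    if runtime_seconds < 10:
        return 60       # middling: cache for a minute
    return 3600         # expensive or huge: cache for an hour

# e.g. cache_ttl_seconds(0.2, 4) == 0; cache_ttl_seconds(25, 50001) == 3600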
Hi!
I agree, we should look at some actual traffic to see how many queries /could/ be cached in a 2/5/10/60 min window. Maybe remove the example queries from those numbers, to separate the "production" and testing usage. Also, look at query runtime; if only "cheap" queries would be cached, there is no point in caching.
Makes sense, but some of the use cases are not implemented yet, and I'm kind of scared of allowing them without caching - e.g. graph embedding - so it's hard to rely on past data.
Once you run a query, you know both the runtime and the result size. Maybe expensive queries with a huge result set could be cached longer by default, and cheap/small queries not at all? If you expect your recent Wikidata edit to change the results from 3 to 4, you should see that ASAP; if the change would be 50.000 to 50.001, it seems less critical somehow.
That sounds like a good idea, we'll need to check if Varnish allows us to do tricks like this...
Hi!
How often does *exactly* the same query get run within 2 minutes ?
Depends where the query is coming from. E.g. if there's a graph backed by a query, then a lot of people can be seeing the graph and running the query. Same if somebody publishes a link to some query, e.g. during a talk or in an article, and a bunch of people come to look at it. Depends on the use case. Some use cases - like graphs - we are just planning, so we can't really rely on statistics here.
I'd guess there's probably only a very few queries like that though.
Well, maybe - we don't really know yet. That's why I want to hear opinions on this :)
Hi,
some random comments:
(1) Are there any concrete cases of applications that need "super-up-to-date" results (where 120 sec is too old)? I do not currently run or foresee to run any such application. Moreover, I think that you have to allow for at least 60sec for an update to make it into the RDF database, so 120sec seems to be already very close to the freshness you could get at all. My applications would be fine with getting updates every 10min.
(2) Shouldn't BlazeGraph do the caching (too)? It knows how much a query costs to re-run and it could even know if a query is affected by a data update (a cache might still be the same as a current result even after many data changes). Having several caching layers is useful, but the more elaborate (query-structure dependent) caching strategies should maybe be left to the database.
(3) I suspect queries follow a long-tailish distribution (probably with some impurities), where a few queries are very frequent but most queries are rather rare. If this is true, then the caching should cut off the peak at the high end: the queries that run >100 or >1000 times per hour. This will already work well with a relatively short caching time. For example, with a 120sec caching time, a query can run at most 30 times per hour. You could go to 300sec as well for at most 12 times per hour. Any query that you cannot afford to run 12 times per hour might have problems with or without a cache.
(4) In addition to balancing regular use as in (3), caching can also be vital to catch sudden bursts of activity (a trending new Web application, a crawler that goes wild on another site, a developer who tries a new tool). Again, short caching intervals will be effective for this.
(5) I don't think you can get much benefit in caching costly, low-frequency queries. You would need a much longer caching interval to catch them, and would still only use the cache once or twice per query.
The points (3)-(5) are based on guessing. As Magnus said, some analysis could help to confirm or refute this. On the other hand, caching should not focus on current usage patterns only, but consider a bit what could happen in the future.
Cheers,
Markus
Hi!
(2) Shouldn't BlazeGraph do the caching (too)? It knows how much a query costs to re-run and it could even know if a query is affected by a data
BlazeGraph does a lot of caching, but it's limited by memory and AFAIK it does not do whole-query caching (like MySQL does, for example) - which means if you run two big queries one after another, the latter could remove from cache what the former put there. Its caching, AFAIK, is on a much lower level. Which is helpful too, since different queries share a lot of underlying data, but it's not exactly our case here.
update (a cache might still be the same as a current result even after many data changes). Having several caching layers is useful, but the more elaborate (query-structure dependent) caching strategies should maybe be left to the database.
I don't think Blazegraph does anything like resolving changes to see if query results changed; that sounds like a pretty hard thing to do in a triple store. You can manually store a specific query result AFAIK, but that's just a form of writing data as I understand it, and may not be very scalable.
The points (3)-(5) are based on guessing. As Magnus said, some analysis could help to confirm or refute this. On the other hand, caching should not just focus on current usage patterns only, but consider a bit what could happen in the future.
Well, again the problem is that one use case that I think absolutely needs caching - namely, exporting data to graphs, maps, etc. deployed on wiki pages - is also the one not implemented yet because we don't have a cache (not the only thing we need, but one of them), so we've got a chicken-and-egg problem here :) Of course, we can just choose something now based on an educated guess and change it later if it works badly. That's probably what we'll do.
Thanks,
On 17.02.2016 08:16, Stas Malyshev wrote:
Hi!
(2) Shouldn't BlazeGraph do the caching (too)? It knows how much a query costs to re-run and it could even know if a query is affected by a data
BlazeGraph does a lot of caching, but it's limited by the memory and it AFAIK does not do whole query caching (like mysql does, for example) - which means if you run two big queries one after another, the latter could remove from cache what the former put there. Its caching, AFAIK, is on much lower level. Which is helpful too since different queries share a lot of underlying data, but not exactly our case here.
update (a cache might still be the same as a current result even after many data changes). Having several caching layers is useful, but the more elaborate (query-structure dependent) caching strategies should maybe be left to the database.
I don't think Blazegraph does anything like resolving changes to see if query results changed, that sound like pretty hard thing to do in triple store. You can manually store specific query result AFAIK but that's just form of writing data as I understand and may not be very scalable.
Yes, in general this would be extremely hard. There are some easy cases one could catch, but it is not clear how effective this would be for our load. I am just saying we should not try to build a query-aware caching strategy that would better be done on a lower level.
The points (3)-(5) are based on guessing. As Magnus said, some analysis could help to confirm or refute this. On the other hand, caching should not just focus on current usage patterns only, but consider a bit what could happen in the future.
Well, again the problem is that one use case that I think absolutely needs caching - namely, exporting data to graphs, maps, etc. deployed on wiki pages - is also the one not implemented yet because we don't have cache (not only, but one of the things we need) so we've got chicken and egg problem here :) Of course, we can just choose something now based on educated guess and change it later if it works badly. That's probably what we'll do.
Yes, it is hard to predict what load this will create. The caching levels around Wikipedia prevent re-computation of the page on most page views, so maybe there would not actually be very many repeated requests for the same query coming from there. One option could be a dedicated caching layer just for such wiki uses. On the one hand, the set of all embedded queries is known upfront (so, in contrast to other uses, you already know which queries will be asked). On the other hand, users may wish to do a forced refresh from their side. The main danger again seems to be bursts of activity (a page getting a lot of edits in a short time, where each edit invalidates the ParserCache and requires refetching query results). On the positive side, this specific usage of WDQS can pass its own caching parameters (which we can control), so if there is a caching layer in place, one could react to issues on short notice by being more conservative there than for other queries.
The interesting thing about the wiki-embedding usage is that it requires quick propagation of changes. Scenario: a user visits a Wikipedia page with a map created from a query; the user finds an outdated item on the map; she goes to Wikidata to fix it, and refreshes (edits) the page to see the change. Now if she is too quick, the change will not have made it into the query result yet -- she could try in a minute or so. However, if we have a long caching period, her first query will have populated the cache and prevent the update from showing for the maximal amount of time (the whole cache period). This seems like a case where long caching would be rather bad for user experience.
Markus
I think it would be nice if having a graph with a query on a page does not adversely affect the time it takes to save the page too much (e.g. if running the query takes 20 seconds, we could reuse cached query results instead). Not having such usage kill / overwhelm the query service is also important.
If we incorporate entity usage or something like that, then maybe that could be used to handle cache invalidation in case something used in a query changed.
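Something like this naive sketch is what I have in mind (all names are invented) -- note it only catches entities that already appear in a cached result:

from collections import defaultdict

cache = {}                               # query hash -> cached result
queries_using_entity = defaultdict(set)  # entity ID -> query hashes

def store_result(query_hash, result, entity_ids):
    cache[query_hash] = result
    for qid in entity_ids:
        queries_using_entity[qid].add(query_hash)

def on_entity_edited(qid):
    # drop every cached query whose result mentioned the edited entity
    for query_hash in queries_using_entity.pop(qid, set()):
        cache.pop(query_hash, None)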
Cheers, Katie
Basically I have two use cases for the SPARQL endpoint: 1. concept finding for bot activities, 2. example/tutorial/show-case queries. Starting with the second: especially if it is for prototyping, an (extensive) caching time is totally acceptable to me and definitely worth it if it improves the overall performance of the endpoint.
In bot activities, currently the bot freezes when the last update of the WDQS exceeds 5 minutes. The main reason for using the WDQS in our bot efforts is concept resolution (i.e. does a concept and one or more of its properties already exist on WD). The chances that duplicate items are created within 5 minutes are slim, and if it happens they are easily fixed manually.
So if it would improve the performance or stability of the WDQS, I would certainly vote for option 3.
Would it be possible to implement a select box in the GUI where users can select the preferred caching time? Such a feature would show the existence of different caching times to new users.
Just my 2cts
Andra
On 17.02.2016 09:54, Katie Filbert wrote: ...
I think it would be nice if having a graph with query on a page does not too much adversely affect the time it takes to save a page. (e.g. if running the query takes 20 seconds..., and instead reuse cached query results) And not have such usage kill / overwhelm the query service, is also important.
If we incorporate entity usage or something like that, then maybe that could be used to handle cache invalidation in cases something used in a query changed.
This might be one of the more complex cache maintenance strategies that I had delegated to BlazeGraph above. It is not too hard to monitor objects in a query result for changes, but for cache invalidation to work reliably, you would also have to watch out for items that only become part of the result because of the changes. For example, a query for the largest cities would need to be updated if someone creates a new city (item) that is larger than all other cities. So you have to monitor all items to update the query, not just those used in the current result.
Markus
On 17.02.2016 09:54, Katie Filbert wrote:
I think it would be nice if having a graph with query on a page does not too much adversely affect the time it takes to save a page. (e.g. if running the query takes 20 seconds..., and instead reuse cached query results) And not have such usage kill / overwhelm the query service, is also important.
If we incorporate entity usage or something like that, then maybe that could be used to handle cache invalidation in cases something used in a query changed.
Cheers, Katie
I believe that this could be solved most easily by not letting queries be entered directly on wiki pages, but having separate pages for them where one can examine the result, see the last run of the query, and trigger a re-run. The page on the wiki embedding the query would then be independent from the query service and would only use the query result stored somewhere in the wiki.
This seems to be a very transparent way for the user to see the status of the query because it provides a separate page to "manage" the query. One could maybe also specify automatic run-intervals etc.
Best regards Bene
On Wed, Feb 17, 2016 at 7:16 AM Stas Malyshev smalyshev@wikimedia.org wrote:
Well, again the problem is that one use case that I think absolutely needs caching - namely, exporting data to graphs, maps, etc. deployed on wiki pages - is also the one not implemented yet because we don't have cache (not only, but one of the things we need) so we've got chicken and egg problem here :) Of course, we can just choose something now based on educated guess and change it later if it works badly. That's probably what we'll do.
Wouldn't those use cases be wrapped in an extension or WMF-controlled JavaScript? In that case, queries could always indicate that use, and they could be cached, for hours if need be. No reason to put everything behind a long cache by default, just because of those controllable cases.
Also, what about creating more independent blazegraph instances? One (or more) could be for wiki extension queries, with long cache; others could be for Labs use (internal network only?), a "general" external-facing server, etc.
If the problem (and it's not even certain we have one) can be mitigated or solved with throwing a few more VMs at it, I'm all for it :-)
On 17.02.2016 10:34, Magnus Manske wrote:
Wouldn't those usecases be wrapped in an extension or WMF-controlled JavaScript? In that case, queries could always indicate that use, and they could be cached, for hours if need be. No reason to put everything behind a long cache by default, just because of those controllable cases.
Also, what about creating more independent blazegraph instances? One (or more) could be for wiki extension queries, with long cache; others could be for Labs use (internal network only?), a "general" external-facing server, etc.
If the problem (and it's not even certain we have one) can be mitigated or solved with throwing a few more VMs at it, I'm all for it :-)
+1 for adding some servers before building complicated caching solutions
I think long caching periods for wiki queries could lead to user frustration for the reasons I gave in my other post. But maybe one can simply give the user a way to say "please recompute this query now" to avoid this.
Another thing one could do for wiki-based (and other) queries is to use caches as fallbacks in case of timeouts: "we will try our best to give you a fresh result, but if current load is too high, we will at least give you some older result." This makes most sense for wiki-based queries which repeat reliably over time (so it makes sense to keep one result, however old it is, for all queries still used on some wiki page). Would be more work to implement, so probably not the first thing to do without any real wiki usage experiences yet.
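Client-side, the fallback idea would look roughly like this sketch (server-side it would of course live in the caching layer; the helper names are invented):

import requests

stale_cache = {}   # query text -> last successfully fetched result

def query_with_stale_fallback(sparql_text, deadline_seconds=10):
    try:
        resp = requests.get(
            "https://query.wikidata.org/sparql",
            params={"query": sparql_text, "format": "json"},
            timeout=deadline_seconds,
        )
        resp.raise_for_status()
        result = resp.json()
        stale_cache[sparql_text] = result   # remember the freshest copy we got
        return result, "fresh"
    except requests.RequestException:
        if sparql_text in stale_cache:
            return stale_cache[sparql_text], "stale"
        raise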
Markus
If you add a proxy cache like Varnish in front of the endpoint, it will cache based on the Cache-Control: max-age and ETag headers sent by the endpoint, which I guess can be configured. But you can also PURGE and BAN specific cache entries from Varnish to force fresh retrieval.
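For illustration, standard HTTP caching already gives clients a way to inspect and revalidate results. Whether the endpoint actually emits these headers depends on how it is configured, so treat this only as a sketch of the mechanism:

import requests

url = "https://query.wikidata.org/sparql"
params = {"query": "SELECT ?cat WHERE { ?cat wdt:P31 wd:Q146 } LIMIT 5",
          "format": "json"}

resp = requests.get(url, params=params)
# These headers are what a Varnish layer would key on, if the endpoint sends them:
print(resp.headers.get("Cache-Control"), resp.headers.get("Age"), resp.headers.get("ETag"))

etag = resp.headers.get("ETag")
if etag:
    # 304 Not Modified means the cached copy is still valid, no re-run needed
    again = requests.get(url, params=params, headers={"If-None-Match": etag})
    print(again.status_code)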
On 17/02/2016 06:48, Markus Krötzsch wrote:
some random comments:
(1) Are there any concrete cases of applications that need "super-up-to-date" results (where 120 sec is too old)? I do not currently run or foresee to run any such application. Moreover, I think that you have to allow for at least 60sec for an update to make it into the RDF database, so 120sec seems to be already very close to the freshness you could get at all. My applications would be fine with getting updates every 10min.
Personally, I have quite often used WDQS to generate lists of items needing to be fixed on Wikidata.
Having then done some fixes (typically by hand), I'll then re-run the query to see what still needs to be done.
At this point it's quite frustrating if the database is lagging -- what I want is an up-to-date representation of what still needs to be fixed; or whether everything is now done.
So for this kind of use, the quicker an edit gets propagated to the search results the better.
That said, I'm okay to put up with some occasional lag -- for example, if I know the lag is ten minutes, I can go away and make a cup of coffee, or check the Wikidata email list, or see wherever the latest "knowledge engine" paranoia has got to. But (for this kind of use anyway), more than the occasional ten-minute delay starts to get annoying. (Which is why big props are due for how responsive the SPARQL service has usually been to recent edits.)
How relevant this mode of use is for caching I am not sure, because typically I'd do a certain amount of editing before re-running the query.
But possibly if I found there was one edit I had missed, made the edit, then re-ran the query to see if I'd finally got the output to look all just as it should -- that might happen within a 120 second turnaround; so one would want at least to be able to purge the results and re-run.
-- James.
Well, another use case for nearly-immediate updates:
I'll do a presentation next week, in which I intend to demonstrate that I can add a Wikidata value online, which then is available immediately for my application - as well as for the whole rest of the world. (In Library Land, that's a real blast, because business processes related to authority data often take weeks or months ...)
That is a rather exotic and very infrequent use. Similar to James' use case (if I didn't get him wrong), it is not necessary to run these kinds of queries in production-strength settings. Perhaps a current, un-cached "experimental" / "unstable" endpoint could serve these kinds of uses, too.
Cheers, Joachim
Hi!
I'll do a presentation next week, in which I intend to demonstrate that I can add a Wikidata value online, which then is available immediately for my application - as well as for the whole rest of the world. (In Library Land, that's a real blast, because business processes related to authority data often take weeks or month ...)
I think we'll always have some way to run an un-cached query. The question is only how easy it would be - i.e. would you need to add a parameter, click a checkbox, etc.