SPARQL service timeouts - Wikidata

Stas Malyshev

18 Apr 18 Apr

7:40 p.m.

Hi!

I have the impression that some not-so-easy SPARQL queries that used to run just below the timeout are now timing out regularly. Has there been a change in the setup that may have caused this, or are we maybe seeing increased query traffic [1]?

We've recently run on a single server for couple of days due to reloading of the second one, so this may have made it a bit slower. But that should be gone now, we're back to two. Other than that, not seeing anything abnormal in https://grafana.wikimedia.org/dashboard/db/wikidata-query-service

...

[1] The deadline for the Int. Semantic Web Conf. is coming up, so it might be that someone is running experiments on the system to get their paper finished. It has been observed for other endpoints that traffic increases at such times. This community sometimes is the greatest enemy of its own technology ... (I recently had to IP-block an RDF crawler from one of my sites after it had ignored robots.txt completely).

We don't have any blocks or throttle mechanisms right now. But if we see somebody making serious negative impact on the service, we may have to change that. -- Stas Malyshev smalyshev(a)wikimedia.org

Reply

Markus Kroetzsch

8:21 p.m.

On 18.04.2016 21:56, Markus Kroetzsch wrote:

...

Thanks, the dashboard is interesting. I am trying to run this query: SELECT ?subC ?supC WHERE { ?subC p:P279/ps:P279 ?supC } It is supposed to return a large result set. But I am only running it once per week. It used to work fine, but today I could not get it to succeed a single time.

Actually, the query seems to work as it should. I am investigating why I get an error in some cases on my machine. Markus

...

On 18.04.2016 21:40, Stas Malyshev wrote:

Hi!

I have the impression that some not-so-easy SPARQL queries that used to run just below the timeout are now timing out regularly. Has there been a change in the setup that may have caused this, or are we maybe seeing increased query traffic [1]?

We've recently run on a single server for couple of days due to reloading of the second one, so this may have made it a bit slower. But that should be gone now, we're back to two. Other than that, not seeing anything abnormal in https://grafana.wikimedia.org/dashboard/db/wikidata-query-service

[1] The deadline for the Int. Semantic Web Conf. is coming up, so it might be that someone is running experiments on the system to get their paper finished. It has been observed for other endpoints that traffic increases at such times. This community sometimes is the greatest enemy of its own technology ... (I recently had to IP-block an RDF crawler from one of my sites after it had ignored robots.txt completely).

We don't have any blocks or throttle mechanisms right now. But if we see somebody making serious negative impact on the service, we may have to change that.

-- Markus Kroetzsch Faculty of Computer Science Technische Universität Dresden +49 351 463 38486 http://korrekt.org/

Reply

Addshore

19 Apr 19 Apr

9:05 a.m.

In the case we are discussing here the truncated JSON is caused by blaze graph deciding it has been sending data for too long and then stopping (as I understand). Thus you will only see a spike on the graph for the amount of data actually sent from the server, not the size of the result blazegraph was trying to send back. I also ran into this with some simple queries that returned big sets of data. Although with my issue I did actually also see a Java exception somewhere. On 18 April 2016 at 21:51, Markus Kroetzsch <markus.kroetzsch(a)tu-dresden.de> wrote:

...

On 18.04.2016 22:21, Markus Kroetzsch wrote:

On 18.04.2016 21:56, Markus Kroetzsch wrote:

Thanks, the dashboard is interesting. I am trying to run this query: SELECT ?subC ?supC WHERE { ?subC p:P279/ps:P279 ?supC } It is supposed to return a large result set. But I am only running it once per week. It used to work fine, but today I could not get it to succeed a single time.

Actually, the query seems to work as it should. I am investigating why I get an error in some cases on my machine.

Ok, I found that this is not so easy to reproduce reliably. The symptom I am seeing is a truncated JSON response, which just stops in the middle of the data (at a random location, but usually early on), and which is *not* followed by any error message. The stream just ends. So far, I could only get this in Java, not in Python, and it does not always happen. If successful, the result is about 250M in size. The following Python script can retrieve it: import requests SPARQL_SERVICE_URL = 'https://query.wikidata.org/sparql' query = """SELECT ?subC ?supC WHERE { ?subC p:P279/ps:P279 ?supC }""" print requests.get(SPARQL_SERVICE_URL, params={'query': query, 'format': 'json'}).text (output should be redirected to a file) I will keep an eye on the issue, but I don't know how to debug this any further now, since it started to work without me changing any code. I also wonder how to read the dashboard after all. In spite of me repeating an experiment that creates a 250M result file for five times in the past few minutes, the "Bytes out" figure remains below a few MB for most of the time. Markus

On 18.04.2016 21:40, Stas Malyshev wrote:

Hi! I have the impression that some not-so-easy SPARQL queries that used to > run just below the timeout are now timing out regularly. Has there been > a change in the setup that may have caused this, or are we maybe seeing > increased query traffic [1]? > We've recently run on a single server for couple of days due to reloading of the second one, so this may have made it a bit slower. But that should be gone now, we're back to two. Other than that, not seeing anything abnormal in https://grafana.wikimedia.org/dashboard/db/wikidata-query-service [1] The deadline for the Int. Semantic Web Conf. is coming up, so it > might be that someone is running experiments on the system to get their > paper finished. It has been observed for other endpoints that traffic > increases at such times. This community sometimes is the greatest enemy > of its own technology ... (I recently had to IP-block an RDF crawler > from one of my sites after it had ignored robots.txt completely). > We don't have any blocks or throttle mechanisms right now. But if we see somebody making serious negative impact on the service, we may have to change that.

-- Markus Kroetzsch Faculty of Computer Science Technische Universität Dresden +49 351 463 38486 http://korrekt.org/ _______________________________________________ Wikidata mailing list Wikidata(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata

-- Addshore

Reply

Markus Krötzsch

9:42 a.m.

On 19.04.2016 11:33, Addshore wrote:

...

Also per https://phabricator.wikimedia.org/T126730 and https://gerrit.wikimedia.org/r/#/c/274864/8 requests to the query service are now cached for 60 seconds. I expect this will include error results from timeouts so retrying a request within the same 60 seconds as the first won't event reach the WDQS servers now.

Maybe this could be the answer. Is it possible that the cache stores the truncated result but not the Java exception? Then the behaviour could be a timeout which just is not reported properly. Ideally, partial results should not be cached or the "timeout" should be cached so that a renewed request (in 60sec) returns an immediate timeout rather than a broken result set. Cheers, Markus

...

On 19 April 2016 at 10:05, Addshore <addshorewiki(a)gmail.com <mailto:addshorewiki@gmail.com>> wrote: In the case we are discussing here the truncated JSON is caused by blaze graph deciding it has been sending data for too long and then stopping (as I understand). Thus you will only see a spike on the graph for the amount of data actually sent from the server, not the size of the result blazegraph was trying to send back. I also ran into this with some simple queries that returned big sets of data. Although with my issue I did actually also see a Java exception somewhere. On 18 April 2016 at 21:51, Markus Kroetzsch <markus.kroetzsch(a)tu-dresden.de <mailto:markus.kroetzsch@tu-dresden.de>> wrote: On 18.04.2016 22:21, Markus Kroetzsch wrote: On 18.04.2016 21:56, Markus Kroetzsch wrote: Thanks, the dashboard is interesting. I am trying to run this query: SELECT ?subC ?supC WHERE { ?subC p:P279/ps:P279 ?supC } It is supposed to return a large result set. But I am only running it once per week. It used to work fine, but today I could not get it to succeed a single time. Actually, the query seems to work as it should. I am investigating why I get an error in some cases on my machine. Ok, I found that this is not so easy to reproduce reliably. The symptom I am seeing is a truncated JSON response, which just stops in the middle of the data (at a random location, but usually early on), and which is *not* followed by any error message. The stream just ends. So far, I could only get this in Java, not in Python, and it does not always happen. If successful, the result is about 250M in size. The following Python script can retrieve it: import requests SPARQL_SERVICE_URL = 'https://query.wikidata.org/sparql' query = """SELECT ?subC ?supC WHERE { ?subC p:P279/ps:P279 ?supC }""" print requests.get(SPARQL_SERVICE_URL, params={'query': query, 'format': 'json'}).text (output should be redirected to a file) I will keep an eye on the issue, but I don't know how to debug this any further now, since it started to work without me changing any code. I also wonder how to read the dashboard after all. In spite of me repeating an experiment that creates a 250M result file for five times in the past few minutes, the "Bytes out" figure remains below a few MB for most of the time. Markus On 18.04.2016 21:40, Stas Malyshev wrote: Hi! I have the impression that some not-so-easy SPARQL queries that used to run just below the timeout are now timing out regularly. Has there been a change in the setup that may have caused this, or are we maybe seeing increased query traffic [1]? We've recently run on a single server for couple of days due to reloading of the second one, so this may have made it a bit slower. But that should be gone now, we're back to two. Other than that, not seeing anything abnormal in https://grafana.wikimedia.org/dashboard/db/wikidata-query-service [1] The deadline for the Int. Semantic Web Conf. is coming up, so it might be that someone is running experiments on the system to get their paper finished. It has been observed for other endpoints that traffic increases at such times. This community sometimes is the greatest enemy of its own technology ... (I recently had to IP-block an RDF crawler from one of my sites after it had ignored robots.txt completely). We don't have any blocks or throttle mechanisms right now. But if we see somebody making serious negative impact on the service, we may have to change that. -- Markus Kroetzsch Faculty of Computer Science Technische Universität Dresden +49 351 463 38486 <tel:%2B49%20351%20463%2038486> http://korrekt.org/ _______________________________________________ Wikidata mailing list Wikidata(a)lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org> https://lists.wikimedia.org/mailman/listinfo/wikidata -- Addshore -- Addshore _______________________________________________ Wikidata mailing list Wikidata(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata

Reply

Guillaume Lederrey

10:33 a.m.

I'm late to the game, but a quick look into the nginx logs does not show all that much. I see a few connection refused, but that should translate in an HTTP 502 error, not in a partial answer. I'm really not good at reading VCL, but it seems that we do have some rules in our Varnish config to cache pages in error. This would make sense as pages in error tend to be expensive, so we probably want to ensure the same error is capped at a maximum rate. I'll keep looking. But transient errors are hard... On Tue, Apr 19, 2016 at 11:44 AM, Addshore <addshorewiki(a)gmail.com> wrote:

...

Yes the size reported there will be the compressed size, so actual bytes over the port! Looking at the patch further it looks like some nginx settings were changed while caching was enabled that may also be worth looking at. On 19 April 2016 at 10:42, Markus Krötzsch <markus(a)semantic-mediawiki.org> wrote:

On 19.04.2016 11:33, Addshore wrote:

Also per https://phabricator.wikimedia.org/T126730 and https://gerrit.wikimedia.org/r/#/c/274864/8 requests to the query service are now cached for 60 seconds. I expect this will include error results from timeouts so retrying a request within the same 60 seconds as the first won't event reach the WDQS servers now.

Maybe this could be the answer. Is it possible that the cache stores the truncated result but not the Java exception? Then the behaviour could be a timeout which just is not reported properly. Ideally, partial results should not be cached or the "timeout" should be cached so that a renewed request (in 60sec) returns an immediate timeout rather than a broken result set. Cheers, Markus

On 19 April 2016 at 10:05, Addshore <addshorewiki(a)gmail.com <mailto:addshorewiki@gmail.com>> wrote: In the case we are discussing here the truncated JSON is caused by blaze graph deciding it has been sending data for too long and then stopping (as I understand). Thus you will only see a spike on the graph for the amount of data actually sent from the server, not the size of the result blazegraph was trying to send back. I also ran into this with some simple queries that returned big sets of data. Although with my issue I did actually also see a Java exception somewhere. On 18 April 2016 at 21:51, Markus Kroetzsch <markus.kroetzsch(a)tu-dresden.de <mailto:markus.kroetzsch@tu-dresden.de>> wrote: On 18.04.2016 22:21, Markus Kroetzsch wrote: On 18.04.2016 21:56, Markus Kroetzsch wrote: Thanks, the dashboard is interesting. I am trying to run this query: SELECT ?subC ?supC WHERE { ?subC p:P279/ps:P279 ?supC } It is supposed to return a large result set. But I am only running it once per week. It used to work fine, but today I could not get it to succeed a single time. Actually, the query seems to work as it should. I am investigating why I get an error in some cases on my machine. Ok, I found that this is not so easy to reproduce reliably. The symptom I am seeing is a truncated JSON response, which just stops in the middle of the data (at a random location, but usually early on), and which is *not* followed by any error message. The stream just ends. So far, I could only get this in Java, not in Python, and it does not always happen. If successful, the result is about 250M in size. The following Python script can retrieve it: import requests SPARQL_SERVICE_URL = 'https://query.wikidata.org/sparql' query = """SELECT ?subC ?supC WHERE { ?subC p:P279/ps:P279 ?supC }""" print requests.get(SPARQL_SERVICE_URL, params={'query': query, 'format': 'json'}).text (output should be redirected to a file) I will keep an eye on the issue, but I don't know how to debug this any further now, since it started to work without me changing any code. I also wonder how to read the dashboard after all. In spite of me repeating an experiment that creates a 250M result file for five times in the past few minutes, the "Bytes out" figure remains below a few MB for most of the time. Markus On 18.04.2016 21:40, Stas Malyshev wrote: Hi! I have the impression that some not-so-easy SPARQL queries that used to run just below the timeout are now timing out regularly. Has there been a change in the setup that may have caused this, or are we maybe seeing increased query traffic [1]? We've recently run on a single server for couple of days due to reloading of the second one, so this may have made it a bit slower. But that should be gone now, we're back to two. Other than that, not seeing anything abnormal in https://grafana.wikimedia.org/dashboard/db/wikidata-query-service [1] The deadline for the Int. Semantic Web Conf. is coming up, so it might be that someone is running experiments on the system to get their paper finished. It has been observed for other endpoints that traffic increases at such times. This community sometimes is the greatest enemy of its own technology ... (I recently had to IP-block an RDF crawler from one of my sites after it had ignored robots.txt completely). We don't have any blocks or throttle mechanisms right now. But if we see somebody making serious negative impact on the service, we may have to change that. -- Markus Kroetzsch Faculty of Computer Science Technische Universität Dresden +49 351 463 38486 <tel:%2B49%20351%20463%2038486> http://korrekt.org/ _______________________________________________ Wikidata mailing list Wikidata(a)lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org> https://lists.wikimedia.org/mailman/listinfo/wikidata -- Addshore -- Addshore _______________________________________________ Wikidata mailing list Wikidata(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata

_______________________________________________ Wikidata mailing list Wikidata(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata

-- Addshore _______________________________________________ Wikidata mailing list Wikidata(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata

-- Guillaume Lederrey Operations Engineer, Discovery Wikimedia Foundation

Reply

Markus Krötzsch

9:39 a.m.

On 19.04.2016 11:05, Addshore wrote:

...

In the case we are discussing here the truncated JSON is caused by blaze graph deciding it has been sending data for too long and then stopping (as I understand). Thus you will only see a spike on the graph for the amount of data actually sent from the server, not the size of the result blazegraph was trying to send back.

I successfully got five files of 250M JSON each, but even those successful queries did not show up in the stats. The five files had three different versions (slightly different sizes) so they did not all come from a common cache either. Maybe the size is counted in terms of compressed or otherwise "raw" results?

...

I also ran into this with some simple queries that returned big sets of data. Although with my issue I did actually also see a Java exception somewhere.

I know the case where large result sets end in a Java timeout exception. This occurs reproducibly when you retrieve all humans or something like that. However, in my case, the behaviour is not always reproducible and there is no Java exception at the end of the output; it just stops in the middle of the file. Markus

...

On 18 April 2016 at 21:51, Markus Kroetzsch <markus.kroetzsch(a)tu-dresden.de <mailto:markus.kroetzsch@tu-dresden.de>> wrote: On 18.04.2016 22:21, Markus Kroetzsch wrote: On 18.04.2016 21:56, Markus Kroetzsch wrote: Thanks, the dashboard is interesting. I am trying to run this query: SELECT ?subC ?supC WHERE { ?subC p:P279/ps:P279 ?supC } It is supposed to return a large result set. But I am only running it once per week. It used to work fine, but today I could not get it to succeed a single time. Actually, the query seems to work as it should. I am investigating why I get an error in some cases on my machine. Ok, I found that this is not so easy to reproduce reliably. The symptom I am seeing is a truncated JSON response, which just stops in the middle of the data (at a random location, but usually early on), and which is *not* followed by any error message. The stream just ends. So far, I could only get this in Java, not in Python, and it does not always happen. If successful, the result is about 250M in size. The following Python script can retrieve it: import requests SPARQL_SERVICE_URL = 'https://query.wikidata.org/sparql' query = """SELECT ?subC ?supC WHERE { ?subC p:P279/ps:P279 ?supC }""" print requests.get(SPARQL_SERVICE_URL, params={'query': query, 'format': 'json'}).text (output should be redirected to a file) I will keep an eye on the issue, but I don't know how to debug this any further now, since it started to work without me changing any code. I also wonder how to read the dashboard after all. In spite of me repeating an experiment that creates a 250M result file for five times in the past few minutes, the "Bytes out" figure remains below a few MB for most of the time. Markus On 18.04.2016 21:40, Stas Malyshev wrote: Hi! I have the impression that some not-so-easy SPARQL queries that used to run just below the timeout are now timing out regularly. Has there been a change in the setup that may have caused this, or are we maybe seeing increased query traffic [1]? We've recently run on a single server for couple of days due to reloading of the second one, so this may have made it a bit slower. But that should be gone now, we're back to two. Other than that, not seeing anything abnormal in https://grafana.wikimedia.org/dashboard/db/wikidata-query-service [1] The deadline for the Int. Semantic Web Conf. is coming up, so it might be that someone is running experiments on the system to get their paper finished. It has been observed for other endpoints that traffic increases at such times. This community sometimes is the greatest enemy of its own technology ... (I recently had to IP-block an RDF crawler from one of my sites after it had ignored robots.txt completely). We don't have any blocks or throttle mechanisms right now. But if we see somebody making serious negative impact on the service, we may have to change that. -- Markus Kroetzsch Faculty of Computer Science Technische Universität Dresden +49 351 463 38486 <tel:%2B49%20351%20463%2038486> http://korrekt.org/ _______________________________________________ Wikidata mailing list Wikidata(a)lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org> https://lists.wikimedia.org/mailman/listinfo/wikidata -- Addshore _______________________________________________ Wikidata mailing list Wikidata(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata

Reply