Hi all,
At around 1 PM UTC today (Sep 3) we started experiencing stability issues with WDQS, localized (at least for the moment) to one of our two datacenters. So far we haven't been able to pinpoint the cause. We suspect that someone is running a query that adversely affects Blazegraph, which has happened a few times in the past. Unfortunately, our usual tactics did not help us find which one.
We are working on identifying the issue, but it's clear that it could bring the service down within a few hours, so we are also working on a quick workaround. Since we observed that the issue only causes actual service failures about 2 hours after a restart, for now we are going to introduce a procedure that restarts servers randomly, so that the uptime of each is at most around 1 hour. Only one server should be restarted at any given time. This will cause some queries to be killed whenever a server is restarted, but the alternative is worse.
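To illustrate the idea only (this is not the actual procedure we are deploying), here is a minimal Python sketch of how such a restart schedule could be planned. It assumes the restart window is split evenly across the servers and that a single restart finishes well within one slot. The function name, the parameters, and the host names in the usage note are hypothetical.

```python
import random

MAX_UPTIME_S = 3600  # target maximum uptime per server (~1h, as described above)


def plan_restarts(hosts, max_uptime_s=MAX_UPTIME_S, seed=None):
    """Return a list of (offset_seconds, host) pairs for one restart cycle.

    Hosts are shuffled into a random order and spread evenly over the
    window, so at most one restart is scheduled per slot. Repeating the
    same plan every cycle restarts each host once per max_uptime_s.
    (Reshuffling on every cycle would weaken that bound: a host restarted
    first in one cycle and last in the next stays up almost twice as long.)
    """
    if not hosts:
        return []
    rng = random.Random(seed)  # seed is only for reproducible illustration
    order = list(hosts)
    rng.shuffle(order)
    slot = max_uptime_s / len(order)  # time between consecutive restarts
    return [(i * slot, host) for i, host in enumerate(order)]
```

For example, plan_restarts(["host-a", "host-b", "host-c", "host-d"]) yields four restarts spaced 15 minutes apart, in random order. The sketch only plans the schedule; actually restarting a server and confirming it is healthy before moving on are deliberately left out.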
We'll continue working to find the root cause and will keep you informed of our progress. We will also post updates here: [1].
Regards, Zbyszko Papierski
[1] https://phabricator.wikimedia.org/T290330