Is this the right place to ask this?
I run the wikidata author-disambiguator - https://tools.wmflabs.org/author-disambiguator/ - and since a few hours ago it seems to be constantly freezing. I've restarted it 3 times today already. It runs ok for a few minutes, but then at some point when I try to connect it just hangs forever. I've waited up to 30 minutes on a page that should respond in seconds, and gotten nothing back. This has worked fine up until today. It's using the php7.2 kubernetes image - did something change very recently? Or was there some other recent cloud/tools change that could cause this?
Arthur
Dear Arthur,
Sometimes the Toolforge services just freeze up; mine do all the time. Usually you just have to wait for them to come back and it's fine. If it doesn't get better, I would go to #wikimedia-cloud on Freenode IRC and ask there, saying "!help" to ping the admins. I hope my answer helps you!
On Tue, Dec 17, 2019 at 1:50 PM Arthur Smith arthurpsmith@gmail.com wrote:
> Is this the right place to ask this?
Sure! Our #wikimedia-cloud support channel on the Freenode IRC network is also a good place for this general class of question. You know that by now, though, as we talked there briefly.
> I run the wikidata author-disambiguator - https://tools.wmflabs.org/author-disambiguator/ - and since a few hours ago it seems to be constantly freezing. I've restarted it 3 times today already. It runs ok for a few minutes, but then at some point when I try to connect it just hangs forever. I've waited up to 30 minutes on a page that should respond in seconds, and gotten nothing back.
Does this page have dependencies on file system access? databases? external api calls?
> This has worked fine up until today. It's using the php7.2 kubernetes image - did something change very recently? Or was there some other recent cloud/tools change that could cause this?
There have been no major changes to the Kubernetes image or other Toolforge infrastructure that should affect PHP webservice performance. We did make some changes today to the ingress HTTPS server that handles routing requests for https://tools.wmflabs.org/<tool> to the backend Kubernetes pod or grid engine job. These changes should be transparent to you, however. The change adds support for routing requests to pods running on a new Kubernetes cluster that folks will learn more about in early January.
Bryan
Thanks! One response to your questions below...
On Tue, Dec 17, 2019 at 6:18 PM Bryan Davis bd808@wikimedia.org wrote:
> On Tue, Dec 17, 2019 at 1:50 PM Arthur Smith arthurpsmith@gmail.com wrote:
>> [...]
>> I run the wikidata author-disambiguator - https://tools.wmflabs.org/author-disambiguator/ - and since a few hours ago it seems to be constantly freezing. I've restarted it 3 times today already. It runs ok for a few minutes, but then at some point when I try to connect it just hangs forever. I've waited up to 30 minutes on a page that should respond in seconds, and gotten nothing back.
> Does this page have dependencies on file system access? databases? external api calls?
File system access only in the sense that it needs to read PHP page and library files, and the web server writes to log files. It does use its own database on the "tools" database server, but I experienced the problem even with pages that don't touch the database - even with just a bare "phpinfo()" call. I'm going to check the access logs to see if something might be tying up all the PHP FastCGI processes for some reason. That said, when I look at them via kubectl exec, the processes always seem to be asleep (going by /proc/xxx/status).
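A minimal sketch of that kind of check, assuming a Linux /proc filesystem inside the pod; the script name proc-states.php is hypothetical and not part of author-disambiguator. It just prints the State line from /proc/<pid>/status for every PHP process it can see, and could be run inside the container via kubectl exec:

<?php
// proc-states.php (hypothetical name, not part of the tool): print the
// State line from /proc/<pid>/status for every PHP process in the pod.
// Intended to be run inside the container, e.g.:
//   kubectl exec <pod> -- php proc-states.php
foreach (glob('/proc/[0-9]*/status') ?: [] as $statusFile) {
    $status = @file_get_contents($statusFile);
    if ($status === false) {
        continue; // the process may have exited, or we lack permission
    }
    // Only report processes whose name mentions php (php-cgi, php-fpm, ...)
    if (preg_match('/^Name:\s*(.*php.*)$/mi', $status, $name)
        && preg_match('/^State:\s*(.*)$/m', $status, $state)) {
        printf("pid %s\t%s\t%s\n",
            basename(dirname($statusFile)), trim($name[1]), trim($state[1]));
    }
}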
Normally, most of the time spent responding to requests on these pages goes to various Wikidata API calls, so if those were much slower for some reason that might also be a cause - but I wouldn't have thought just restarting the server would speed things up again (for a few minutes) if that was the problem. I can't do any more on it today, but I'll be looking into this more tomorrow (Wednesday) morning...
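For reference, here is a hedged sketch of how those API calls could be timed so that slow upstream responses show up in the error log rather than as a silent hang. This is not the tool's actual code; the helper name timedApiGet and the User-Agent string are invented for illustration.

<?php
// Hypothetical timing wrapper (not author-disambiguator code): log how long
// each Wikidata API call takes and enforce a hard timeout.
function timedApiGet(string $url): ?string {
    $start = microtime(true);
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT => 30, // fail after 30s instead of hanging forever
        CURLOPT_USERAGENT => 'author-disambiguator-debug/0.1',
    ]);
    $body = curl_exec($ch);
    $elapsed = microtime(true) - $start;
    error_log(sprintf('%s took %.2fs (curl error: %s)',
        $url, $elapsed, curl_error($ch) ?: 'none'));
    curl_close($ch);
    return $body === false ? null : $body;
}

// Example: a cheap Wikidata API request to compare against normal latency.
timedApiGet('https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q42&format=json');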
Arthur
Just to follow up on this - the service seems to have returned to normal for now. My best guess is that it was swamped with requests and just not able to keep up for a day or so. Unfortunately the access.log file was not being retained; I've restored that, so hopefully I'll have a better idea of what's going on the next time something like this happens. If anybody has good suggestions for how to monitor Toolforge web services in Kubernetes, I'd definitely be interested!
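One low-tech approach, offered as a sketch rather than an existing Toolforge facility: an external probe that fetches the tool's front page on a schedule and records the status and response time. The file name, log path, threshold, and User-Agent below are made-up examples.

<?php
// healthcheck.php (hypothetical): a crude external probe for the webservice.
// Run it on a schedule and watch the log (or the exit status) for slow or
// failed responses.
$url = 'https://tools.wmflabs.org/author-disambiguator/';
$slowThreshold = 5.0;   // seconds considered "too slow" for the front page

$ch = curl_init($url);
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_TIMEOUT => 20,
    CURLOPT_USERAGENT => 'author-disambiguator-healthcheck/0.1',
]);
$start = microtime(true);
$body = curl_exec($ch);
$elapsed = microtime(true) - $start;
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

file_put_contents('healthcheck.log',
    sprintf("%s status=%d time=%.2fs\n", date('c'), $httpCode, $elapsed),
    FILE_APPEND);

if ($body === false || $httpCode >= 500 || $elapsed > $slowThreshold) {
    exit(1); // non-zero exit so a wrapper script or scheduler can notice
}

Even a crude log like this would have shown when the slowdown started and whether responses were slow or failing outright.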
Arthur