Just to follow up on this - the service seems to have returned to normal for now. My best guess is that it was swamped with requests and just not able to keep up for a day or so there. Unfortunately the access.log file was not being retained; I've restored that, so hopefully I'll have a better idea of what's going on the next time something like this happens. If anybody has good suggestions for how to monitor Toolforge web services in Kubernetes I'd definitely be interested!
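
On the monitoring question, one low-tech starting point is an external probe that requests a page with a hard timeout and records how long it took, so a hang shows up as a failed probe instead of a browser waiting forever. This is only a minimal sketch using the Python standard library; the function name `probe` and the 10 second default are assumptions for illustration, not part of any Toolforge tooling.

```python
import time
import urllib.error
import urllib.request


def probe(url, timeout=10.0):
    """Fetch url once and report status and elapsed time (hypothetical helper)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()  # read the body so a stall mid-response also counts
            status = resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx/5xx).
        status = err.code
    except OSError as err:
        # Covers timeouts, refused connections and DNS failures.
        return {
            "ok": False,
            "status": None,
            "seconds": time.monotonic() - start,
            "error": str(err),
        }
    return {
        "ok": status < 500,
        "status": status,
        "seconds": time.monotonic() - start,
        "error": None,
    }
```

Calling something like `probe("https://tools.wmflabs.org/author-disambiguator/")` every few minutes from cron and appending the result to a file would at least give a timeline of when the hangs start.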

   Arthur

On Tue, Dec 17, 2019 at 9:21 PM Arthur Smith <arthurpsmith@gmail.com> wrote:
Thanks! One response to your questions below...

On Tue, Dec 17, 2019 at 6:18 PM Bryan Davis <bd808@wikimedia.org> wrote:
On Tue, Dec 17, 2019 at 1:50 PM Arthur Smith <arthurpsmith@gmail.com> wrote:
> [...]

> I run the Wikidata author-disambiguator - https://tools.wmflabs.org/author-disambiguator/ - and since a few hours ago it seems to be constantly freezing. I've restarted it 3 times today already. It runs ok for a few minutes, but then at some point when I try to connect it just hangs forever. I've waited up to 30 minutes on a page that should respond in seconds, and gotten nothing back.

Does this page have dependencies on file system access? databases?
external api calls?


File system access only in the sense that it needs to read PHP page and library files, and the web server writes to log files. It does use its own database in the "tools" database server, but I experienced the problem even with pages that don't touch the database - even with just a bare "phpinfo()" call. I'm going to check access logs to see if something might be tying up all the PHP FastCGI processes for some reason. The odd thing is that when I look at them via kubectl exec, the processes always seem to be asleep (looking at /proc/xxx/status).
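
The /proc check described above can be scripted so that the state of every worker is tallied at once. A minimal sketch, assuming the standard Linux /proc/<pid>/status layout (one "Key: value" pair per line, including a "State:" line); the function names are hypothetical.

```python
from collections import Counter


def parse_proc_status(text):
    """Turn the 'Key: value' lines of a /proc/<pid>/status file into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def state_code(status_text):
    """Return the one-letter process state, e.g. 'S' (sleeping) or 'R' (running)."""
    state = parse_proc_status(status_text).get("State", "")
    return state.split()[0] if state else "?"


def summarize_states(status_texts):
    """Count how many processes are in each state."""
    return Counter(state_code(text) for text in status_texts)
```

The input text could come from something like `kubectl exec <pod> -- cat /proc/<pid>/status`, with the pod name and pid filled in. Note that "S" is interruptible sleep, which also covers a process waiting on a socket, so workers that all show as sleeping may still be blocked waiting on something.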

Normally, most of the time spent responding to requests for these pages goes to various Wikidata API calls, so if those were much slower for some reason that might also be a cause - but if that were the problem, I wouldn't have expected just restarting the server to speed things up again (for a few minutes). I can't do any more on it today but will be looking into this more tomorrow (Wednesday) morning...

    Arthur