On Tue, Jan 7, 2020 at 4:26 PM Maciej Jaros egil@wp.pl wrote:
> The problem is I didn't shut it down. So what did?
Routine maintenance on the grid engine nodes. The shutdown timestamps line up with this SAL entry [0]: "22:58 <bd808> Depooling tools-sgewebgrid-lighttpd-090[2-9]".
The depooling process is intended to restart running webservice workloads on new nodes in the cluster, but apparently in this case it did not. Sadly, this is not terribly surprising: Grid Engine is not very good at tracking system state compared to the Kubernetes cluster in Toolforge.
If your tool is capable of running on our Kubernetes system (that is, it uses a single language runtime and does not rely on special software installed globally), then migrating from Grid Engine to Kubernetes will almost certainly leave you with a more stable webservice. See the Wikitech page on the last Grid Engine migration [1] for some hints on how to migrate.
[0]: https://tools.wmflabs.org/sal/log/AW99FPYQfYQT6VcDfz3h
[1]: https://wikitech.wikimedia.org/wiki/News/Toolforge_Trusty_deprecation#Move_a...
Bryan