On Tue, Jan 7, 2020 at 4:26 PM Maciej Jaros <egil(a)wp.pl> wrote:
> The problem is I didn't shut it down. So what did?
Routine maintenance on the Grid Engine nodes did. The shutdown
timestamps line up with this Server Admin Log (SAL) entry [0]:
"22:58 <bd808> Depooling tools-sgewebgrid-lighttpd-090[2-9]".
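
For anyone checking whether their webservice was on one of the
affected nodes, here is a minimal Python sketch that expands the host
pattern from that log entry. It assumes the "[2-9]" part is an
ordinary shell-style single-digit range; the helper name
expand_host_range is made up for this example and is not part of any
Toolforge tooling.

```python
import re


def expand_host_range(pattern):
    """Expand one [a-b] single-digit range in a host pattern.

    Hypothetical helper for illustration only. Assumes at most one
    bracketed range with single-digit bounds, as in the SAL entry.
    """
    match = re.search(r"\[(\d)-(\d)\]", pattern)
    if match is None:
        # No range present: the pattern is already a plain host name.
        return [pattern]
    start, end = int(match.group(1)), int(match.group(2))
    prefix, suffix = pattern[:match.start()], pattern[match.end():]
    return [prefix + str(digit) + suffix for digit in range(start, end + 1)]


hosts = expand_host_range("tools-sgewebgrid-lighttpd-090[2-9]")
for host in hosts:
    print(host)
```

Under that assumption the entry covers eight nodes, from
tools-sgewebgrid-lighttpd-0902 through tools-sgewebgrid-lighttpd-0909.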
The depooling process is intended to restart running webservice
workloads on new nodes in the cluster, but apparently that did not
happen in this case. Sadly, this is not very surprising. Grid Engine
is much worse at tracking system state than the Kubernetes cluster
in Toolforge.
If your tool is capable of running on our Kubernetes system (it uses
a single language runtime and does not rely on special software
installed globally), then migrating from Grid Engine to Kubernetes
will almost certainly leave you with a more stable webservice. See
the Wikitech page on the last Grid Engine migration [1] for some
hints on how to migrate.
[0]: https://tools.wmflabs.org/sal/log/AW99FPYQfYQT6VcDfz3h
[1]: https://wikitech.wikimedia.org/wiki/News/Toolforge_Trusty_deprecation#Move_…
Bryan
--
Bryan Davis Technical Engagement Wikimedia Foundation
Principal Software Engineer Boise, ID USA
[[m:User:BDavis_(WMF)]] irc: bd808