Hi,
I got the email below telling me that my cron job running as william-avery-bot had thrown an error, and I noticed that the Grid job it kicks off hasn't run since.
I tried deleting the job using the instructions at https://wikitech.wikimedia.org/wiki/Help:Toolforge/Grid#Stopping_jobs_with_%... but the deletion appears to be "stuck".
"qstat -xml" outputs the following:

<?xml version='1.0'?>
<job_info xmlns:xsd=" http://arc.liv.ac.uk/repos/darcs/sge/source/dist/util/resources/schemas/qsta... ">
  <queue_info>
    <job_list state="running">
      <JB_job_number>9999749</JB_job_number>
      <JAT_prio>0.25319</JAT_prio>
      <JB_name>cron-TaxonbarSyncerBot</JB_name>
      <JB_owner>tools.william-avery-bot</JB_owner>
      <state>dr</state>
      <JAT_start_time>2021-03-25T17:49:16</JAT_start_time>
      <queue_name>task@tools-sgeexec-0916.tools.eqiad.wmflabs</queue_name>
      <slots>1</slots>
    </job_list>
  </queue_info>
  <job_info>
  </job_info>
</job_info>

But when I ssh to tools-sgeexec-0916.tools.eqiad.wmflabs I see no sign of any processes under tools.william-avery-bot, except the ones associated with my interactive session.
Can anyone help resolve this, or advise on the right venue to raise it?
Thanks in advance,
Will
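
For anyone who wants to check for this condition from a script, here is a minimal sketch that scans "qstat -xml" output for jobs whose state contains "d" (flagged for deletion), such as the "dr" state reported above. The function name jobs_pending_deletion and the SAMPLE string are made up for illustration. The sample is a trimmed copy of the output above, with the truncated xmlns:xsd attribute left out on the assumption that it only declares a prefix and does not change the element names.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the qstat output quoted in this message (illustrative only).
SAMPLE = """<?xml version='1.0'?>
<job_info>
  <queue_info>
    <job_list state="running">
      <JB_job_number>9999749</JB_job_number>
      <JB_name>cron-TaxonbarSyncerBot</JB_name>
      <state>dr</state>
      <queue_name>task@tools-sgeexec-0916.tools.eqiad.wmflabs</queue_name>
    </job_list>
  </queue_info>
  <job_info>
  </job_info>
</job_info>"""


def jobs_pending_deletion(qstat_xml):
    """Return (job number, name, state, queue) for jobs flagged for deletion.

    Hypothetical helper: it assumes the element names shown in the
    output above and treats any state containing "d" as "deletion requested".
    """
    root = ET.fromstring(qstat_xml)
    flagged = []
    for job in root.iter("job_list"):
        state = job.findtext("state", default="")
        if "d" in state:
            flagged.append((
                job.findtext("JB_job_number"),
                job.findtext("JB_name"),
                state,
                job.findtext("queue_name"),
            ))
    return flagged


if __name__ == "__main__":
    for number, name, state, queue in jobs_pending_deletion(SAMPLE):
        print(number, name, state, queue)
```

In practice the XML would come from running "qstat -xml" rather than from a hard-coded string. The sketch only reports what the scheduler says and does not tell you whether the process still exists on the exec host.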
---------- Forwarded message ---------
From: Cron Daemon root@tools.wmflabs.org
Date: Thu, 25 Mar 2021 at 16:49
Subject: Cron tools.william-avery-bot@tools-sgecron-01 /usr/bin/jsub -N cron-TaxonbarSyncerBot -once -quiet ~/TaxonbarSyncerBot.sh
To: tools.william-avery-bot@tools.wmflabs.org

error: commlib error: got select error (Connection refused)
error: unable to send message to qmaster using port 6444 on host "tools-sgegrid-shadow.tools.eqiad1.wikimedia.cloud": got send error
Traceback (most recent call last):
  File "/usr/bin/job", line 48, in <module>
    root = xml.etree.ElementTree.fromstring(proc.stdout.read())
  File "/usr/lib/python3.5/xml/etree/ElementTree.py", line 1345, in XML
    return parser.close()
xml.etree.ElementTree.ParseError: no element found: line 1, column 0
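
The ParseError at the end of the traceback is what ElementTree raises when it is handed empty input. That would fit the qstat call printing nothing on stdout after failing to reach the qmaster, though this is an inference from the traceback, since the source of /usr/bin/job is not shown here. The sketch below reproduces the error and shows one way a caller could guard against it. parse_qstat_output is a hypothetical helper, not part of the Toolforge tooling.

```python
import xml.etree.ElementTree as ET


def parse_qstat_output(raw):
    """Parse qstat XML, returning None when there is nothing to parse.

    Hypothetical guard: an empty or whitespace-only result is treated as
    "scheduler unreachable" instead of being passed to the XML parser.
    """
    if not raw.strip():
        return None
    return ET.fromstring(raw)


# Reproduce the failure mode: parsing empty input raises ParseError.
try:
    ET.fromstring(b"")
except ET.ParseError as exc:
    message = str(exc)
    print("ParseError:", message)

# With the guard, empty output is reported as None instead of raising.
print(parse_qstat_output(b""))
print(parse_qstat_output(b"<job_info/>").tag)
```

So the traceback itself is probably a symptom, and the underlying problem is more likely the "Connection refused" from the qmaster host on the lines before it.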