I got the email below telling me that my cron job running as william-avery-bot had thrown an error, and I noticed that the Grid job it kicks off hasn't run since.

I tried deleting the job using the instructions at https://wikitech.wikimedia.org/wiki/Help:Toolforge/Grid#Stopping_jobs_with_%E2%80%98qdel%E2%80%99_and_%E2%80%98jstop%E2%80%99 but the job appeared to be "stuck".

"qstat -xml" outputs the following:
<?xml version='1.0'?>
<job_info  xmlns:xsd="http://arc.liv.ac.uk/repos/darcs/sge/source/dist/util/resources/schemas/qstat/qstat.xsd">
    <job_list state="running">

But when I ssh to tools-sgeexec-0916.tools.eqiad.wmflabs I see no sign of any processes under tools.william-avery-bot, except the ones associated with my interactive session.

Can anyone help resolve this, or suggest the right venue to raise it?

Thanks in advance,


---------- Forwarded message ---------
From: Cron Daemon <root@tools.wmflabs.org>
Date: Thu, 25 Mar 2021 at 16:49
Subject: Cron <tools.william-avery-bot@tools-sgecron-01> /usr/bin/jsub -N cron-TaxonbarSyncerBot -once -quiet ~/TaxonbarSyncerBot.sh
To: <tools.william-avery-bot@tools.wmflabs.org>

error: commlib error: got select error (Connection refused)
error: unable to send message to qmaster using port 6444 on host "tools-sgegrid-shadow.tools.eqiad1.wikimedia.cloud": got send error
Traceback (most recent call last):
  File "/usr/bin/job", line 48, in <module>
    root = xml.etree.ElementTree.fromstring(proc.stdout.read())
  File "/usr/lib/python3.5/xml/etree/ElementTree.py", line 1345, in XML
    return parser.close()
xml.etree.ElementTree.ParseError: no element found: line 1, column 0
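
For what it's worth, the traceback above looks consistent with the XML parser being handed empty input, which is what I would expect if qstat printed nothing on stdout because it could not reach the qmaster. That is a guess on my part, not something I have confirmed. Below is a minimal sketch in Python that reproduces the same ParseError from an empty string and shows one way such a wrapper could guard against it. The helper name parse_qstat_xml is hypothetical and is not taken from /usr/bin/job.

```python
import xml.etree.ElementTree as ET


def parse_qstat_xml(raw):
    """Parse XML output, returning None when the output is empty.

    Hypothetical helper for illustration only; not the code in /usr/bin/job.
    """
    if not raw.strip():
        # Nothing to parse, e.g. the command failed before printing any XML.
        return None
    return ET.fromstring(raw)


if __name__ == "__main__":
    # Parsing empty input directly raises the error seen in the traceback.
    try:
        ET.fromstring(b"")
    except ET.ParseError as exc:
        print("ParseError:", exc)

    # The guarded helper returns None instead of raising.
    print(parse_qstat_xml(b""))
```

With a guard like this, the caller could report that the grid master was unreachable instead of failing with a traceback.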