Hello,
Today we disabled BigBrother in Toolforge. BigBrother was a tool that
monitored continuous jobs and restarted them when they hit corner cases
that Grid Engine wasn't smart enough to recover from on its own (e.g.
running out of memory). In effect, it duplicated Grid Engine's restart
functionality on a layer above Grid Engine.
Although very few tools used BigBrother (0.65% to be more precise), it
taxed our NFS file server constantly so keeping it around didn't make
much sense. Additionally, its functionality could be easily implemented
with a shell script running from cron.
So we've converted every tool that had a .bigbrotherrc file to use a
bigbrother.sh script that cron triggers every 5 minutes to restart
jobs. If your tool used BigBrother, please check your crontab
(`crontab -l`); you should see a few entries like this:
```
# Ensure continuous jobs are running
*/5 * * * * jlocal /data/project/tool_name/bigbrother.sh job_name job_script
```
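For anyone curious what a cron-driven check of this kind can look like,
here is a minimal sketch. It is not the actual bigbrother.sh installed
in your tool directory, whose contents may differ. It assumes that
`qstat -j <name>` exits non-zero when Grid Engine has no job with that
name, and that `jstart -N <name> <script>` resubmits a continuous job.
The function names and the QSTAT/JSTART variables are hypothetical,
added only so the logic can be tried without a grid.

```shell
#!/bin/sh
# Hypothetical watchdog sketch; NOT the deployed bigbrother.sh.
# Usage: watchdog.sh job_name job_script

# Commands are overridable (assumption for illustration and testing).
QSTAT="${QSTAT:-qstat}"
JSTART="${JSTART:-jstart}"

# Succeeds if Grid Engine still knows a job with this name.
# Assumes `qstat -j NAME` exits non-zero for an unknown job.
job_is_running() {
    "$QSTAT" -j "$1" >/dev/null 2>&1
}

# Resubmits the job only when it is missing from the queue.
ensure_running() {
    job_name=$1
    job_script=$2
    if job_is_running "$job_name"; then
        return 0
    fi
    "$JSTART" -N "$job_name" "$job_script"
}

# Run the check only when called with exactly two arguments.
if [ "$#" -eq 2 ]; then
    ensure_running "$1" "$2"
fi
```

Because the check does nothing while the job is still queued or
running, it is safe to call it every few minutes from cron.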
Documentation has also been updated to reflect this change:
https://wikitech.wikimedia.org/wiki/Help:Toolforge/Grid#Bigbrother_(Depreca…
In our tests everything worked fine, but please let us know if your
tool is affected by this change.
Regards,
--
Giovanni Tirloni
Operations Engineer
Wikimedia Cloud Services