I would support any temporary "ad hoc" measures that make sense to the other developers for kick-starting the server automatically from time to time.
For example, envision a cron job that semi-intelligently (or semi-stupidly!) hunts for those runaway killer processes and just kill -9's them. Whatever someone is doing to kill the machine, they should stop.
Of course, we should *also* hunt down and resolve the problems that lead to this, but really, it's just as important to keep this puppy humming along.
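To make the cron job idea concrete, here is a rough sketch of what I have in mind, not a tested tool. Everything specific in it is my own assumption: the watched command name ("httpd"), the 10-minute CPU limit, and the helper names cpu_seconds, find_runaways and reap are all hypothetical. It also assumes a ps that prints cumulative CPU time as [DD-]HH:MM:SS.

```python
import os
import signal
import subprocess

# Hypothetical settings: tune both for the real machine.
WATCHED = {"httpd"}        # command names we are willing to kill
MAX_CPU_SECONDS = 600      # anything past 10 minutes of CPU counts as runaway


def cpu_seconds(text):
    """Convert a ps TIME field ([DD-][HH:]MM:SS) into seconds."""
    days, _, clock = text.rpartition("-")
    parts = [int(p) for p in clock.split(":")]
    while len(parts) < 3:          # pad a missing hours field
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return (int(days or 0) * 24 + hours) * 3600 + minutes * 60 + seconds


def find_runaways(ps_output, watched=WATCHED, limit=MAX_CPU_SECONDS):
    """Return PIDs of watched commands whose CPU time exceeds the limit."""
    victims = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) != 3:
            continue               # skip blank or malformed lines
        pid, cpu_time, command = fields
        if command.strip() in watched and cpu_seconds(cpu_time) > limit:
            victims.append(int(pid))
    return victims


def reap():
    """List processes with ps and send SIGKILL (kill -9) to the runaways."""
    listing = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "time=", "-o", "comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    for pid in find_runaways(listing):
        try:
            os.kill(pid, signal.SIGKILL)
        except ProcessLookupError:
            pass                   # process exited on its own in the meantime

# To use this from cron, call reap() at the bottom of the script.
```

Cumulative CPU time is a deliberately crude test (the "semi-stupid" part): it will also catch a long-lived process that is behaving itself, so the watched list should only name things that are safe to kill and will be restarted.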
wikitech-l@lists.wikimedia.org