A couple of days ago I (GLM) added a few open decisions to our Wiki page, and JT put in some ideas. I went to edit in a response, but then realized that communicating via Wiki edits seems like a tremendously silly idea when we have a mailing list.

The current requirements of the project include distributed execution and the guarantee that if at least one server is up, a task will be run. I'm far from an expert on this (or anything, for that matter), but I have a few (perhaps naive) concerns about the requirements.

From my understanding of how networking works, there's no way to know for certain that a node is down, as opposed to merely unreachable. So I'm worried about the following scenarios:

Due to networking problems, server A cannot communicate with server B. A has priority for running a task. Since they cannot communicate, B never learns that A completed the task. So B runs it too. => How much of a problem is it if a task runs multiple times?
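To make this scenario concrete, here is a rough sketch. It assumes a simple rule that is my own guess and not anything we've agreed on: each server runs the task unless it can reach a higher-priority server that already ran it. The function name and the way links are represented are made up for illustration.

```python
def servers_that_run(priority_order, links):
    """Return the servers that end up running the task.

    priority_order: server names, highest priority first.
    links: set of frozenset({x, y}) pairs that can currently communicate.
    Assumed rule (hypothetical): a server skips the task only if it can
    reach a higher-priority server that has already run it.
    """
    runners = []
    for server in priority_order:
        # Did any server that already ran the task manage to tell us?
        heard = any(frozenset((server, r)) in links for r in runners)
        if not heard:
            runners.append(server)
    return runners


healthy = {frozenset(("A", "B"))}
partitioned = set()  # A and B are both up but cannot reach each other

print(servers_that_run(["A", "B"], healthy))
print(servers_that_run(["A", "B"], partitioned))
```

With the link intact only A runs; with the link cut, both A and B run, which is the duplication I'm asking about.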

A large number of servers are down, and the one that is up has low priority for running a task. Execution is necessarily delayed because the running server has to wait for all the others to time out. This can be somewhat mitigated by keeping a list of nodes that are up, but that list can go out of date, which leads back to the previous problem. => How delayed can execution be? The response on the Wiki mentioned administrators adjusting times. Note that this likely involves making the crontabs non-standard.
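As a back-of-the-envelope sketch of that delay: assume (my assumption, not a decided design) that a server waits one full timeout for each higher-priority server before giving up on it, one after another. The function name and the 30-second timeout below are hypothetical.

```python
def start_delay(priority_order, server, timeout_s):
    """Worst-case seconds a server waits before running the task.

    Assumes (hypothetically) that it waits one timeout per
    higher-priority server, sequentially, and that all of them are down.
    """
    rank = priority_order.index(server)  # number of higher-priority servers
    return rank * timeout_s


order = ["A", "B", "C", "D"]
print(start_delay(order, "D", 30))  # A, B and C all down: 3 timeouts
```

Under those assumptions the lowest-priority of four servers starts 90 seconds late, and the delay grows linearly with the number of servers ahead of it.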

An updated crontab is created and propagates through to the servers. One server is completely disconnected though, so it doesn't receive the new table and keeps running old commands. => How bad is it if a deleted job runs? I'm actually not that worried about this; I think it's reasonable to expect the sysadmin to make sure all servers get the new cron table.

I suppose the point I'm trying to make is that if you want the system to be fault tolerant and stay live down to a single server, it seems like you can run into duplicated or late tasks if the network isn't perfect. Is there anything I'm missing? I'm certainly not discounting the possibility that there's a solution (whether clever or simple) to these problems; I just don't see it.

Thanks!

-Greg

P.S. Sorry if this email is totally incoherent; it's 2:15 AM right now