*I suppose that this kind of prompts the question of what type of jobs we'll be running.*
I think the goal is to be able to handle any arbitrary job.
*If the jobs involve making changes to these servers themselves, then it seems kind of arbitrary that we'd want to split up execution - just let each server handle its own stuff.*
I am under the impression that we will want to keep execution to a single machine per job; only if that machine fails does another one take the job over in its entirety. Otherwise, if we're distributing single tasks between multiple machines, things will get very complicated.
*If we have a gatekeeper server, then everything relies on that. If that goes down, or a link between that and the server goes down, then nothing can get done, and the gatekeeper can potentially be a bottleneck. Maybe the servers elect a leader, but even then, you'll need a majority of servers to be up in order to pick something. You can't get down to one server.*
Good point. Say we have a hypothetical build job whose output needs to be compiled and moved to a particular folder where its users can access it. After building, some machine (with several on standby in case it fails) will eventually need to execute the final step of moving the completed build to its destination folder. Perhaps if a machine stops responding to the other servers as it nears this closing step of the job (making its work public), we will still need a newly elected leader to restart the job from the beginning. If the first machine starts communicating again, both run, and the first to finish causes the other to abort. Cleanup and inconsistency become an issue for half-finished publishing steps, however.
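One way to shrink the "half-finished publishing step" problem for the build example is to make the publish step a single rename. The sketch below is only an illustration under assumptions the thread does not state: the build directory and the destination live on the same POSIX filesystem that both machines can see, and `publish_atomically` is a hypothetical helper name.

```python
import os
import tempfile


def publish_atomically(build_dir: str, dest: str) -> bool:
    """Try to make a finished build public with one rename.

    Returns True if this build was published, False if the rename was
    refused (for example because another machine already published a
    non-empty build at `dest`, or the paths are on different filesystems).
    """
    try:
        # On POSIX, renaming a directory onto an existing non-empty
        # directory raises OSError, so a later finisher cannot overwrite
        # the build that was published first.
        os.rename(build_dir, dest)
        return True
    except OSError:
        return False
```

With this shape, users either see no build or a complete one; the machine that gets False back is the one that should abort and clean up its own build directory.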
John
On Mon, Feb 3, 2014 at 6:49 PM, foa-cron-request@lists.wikimedia.org wrote:
Send Foa-cron mailing list submissions to foa-cron@lists.wikimedia.org
To subscribe or unsubscribe via the World Wide Web, visit https://lists.wikimedia.org/mailman/listinfo/foa-cron or, via email, send a message with subject or body 'help' to foa-cron-request@lists.wikimedia.org
You can reach the person managing the list at foa-cron-owner@lists.wikimedia.org
When replying, please edit your Subject line so it is more specific than "Re: Contents of Foa-cron digest..."
Today's Topics:
- Re: Foa-cron Digest, Vol 2, Issue 1 (John Tanner)
- Re: Fault Tolerant vs. Live (Gregory Manis)
Message: 1
Date: Mon, 3 Feb 2014 13:44:08 -0800
From: John Tanner johntanner@gmail.com
To: foa-cron@lists.wikimedia.org
Subject: Re: [Foa-cron] Foa-cron Digest, Vol 2, Issue 1
Message-ID: <CAOw8P7CY7BdTnehpNG+dCLE64RjaJqd0bwKdEh1POByJT9pe0g@mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"
I am far from an expert as well, but here are my thoughts regarding the first point (just brainstorming).
*Due to networking problems, server A cannot communicate with server B. A has priority for running a task. Since they cannot communicate, B never learns that A completed the task. So B runs it too. => How much of a problem is it if a task runs multiple times?*
A task running multiple times only becomes a known problem once the servers involved communicate with each other. Only once Server A says "I'm running the job" or "I'm done", and Server B acknowledges, do we have a known duplicate task. If Server B has not finished the job, it aborts. If Server B has finished the job, an "I'm done" message from either server should result in the changes being propagated by *either* Server A or Server B, but never both.
The key is that the final set of changes brought about by a particular server should only be synced after completion, and syncing can only occur after successful network communication (otherwise, how can it propagate to anyone?). This seems to call for a third server acting as a sort of gatekeeper.
In the worst case, Server A and Server B have each completed the identical task in isolation, and nothing needs to change. Only one of them will propagate its effects (i.e. generation of a file, sending of an email, compilation of source code) past the gatekeeper server, which will subsequently release the effects to those who require them.
This, however, raises interesting questions about how to determine and communicate which changes need to be propagated by any given cronjob.
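The gatekeeper rule described above (only one finisher's effects get released) can be sketched roughly as follows. This is a minimal in-process illustration, not a design: the `Gatekeeper` class, the `report_done` method, and keying each run by job name plus scheduled time are all assumptions made for the example, and a real gatekeeper would sit behind the network and need its own fault tolerance.

```python
import threading


class Gatekeeper:
    """Hypothetical gatekeeper: releases the effects of only the first
    server that reports a given scheduled run as done."""

    def __init__(self):
        self._lock = threading.Lock()
        self._winner = {}  # (job_id, scheduled_time) -> name of first finisher

    def report_done(self, job_id, scheduled_time, server):
        """Return True if this server's effects should be released,
        False if another server already finished the same run."""
        key = (job_id, scheduled_time)
        with self._lock:
            # setdefault records the first reporter and returns that same
            # recorded name on every later call for the same run.
            return self._winner.setdefault(key, server) == server
```

If Server A and Server B both finish the same run in isolation, whichever reaches the gatekeeper first gets True and the other gets False, so its effects are dropped.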
Cheers, John
On Mon, Feb 3, 2014 at 4:01 AM, <foa-cron-request@lists.wikimedia.org> wrote:
Today's Topics:
- Fault Tolerant vs Live (Gregory Manis)
Message: 1
Date: Mon, 3 Feb 2014 02:15:30 -0500
From: Gregory Manis glm79@cornell.edu
To: foa-cron@lists.wikimedia.org
Subject: [Foa-cron] Fault Tolerant vs Live
Message-ID: <CAFe==+-yLy_siAFm2Ya7cATCOJYCvHA1_CfN0TAmzW4DGGZ=Uw@mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"
I (GLM) edited in a few decisions to be made a couple days ago on our Wiki page, and JT put in some ideas. I went to edit in a response, but then realized that communicating via Wiki edits seems like a tremendously silly idea when we have a mailing list.
The current requirements of the project include distribution of execution and the guarantee that if at least one server is up, a task will be run. I'm far from an expert in terms of this (or anything for that matter), but there are a few (perhaps naive) concerns that I have with the requirements.
From my understanding of how networking works, there's no way to guarantee that a node is down. So I'm worried about the following scenarios:
Due to networking problems, server A cannot communicate with server B. A has priority for running a task. Since they cannot communicate, B never learns that A completed the task. So B runs it too. => How much of a problem is it if a task runs multiple times?
A large number of servers are down, and the one that is up has low priority for running a task. There's necessarily a delayed execution because the running server has to wait for all the others to time out. This can be somewhat mitigated by keeping a list of nodes that are up, but that can lead to an out of date list resulting in the previous problem. => How delayed can running time be? The response on the Wiki mentioned administrators adjusting times. Note that this likely involves making the crontabs non-standard.
An updated crontab is created and propagates through to the servers. One server is completely disconnected though, so it doesn't receive the new table and keeps running old commands. => How bad is it if a deleted job runs? I'm actually not that worried about this; I think it's reasonable to expect the sysadmin to make sure all servers get the new cron table.
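To put a rough number on the second scenario (a low-priority server waiting for the others to time out), here is a small sketch. It assumes a scheme the thread does not spell out: servers are ranked by priority with rank 0 highest, and each server waits one fixed timeout per higher-priority server before running. The function names and the 30-second figure in the usage line are made up for illustration.

```python
def start_delay(priority_rank: int, timeout_s: float) -> float:
    """Seconds a server waits past the scheduled time before running,
    assuming it waits one timeout for every higher-priority server.
    Rank 0 is the highest priority and does not wait."""
    return priority_rank * timeout_s


def worst_case_delay(num_servers: int, timeout_s: float) -> float:
    """Delay when only the lowest-priority server is still up."""
    return start_delay(num_servers - 1, timeout_s)
```

For example, `worst_case_delay(5, 30)` gives 120, so with five servers and a 30-second timeout the last server starts two minutes late under this assumed scheme.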
I suppose the point I'm trying to make is that if you want it to be fault tolerant and live down to a single server, it seems like you can run into duplication or late tasks if the network isn't perfect. Is there anything I'm missing? I'm certainly not discounting the possibility that there's a solution (whether clever or simple) to these problems; I just don't see it.
Thanks!
-Greg
P.S. Sorry if this email is totally incoherent; it's 2:15 AM right now
*I think the goal is to be able to handle any arbitrary job.*
*I am under the impression that we will want to keep execution to a single machine per job, unless it fails, then another one takes the job over in its entirety. Otherwise, if we're distributing single tasks between multiple machines, things will get very complicated.*
I suppose the point I was trying to make there was that, from my understanding, the jobs the servers are going to complete probably aren't things that edit properties of each other (otherwise each server would handle itself) but rather jobs that affect outside things, so you wouldn't necessarily have a propagation period.
*Perhaps if a machine fails to respond to server communications nearing this closing step of the job (making its work public)*
The problem with this is that it would require all commands to be rewritten from a distributed cron perspective. Commands don't inherently have an almost-done state, and for some (like changing the properties of a SQL table), once you start executing you're already changing front-facing things.
______________________________________________________
I keep on thinking of weird degenerate scenarios that will probably never happen. One possible problem with the majority electing a leader is as follows:
There are groups of servers in two data centers; n in location A, n+1 in location B. The leader is in location A. Something takes out the link between the two data centers. The servers in location B elect a new leader and proceed to do their jobs (say sending out emails). The servers in location A can all talk to the leader, so they also continue doing their jobs. Everyone gets two emails as a result.
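One common way to avoid the double email in this scenario, which nobody in the thread has proposed yet, is to require that a leader keeps acting only while it can still reach a strict majority of the whole cluster. The sketch below just does that arithmetic for the n versus n+1 split; `has_quorum` and `sides_allowed_to_run` are hypothetical names.

```python
def has_quorum(reachable: int, cluster_size: int) -> bool:
    """True if `reachable` servers (counting the caller itself) form a
    strict majority of the whole cluster."""
    return reachable > cluster_size // 2


def sides_allowed_to_run(n: int) -> dict:
    """The scenario above: n servers in location A, n + 1 in location B,
    and the link between the two locations is cut."""
    cluster_size = 2 * n + 1
    return {
        "A": has_quorum(n, cluster_size),      # old leader's side
        "B": has_quorum(n + 1, cluster_size),  # side that elects a new leader
    }
```

Under this rule the old leader in location A would stop running jobs because it sees only n of 2n+1 servers, and only location B sends the emails. The cost is the one already raised earlier in the thread: `has_quorum(1, 3)` is False, so this approach cannot get down to one server.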
-Greg
On Tue, Feb 4, 2014 at 12:09 AM, John Tanner johntanner@gmail.com wrote: