Hello All,
I was hoping that we could continue to communicate all of our ideas and
progress, even though we are far away from each other. Nothing is as
convenient as being next to each other, but I would like to set up some
form of scheduled discussion over the internet. Here is the link to the
scheduler I created:
http://when2meet.com/?1553741-9N2L8
Feel free to add any chunk of time that you would like, being mindful that
these times are ET. Also, any preferences or ideas on how this conference
should be carried out are open to discussion.
Side note: there was originally a bug in the performance code, but I am
currently unable to run it because my machine is raising an error on
deque.popleft() for a reason I have not yet investigated. That being said,
I'm unsure whether that bug is still there.
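For what it's worth, the usual cause of that error is calling popleft() on
an empty deque, which raises IndexError. A tiny sketch of the likely
failure mode (not our actual code, just an illustration):

```python
from collections import deque

jobs = deque()  # hypothetical job queue that happens to be empty

# deque.popleft() raises IndexError on an empty deque, which is the
# most common cause of this kind of crash.
try:
    job = jobs.popleft()
except IndexError:
    job = None  # queue was empty; nothing to run

# Alternatively, check before popping:
if jobs:
    job = jobs.popleft()
```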
Cheers,
Favian Contreras
Does someone want to write up the "order of operations" for a newcomer
trying our project out? I am not a newcomer, and even I do not know what
has and hasn't been done by default when setting up a new job to run. We
run the worker in one tab and the scheduler in another; do we need to enter
a job into the user interface to test a correct
installation/configuration, or is there already one by default, etc.?
Also, Marc, in order to build the .deb package using py2deb, I think I will
need superuser permissions on a Debian virtual machine. I am
encountering errors when trying to invoke
stdeb <https://pypi.python.org/pypi/stdeb>
commands.
Cheers,
John
Hello All,
Here is the Git page for the project:
https://github.com/BigFav/MegaCron
Feel free to fork the repo and link it to this one, so that you can open
pull requests against it. Also send me your GitHub usernames, either
through email or over the mailing list, so that I may add you all as
contributors to the repo.
Cheers,
Favian Contreras
Hello team,
I note that there has been little investigation of what academia has to
say on the topic; here are a few suggested readings that might
prove profitable, as they explore the same problem space:
Lin, Xiaojun, and Shahzada B. Rasool. "Constant-time distributed
scheduling policies for ad hoc wireless networks." Decision and Control,
2006 45th IEEE Conference on. IEEE, 2006.
Abawajy, Jemal H. "Fault-tolerant scheduling policy for grid computing
systems." Parallel and Distributed Processing Symposium, 2004.
Proceedings. 18th International. IEEE, 2004.
And this one, while older, still has good insight:
Bannister, Joseph A., and Kishor S. Trivedi. "Task allocation in
fault-tolerant distributed systems." Acta Informatica 20.3 (1983): 261-281.
I am certain that more can be found; one useful research hint I can give
you is that those three papers are cited often in later work, so taking a
look at the papers that cite them is likely to be profitable.
-- Marc A. Pelletier
Hello, team!
I'm glad to see you're picking up the mailing list for architectural
decision making (win!); but you might want to familiarize yourself with:
http://linux.sgms-centre.com/misc/netiquette.php
In particular, point 9 is especially salient here. The issue isn't just
one of courtesy: as a project grows, or when you join an existing
project with an active developer community, you will find that those
conventions tend to be the "social grease" that allows a high-volume
mailing list to remain reasonably useful despite the decreasing
signal-to-noise ratio.
On the substantive front, you've been throwing around some really good
ideas on the wiki; and I was wondering how much /technical/ guidance you
wanted? I'm happy with letting you proceed independently and arrive
unguided at a design -- keeping in mind that we are aiming for a first
prototype in March at the latest -- but I will gladly help with
constraining the solution set if you feel you are floundering. Do not
hesitate to ask for help!
I look forward to meeting you and making a big step forward next
weekend!
-- Marc A. Pelletier
*I suppose that this kind of prompts the question of what type of jobs
we'll be running.*
I think the goal is to be able to handle any arbitrary job.
*If the jobs involve making changes to these servers themselves, then it
seems kind of arbitrary that we'd want to split up execution - just let
each server handle its own stuff.*
I am under the impression that we will want to keep execution to a single
machine per job, unless it fails, then another one takes the job over in
its entirety. Otherwise, if we're distributing single tasks between
multiple machines, things will get very complicated.
*If we have a gatekeeper server, then everything relies on that. If that
goes down, or a link between that and the server goes down, then
nothing can get done, and the gatekeeper can potentially be a bottleneck.
Maybe the servers elect a leader, but even then, you'll need a majority
of servers to be up in order to pick something. You can't get down to one
server.*
Good point. Say we have a hypothetical build job that needs to get compiled
and moved to a particular folder to be accessed by its users. Some machine
(with several on standby in case it fails) will, after building, eventually
need to execute the final step of moving the completed build to its proper
destination folder. Perhaps if a machine fails to respond to server
communications near this closing step of the job (making its work
public), a newly elected leader will still need to restart the job from
the beginning. If the first machine starts communicating again, they both
run, and the first to finish causes the other to abort. Cleanup and
inconsistency become an issue with half-finished publishing steps, however.
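One way to shrink that cleanup problem might be to make the publishing step
itself atomic: build into a scratch file in the destination directory, then
rename it into place. Just a sketch (the function and paths are made up),
relying on the fact that os.replace is atomic within a single POSIX
filesystem:

```python
import os
import tempfile

def publish(build_bytes, dest_path):
    """Write a completed build so readers never observe a partial file.

    The temp file lives in the destination directory so that os.replace
    stays within one filesystem, where it is atomic on POSIX systems.
    If two servers race, the last replace wins, but the destination is
    always either the old build or a complete new one - never a
    half-finished publishing step.
    """
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(build_bytes)
        os.replace(tmp_path, dest_path)  # atomic rename over the target
    except BaseException:
        os.unlink(tmp_path)  # clean up the scratch file on failure
        raise
```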
John
Thanks for the response, John.
I suppose that this kind of prompts the question of what type of jobs we'll
be running.
If the jobs involve making changes to these servers themselves, then it
seems kind of arbitrary that we'd want to split up execution - just let
each server handle its own stuff.
If we have a gatekeeper server, then everything relies on that. If that
goes down, or a link between that and the server goes down, then nothing
can get done, and the gatekeeper can potentially be a bottleneck.
Maybe the servers elect a leader, but even then, you'll need a majority of
servers to be up in order to pick something. You can't get down to one
server.
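To put a number on that majority requirement (just the arithmetic, not an
actual election protocol):

```python
def quorum_size(n_servers):
    """Smallest majority of an n-server cluster."""
    return n_servers // 2 + 1

def can_elect_leader(n_servers, n_up):
    """Majority-based election succeeds only if a quorum is reachable."""
    return n_up >= quorum_size(n_servers)

# With 5 servers you can lose 2 and still elect a leader, but a lone
# survivor can never form a majority of the original cluster:
# can_elect_leader(5, 3) -> True, can_elect_leader(5, 1) -> False
```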
-Greg
On Mon, Feb 3, 2014 at 4:44 PM, John Tanner <johntanner(a)gmail.com> wrote:
I am far from an expert as well, but here are my thoughts regarding the
first point (just brainstorming).
* Due to networking problems, server A cannot communicate with server B.
A has priority for running a task. Since they cannot communicate, B never
learns that A completed the task. So B runs it too. => How much of a
problem is it if a task runs multiple times?*
In order for it to be a problem that a task ends up running multiple times,
there must be some sort of communication between the servers involved. Only
once Server A says "I'm running the job" or "I'm done", and Server B
acknowledges, do we have a known duplicate task. If Server B has not
finished the job, it aborts. If Server B has finished the job, an "I'm
done" message from Server A/B should result in changes being propagated by
*either* Server A or Server B, mutually exclusively.
The key is that the final set of changes brought about by a particular
server should only be synced after completion, and can only occur after
successful network communication (otherwise, how can it propagate to
anyone?). This seems to call for a third server acting as a sort
of gatekeeper.
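To make that concrete, the gatekeeper could be little more than a claim
table keyed by job id, where only the first server to report completion
gets to propagate its effects. Everything below is a hypothetical sketch,
not a design commitment:

```python
class Gatekeeper:
    """First completed claim per job wins; duplicates are told to discard."""

    def __init__(self):
        self._publisher = {}  # job_id -> server allowed to propagate effects

    def report_done(self, job_id, server):
        """Return True if `server` should publish its results for `job_id`.

        The first server to report completion is recorded as the
        publisher; a later report from a different server returns False,
        so duplicated work is discarded instead of propagated twice.
        Repeated reports from the winning server stay True (idempotent).
        """
        if job_id not in self._publisher:
            self._publisher[job_id] = server
            return True
        return self._publisher[job_id] == server

gate = Gatekeeper()
assert gate.report_done("build-42", "server-A") is True   # A publishes
assert gate.report_done("build-42", "server-B") is False  # B discards
```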
In the worst case, Server A and Server B have completed the identical task
in isolation, and nothing needs to change. One of them will not propagate
its effects (i.e., generation of a file, sending of an email, compilation of
source code) past the gatekeeper server, which will subsequently release
the effects to those who require them.
This, however, poses interesting questions on how to determine and
communicate which changes need to be propagated by any given cronjob.
Cheers,
John
I (GLM) edited in a few decisions to be made a couple days ago on our Wiki
page, and JT put in some ideas. I went to edit in a response, but then
realized that communicating via Wiki edits seems like a tremendously silly
idea when we have a mailing list.
The current requirements of the project include distribution of execution
and the guarantee that if at least one server is up, a task will be run.
I'm far from an expert in terms of this (or anything, for that matter), but
there are a few (perhaps naive) concerns that I have with the requirements.
From my understanding of how networking works, there's no way to guarantee
that a node is down. So I'm worried about the following scenarios:
Due to networking problems, server A cannot communicate with server B. A
has priority for running a task. Since they cannot communicate, B never
learns that A completed the task. So B runs it too. => How much of a
problem is it if a task runs multiple times?
A large number of servers are down, and the one that is up has low priority
for running a task. There's necessarily a delayed execution because the
running server has to wait for all the others to time out. This can be
somewhat mitigated by keeping a list of nodes that are up, but that can
lead to an out-of-date list, resulting in the previous problem. => How
delayed can running time be? The response on the Wiki mentioned
administrators adjusting times. Note that this likely involves making the
crontabs non-standard.
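To quantify that delay, suppose each server waits its priority rank times
the failure-detection timeout before claiming a task (a purely hypothetical
scheme with made-up numbers):

```python
def claim_delay(priority_rank, timeout_s):
    """Seconds a server waits before claiming a task.

    priority_rank 0 is the highest-priority server and runs immediately;
    each lower-priority server waits one more timeout interval so that a
    live higher-priority server always gets the chance to claim first.
    """
    return priority_rank * timeout_s

# With 10 servers and a 30 s failure-detection timeout, the
# lowest-priority server (rank 9) runs a task 270 s late in the worst
# case, when every higher-priority server is down.
# claim_delay(9, 30) -> 270
```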
An updated crontab is created and propagates through to the servers. One
server is completely disconnected though, so it doesn't receive the new
table and keeps running old commands. => How bad is it if a deleted job
runs? I'm actually not that worried about this; I think it's reasonable to
expect the sysadmin to make sure all servers get the new cron table.
I suppose the point I'm trying to make is that if you want it to be fault
tolerant and live down to a single server, it seems like you can run into
duplication or late tasks if the network isn't perfect. Is there anything
I'm missing? I'm certainly not discounting the possibility that there's a
solution (whether clever or simple) to these problems; I just don't see it.
Thanks!
-Greg
P.S. Sorry if this email is totally incoherent; it's 2:15 AM right now