...
- Availability and Resilience are about the robustness of an architecture for a number of outage scenarios.
- Master / Agent Availability includes a number of architecture decisions:
- Master Clusters provide redundancy of Master instances in a network.
- Agent Bundles can be used to compensate for the outage of a server that runs an Agent.
- Master / Agent Resilience includes a number of implicit and explicit measures for:
- Master / Agent Reconciliation allows continued execution of tasks in case of short-term Network Connection Loss.
- Master Service Recovery includes supported measures after a Master Service Failure.
- Database Service Recovery includes the capability to recover in case of Database Connection Loss.
...
Feature
- Reconciliation Scenario
- Applies after a Network Connection Loss between Master and Agent.
- Includes 5 attempts to re-establish the normal relationship between Master and Agent after a connection loss. A delay of less than 1s is assumed between retry attempts.
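The bounded retry described above can be sketched as follows. This is a minimal illustration, not the product's implementation: the function name `reconnect`, the `try_connect` callback and the default delay of 0.9s are assumptions chosen to match "5 attempts" and "a delay of less than 1s".

```python
import time
from typing import Callable


def reconnect(try_connect: Callable[[], bool],
              attempts: int = 5,
              delay: float = 0.9) -> bool:
    """Try to re-establish a connection a limited number of times.

    try_connect is a hypothetical callback that returns True on success.
    Returns True as soon as one attempt succeeds, False if all attempts fail.
    """
    for attempt in range(1, attempts + 1):
        if try_connect():
            return True
        # Wait between attempts, but not after the last one.
        if attempt < attempts:
            time.sleep(delay)
    return False
```

With these assumed defaults the retry window is bounded by the four pauses plus the time each connection attempt itself takes.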
- Agent Behavior
- By default an Agent will kill any running tasks immediately if the connection to the Master gets lost, i.e. none of the above scenarios is supported (JS-1523). The reasons for this are:
- If a Master were unavailable for a longer period, the Agent could not report the execution history and log information for tasks back to it. The Master would then have no information on whether a job execution was successful.
- The primary goal is to prevent simultaneous duplicate execution of jobs. Without further information from a Master, the respective Agent instance cannot know whether it will later be contacted to re-execute the same job (in which case a currently running task could continue on the Agent) or whether the Master will choose a different Agent (see Availability Redundancy, Agent Bundle).
- If a Network Connection Loss setting is configured with the Agent's process class, the Agent will show the following behavior (JS-1524):
- The Agent will assume the Network Connection Loss scenario for as long as the number of unsuccessful connection attempts stays within the specified tolerance.
- The Agent will continue any running tasks until the specified number of retry attempts to establish the connection with the Master is exhausted.
- Reconciliation will take place if the connection between Master and Agent can be re-established within the number of retries and if the Master has not been restarted.
- Otherwise the Agent will assume the Master Service Failure scenario and will kill any running tasks.
- This behavior applies to tasks that are executed for a specific Master to which a connection has been lost. Tasks for other JobScheduler Master instances will be continued.
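The Agent behavior listed above can be summarized as a small decision function. This is a sketch under stated assumptions: the names `Action` and `agent_action` are hypothetical, and the boundary case (exactly as many failed attempts as tolerated) is assumed to still count as Network Connection Loss.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    KILL_TASKS = "kill tasks"
    CONTINUE_TASKS = "continue tasks"
    RECONCILE = "reconcile"


def agent_action(tolerated_attempts: Optional[int],
                 failed_attempts: int,
                 reconnected: bool,
                 master_restarted: bool = False) -> Action:
    """Decide what happens to tasks of the Master whose connection was lost.

    tolerated_attempts is None when no Network Connection Loss setting
    is configured (the default behavior).
    """
    # Default: kill running tasks immediately on connection loss.
    if tolerated_attempts is None:
        return Action.KILL_TASKS
    # Tolerance exceeded or Master restarted: Master Service Failure scenario.
    if failed_attempts > tolerated_attempts or master_restarted:
        return Action.KILL_TASKS
    # Connection is back within the tolerance: reconcile with the Master.
    if reconnected:
        return Action.RECONCILE
    # Still within the tolerance: keep running tasks alive.
    return Action.CONTINUE_TASKS
```

The function only covers tasks of the affected Master; tasks started for other Master instances are outside its scope and keep running.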
- Master/Agent Reconciliation
- After connection loss the Master will regularly attempt to re-establish the HTTP connection to the Agent. This communication allows the Agent to report the execution status of running jobs back to the Master.
- After a successful re-connect within the Network Connection Loss scenario the Master will repeat its request for execution of the respective jobs. Each new request includes an identifier for the previous execution request that allows the Agent to identify repeated requests:
- for a job that has been completed within the time required to re-establish the connection the Agent will report the execution result back to the Master and will not re-execute the job.
- for a job that is still running the Agent will report the appropriate information back to the Master which will note the running tasks and update JOC accordingly.
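How an Agent might use the identifier of the previous execution request to recognize a repeated request can be sketched as follows. The class `AgentSketch`, its in-memory bookkeeping and the returned status strings are assumptions for illustration only; the actual protocol between Master and Agent is not shown here.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class AgentSketch:
    # Hypothetical bookkeeping: request id -> exit code, None while the task runs.
    tasks: Dict[str, Optional[int]] = field(default_factory=dict)

    def complete(self, request_id: str, exit_code: int) -> None:
        """Record that the task started for request_id has finished."""
        self.tasks[request_id] = exit_code

    def handle_request(self, request_id: str,
                       previous_id: Optional[str] = None) -> str:
        """Start a task, or report on the task of a repeated request."""
        if previous_id is not None and previous_id in self.tasks:
            exit_code = self.tasks[previous_id]
            if exit_code is None:
                # Still running: report status, do not start a second task.
                return "running"
            # Already completed: report the result, do not re-execute.
            return f"completed with exit code {exit_code}"
        # Unknown or first request: start a new task.
        self.tasks[request_id] = None
        return "started"
```

In this sketch a repeated request never creates a second task entry, which is the property that prevents simultaneous duplicate execution.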
- Feature Availability
Starting from Release 1.10.2
Delimitation
- This feature is intended to prevent simultaneous duplicate execution of jobs; it is not intended to prevent every duplicate execution of jobs.
- If a task is completed within the period implied by the retry attempts to establish the connection, this will lead to consecutive duplicate execution as the Master will request the task to be re-executed. However, this scenario applies only to jobs that run for less than 5s.
- We recommend that your job scripts are designed to be aware of duplicate execution.
- This feature covers the situation of a short-term Network Connection Loss, not of an on-going network outage.
- A connection loss is recovered by repeated attempts to re-connect.
- An on-going network outage would require the Agent to work autonomously, which is not in scope of this feature.
- This feature is not intended to support a Master Service Failure scenario or Database Connection Loss scenario.
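One possible way to make a job script aware of consecutive duplicate execution, as recommended above, is a completion marker that is checked before the work starts. This is a sketch only: the helper `run_once` and the marker file convention are assumptions, not part of JobScheduler.

```python
from pathlib import Path
from typing import Callable


def run_once(marker: Path, work: Callable[[], None]) -> bool:
    """Run work() unless a marker shows that an earlier run completed it.

    Returns True if work() was executed, False if it was skipped.
    """
    if marker.exists():
        # An earlier execution already finished this unit of work.
        return False
    work()
    # Only mark completion after the work has succeeded.
    marker.touch()
    return True
```

A job would derive the marker path from something that identifies the unit of work, for example an order or file name. The check guards against consecutive re-execution only; it is not a lock against two simultaneous runs.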
...