Scope
- Fault Tolerance, Resilience and Redundancy provide high availability of JobScheduler for a number of outage scenarios:
- High Availability requires the entire system, including JobScheduler, database, storage etc., to be available, not just one component.
- High Availability is oriented towards specific outage scenarios, not towards any possible failure.
- Master / Agent Resilience includes a number of measures for operational robustness:
- Master / Agent Reconciliation allows continued execution of tasks in case of recoverable Network Connection Loss.
- Master Service Recovery includes supported measures after a Master Service Failure.
- Database Service Recovery includes the capability to recover in case of Database Connection Loss.
- Master / Agent Redundancy includes a number of architecture decisions:
- Master Clusters provide redundancy of Master instances in a network.
- Agent Clusters can be used to compensate for the outage of a server that runs an Agent.
- Recovery Strategies provide an overview of the means to restore the scheduling service.
Master / Agent Resilience
- Resilience includes support for a number of outage scenarios with automated and manual recovery (JS-1518).
- Outage Scenarios
- Network Connection Loss
- A connection loss between Master and Agent. The Master will make a number of retry attempts to re-establish the connection and to re-send requests.
- At the outset of a connection loss neither the JobScheduler Master nor the Agent knows whether the network connection failed or a Master Service Failure occurred.
- This scenario is intended for a connection failure that can be recovered by retry attempts to establish a connection. It is not intended to recover from an on-going network outage.
- Master Service Failure
- Either a loss of the connection between Master and Agent that cannot be recovered within the number of retry attempts specified for the Network Connection Loss scenario
- due to a server crash or
- due to a JobScheduler Master crash.
- Or an unplanned JobScheduler Master restart or server restart.
- Database Connection Loss
- A connection loss between Master and database:
- For a JobScheduler Active Cluster this scenario covers a period of less than 120s during which a cluster member retries connecting to the database.
- For a JobScheduler Passive Cluster this scenario has no restriction on duration: the Master can be configured to retry connecting to the database endlessly.
- In case of a connection loss the JobScheduler Master cannot tell whether the database service failed or the connection failed.
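The boundary between the Network Connection Loss and Master Service Failure scenarios can be pictured as a bounded retry loop. The following is a minimal sketch only: the function name `classify_outage`, the `try_connect` callback and the `max_retries` parameter are hypothetical and do not correspond to JobScheduler settings, and a real implementation would wait between attempts.

```python
from typing import Callable


def classify_outage(try_connect: Callable[[], bool], max_retries: int) -> str:
    """Return the scenario that applies after at most max_retries attempts.

    try_connect is a hypothetical callback that returns True once the
    connection between Master and Agent is available again.
    """
    for _ in range(max_retries):
        if try_connect():
            # Recovered within the retry budget: short-term connection loss.
            return "Network Connection Loss (recovered)"
    # Retry budget exhausted: treated as a Master Service Failure.
    return "Master Service Failure"


# Simulated link that comes back on the third attempt.
attempts = iter([False, False, True])
print(classify_outage(lambda: next(attempts), max_retries=5))
# prints "Network Connection Loss (recovered)"
```

If the connection never returns within the retry budget, the same function reports the Master Service Failure scenario.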
Master / Agent Reconciliation
Scenario
- Outage Scenario
- Network Connection Loss
- A connection loss between Master and Agent. The Master will make a number of retry attempts to re-establish the connection and to re-send requests.
- Supported Scenario
- Master/Agent Reconciliation addresses the Network Connection Loss scenario, not the Master Service Failure and Database Connection Loss scenarios.
Feature
- Reconciliation Scenario
- Applies after a Network Connection Loss between Master and Agent.
- If the connection can be re-established then running tasks are continued with the Agent, otherwise running tasks are killed.
- Objectives
- If a Master were not available for a longer period then the Agent could not report back the execution history and log information for tasks. As a result the Master would have no information on whether the job execution was successful.
- The primary goal is to prevent duplicate simultaneous execution of jobs. Without further information from a Master the respective Agent instance cannot know whether it will later be contacted for re-execution of the same job (which would allow a currently running task to continue on that Agent) or whether the Master will choose a different Agent (see Redundancy, Agent Bundle).
- The secondary goal is to support re-establishing the communication between Master and Agent and to continue running tasks. Tasks that make use of the JobScheduler API cannot run independently from the Master and are delayed within the scope of this feature.
Master/Agent Heartbeats
- The Master and Agent send heartbeats to each other.
- The Agent receives HTTP POST requests from the Master and will respond within 5s, independently of the completion of the command that has been requested by the Master.
- The Master keeps sending HTTP POST requests and accepting acknowledgements until the Agent sends the final response, i.e. after completion of a task.
- If the Agent does not receive a heartbeat from the Master within twice that period (10s) then the Agent will assume the connection to be lost and will kill the task.
- If the Master does not receive a heartbeat from the Agent then the Master will consider the task lost and will assign it an error state.
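The Agent-side heartbeat timeout described above can be sketched as follows. This is an illustration under stated assumptions, not the Agent's actual implementation: the class `AgentHeartbeatMonitor` and its methods are hypothetical names, and time is passed in explicitly so that the example stays deterministic. Only the 5s and 10s values are taken from the text.

```python
HEARTBEAT_PERIOD_S = 5.0                      # response period named in the text
HEARTBEAT_TIMEOUT_S = 2 * HEARTBEAT_PERIOD_S  # "twice that period" = 10s


class AgentHeartbeatMonitor:
    """Tracks the time of the last heartbeat received from the Master."""

    def __init__(self, now: float) -> None:
        self.last_heartbeat = now

    def on_heartbeat(self, now: float) -> None:
        # Called whenever a request from the Master arrives.
        self.last_heartbeat = now

    def connection_lost(self, now: float) -> bool:
        # True once more than 10s have passed without a heartbeat.
        return now - self.last_heartbeat > HEARTBEAT_TIMEOUT_S


monitor = AgentHeartbeatMonitor(now=0.0)
monitor.on_heartbeat(now=4.0)
print(monitor.connection_lost(now=13.0))  # False: 9s since last heartbeat
print(monitor.connection_lost(now=14.5))  # True: 10.5s since last heartbeat
```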
- Master/Agent Reconciliation
- For a Network Connection Loss scenario the Master and Agent provide reconciliation capabilities (JS-1524):
- The Agent will continue any running tasks while the Master makes up to the specified number of retry attempts to re-establish communication.
- Reconciliation will take place if the connection between Master and Agent can be established within the number of retries and if the Master has not been restarted.
- Otherwise the Agent will assume the Master Service Failure scenario and will kill any running tasks (JS-1523).
- This behavior applies to tasks that are executed by an Agent for a specific Master to which a connection has been lost. Tasks for other JobScheduler Master instances will be continued.
- After a successful re-connect within the Network Connection Loss scenario the Master will repeat its request for execution of the respective jobs. Each new request includes an identifier for the previous execution request that allows the Agent to identify repeated requests:
- for a job that has been completed within the time required to re-establish the connection the Agent will report the execution result back to the Master and will not re-execute the job.
- for a job that is still running the Agent will report the appropriate information back to the Master which will note the running tasks and update JOC accordingly.
- Feature Availability
- Available starting from release 1.10.2.
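How an Agent might use the identifier of the previous execution request to recognize a repeated request can be sketched as below. This is a simplified model under stated assumptions: the `Agent` and `Task` classes, the `handle_request` method and the `previous_id` parameter are hypothetical and do not reflect the actual Master/Agent protocol.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Task:
    job: str
    state: str = "running"            # "running" or "completed"
    exit_code: Optional[int] = None


@dataclass
class Agent:
    tasks: Dict[str, Task] = field(default_factory=dict)

    def handle_request(self, request_id: str, job: str,
                       previous_id: Optional[str] = None) -> str:
        # Look up the task started by the previous request, if one is named.
        known = self.tasks.get(previous_id) if previous_id else None
        if known is None:
            # Not a repeated request: start a new task.
            self.tasks[request_id] = Task(job)
            return "started"
        if known.state == "completed":
            # Job finished during the outage: report the result, do not re-run.
            return f"completed exit_code={known.exit_code}"
        # Job is still running: report that, do not start a second task.
        return "still running"


agent = Agent()
print(agent.handle_request("r1", "nightly_export"))                    # started
print(agent.handle_request("r2", "nightly_export", previous_id="r1"))  # still running
agent.tasks["r1"].state = "completed"
agent.tasks["r1"].exit_code = 0
print(agent.handle_request("r3", "nightly_export", previous_id="r1"))  # completed exit_code=0
```

In this sketch a repeated request never creates a second task, which mirrors the goal of avoiding simultaneous duplicate execution.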
Delimitation
- This feature is intended to prevent simultaneous duplicate execution of jobs. It is not intended to prevent consecutive duplicate execution of jobs.
- If a task is completed within the period covered by the retry attempts to establish the connection then this will lead to consecutive duplicate execution as the Master will request the task to be re-executed. However, this scenario applies only to jobs that run for less than 10s.
- We recommend that your job scripts are designed to be aware of possible duplicate execution.
- This feature covers the situation of a short-term Network Connection Loss, not of an on-going network outage.
- A connection loss is recovered by repeated attempts to re-connect.
- An on-going network outage requires the Agent to work autonomously which is not in scope of this feature.
- This feature is not intended to support a Master Service Failure scenario or Database Connection Loss scenario.
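One common way to make a job tolerate consecutive duplicate execution is a completion marker that is checked before the work is repeated. The sketch below illustrates that general technique; it is not a JobScheduler feature, and the function `run_once` and the marker file name are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Callable


def run_once(marker: Path, work: Callable[[], None]) -> bool:
    """Run work() unless a marker from an earlier completed run exists.

    Returns True if the work was executed, False if it was skipped.
    The marker is written only after work() returns, so a failed run
    can be repeated.
    """
    if marker.exists():
        return False
    work()
    marker.touch()
    return True


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical marker, keyed by the unit of work (here a business date).
    marker = Path(tmp) / "export-2024-01-31.done"
    calls = []
    print(run_once(marker, lambda: calls.append("export")))  # True: first request runs
    print(run_once(marker, lambda: calls.append("export")))  # False: repeated request skipped
```

This guards against consecutive duplicates only. It does not protect against two simultaneous runs, which is the case the reconciliation feature itself addresses.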
Change Management References
Master Service Recovery
Scenario
- Outage Scenario
- Master Service Failure
- A loss of the connection between Master and Agent that cannot be established within the number of retry attempts specified for the Network Connection Loss scenario (see Master / Agent Reconciliation) or
- A JobScheduler Master restart or server restart.
- Supported Scenario
- Master Service Recovery addresses the Master Service Failure scenario, not any scenario for Network Connection Loss or Database Connection Loss.
Feature
- The JobScheduler Master can be configured to start in paused mode after a Master Service Failure (JS-1522).
- The paused mode prevents all jobs from being started (JS-1511).
- This applies to:
- jobs that have previously been requested for execution with Agents,
- jobs that have been enqueued and
- jobs that are scheduled for execution using start time events, file events or external events.
- All job starts that are delayed due to paused mode will be executed after the JobScheduler Master is continued.
- This also applies to jobs that are enqueued while paused mode is active.
- The operation to continue JobScheduler is available with JOC.
- Paused mode allows users to manually check the job history and optionally remove enqueued tasks if Agent Reconciliation has not taken place.
- The Agent stores log files of jobs during execution. If an execution result cannot be reported to the Master then the log file will be retained, otherwise it will be removed (JS-1521).
- Paused mode can be configured to be applied automatically in case of restart of a JobScheduler Master after failure (JS-1522).
- Feature Availability
- Available starting from release 1.10.2.
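The effect of paused mode on job starts can be modelled with a simple holding queue. The following is a conceptual sketch only: the `Master` class and its methods are hypothetical and do not represent the JobScheduler API or its configuration.

```python
from collections import deque
from typing import Deque, List


class Master:
    """Toy model of a Master that can start in paused mode."""

    def __init__(self, start_paused: bool) -> None:
        self.paused = start_paused
        self.pending: Deque[str] = deque()   # job starts delayed by paused mode
        self.started: List[str] = []         # job starts actually performed

    def request_start(self, job: str) -> None:
        # While paused, job starts are held back instead of executed.
        if self.paused:
            self.pending.append(job)
        else:
            self.started.append(job)

    def remove_pending(self, job: str) -> None:
        # Manual clean-up after checking the job history.
        self.pending.remove(job)

    def resume(self) -> None:
        # Continuing the Master releases all delayed job starts in order.
        self.paused = False
        while self.pending:
            self.started.append(self.pending.popleft())


master = Master(start_paused=True)
master.request_start("job_a")
master.request_start("job_b")
master.remove_pending("job_a")   # e.g. job_a is known to have completed already
master.resume()
print(master.started)            # ['job_b']
```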
Delimitation
- The currently supported measures include manual checking of Agent task logs after failure.
- The execution history of jobs that completed on an Agent during the Master Service Failure period is not reported back to the Master.
- The Agent will kill running tasks after expiration of the Network Connection Loss scenario. Therefore it is recommended that the Agent task logs are checked for successful or unsuccessful execution of jobs.
- Automated recovery of the Master/Agent execution status after a Master Service Failure will be subject to future improvements.
Change Management References
Database Service Recovery
Scenario
- Outage Scenario
- Database Connection Loss
- A connection loss between JobScheduler Master and its database.
- JobScheduler cannot determine whether a database failed or a connection loss occurred and will handle both events according to the Database Connection Loss scenario.
- Supported Scenario
- Database Service Recovery addresses the Database Connection Loss scenario, not any scenario for Network Connection Loss or Master Service Failure.
Feature
- If a transaction failure occurs the JobScheduler Master will try to roll back the transaction and will disconnect from the database.
- If the connection to a database gets lost or if a transaction failure occurs then the JobScheduler Master will try to re-connect every 60s (JS-1032, JS-1283).
- A Master single instance can be configured to repeat an unlimited number of connection attempts.
- A Master Active Cluster member requires the database connection to become available within less than 120s. Otherwise the cluster member terminates in order to prevent duplicate execution of jobs in the cluster.
- In case of Database Connection Loss the JobScheduler Master will switch to paused mode, i.e. any execution of new tasks will be postponed (JS-1511).
- If the connection to the database can be re-established within the same JobScheduler session then all postponed tasks will be executed immediately.
- If the connection to the database is established after a JobScheduler restart then
- previously enqueued tasks will be executed immediately.
- start times for scheduled tasks will be re-calculated, i.e. tasks that have been scheduled for the period in which the JobScheduler was not active will not be executed.
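The two re-connect policies described above (unlimited attempts for a single instance, a limit of less than 120s for an Active Cluster member) can be sketched with simulated time. The function `reconnect`, the `max_wait_s` parameter and the helper `succeeds_on` are hypothetical names used for illustration. Only the 60s interval and the 120s limit are taken from the text.

```python
from typing import Callable, Optional

RETRY_INTERVAL_S = 60  # re-connect interval named in the text


def reconnect(try_connect: Callable[[], bool],
              max_wait_s: Optional[int]) -> Optional[int]:
    """Return the simulated seconds until re-connect, or None on give-up.

    max_wait_s=None models unlimited attempts (single instance);
    max_wait_s=120 models an Active Cluster member, which terminates
    if the database is not available within less than 120s.
    """
    elapsed = 0
    while True:
        elapsed += RETRY_INTERVAL_S          # simulated wait, no real sleep
        if max_wait_s is not None and elapsed >= max_wait_s:
            return None                      # limit reached: member terminates
        if try_connect():
            return elapsed


def succeeds_on(attempt_no: int) -> Callable[[], bool]:
    """Helper: a connect function that succeeds on the given attempt."""
    calls = iter(range(1, attempt_no + 1))
    return lambda: next(calls) == attempt_no


print(reconnect(succeeds_on(1), max_wait_s=120))   # 60
print(reconnect(succeeds_on(3), max_wait_s=120))   # None
print(reconnect(succeeds_on(3), max_wait_s=None))  # 180
```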
Delimitation
- The capability to re-connect to a database does not imply that JobScheduler will cope with data loss. In fact, JobScheduler relies on the job history and job-related status information being consistent and available with the database.
- For use with replicated databases keep in mind that the delay that is caused by replication can result in data loss.
- Depending on the DBMS this delay might be short, however, it might result in duplicate execution of jobs if the information about a previous job run is not available with the replicated database in case of fail-over.
- To our knowledge replicated databases are frequently used to achieve a database availability of up to approx. 99.9%.
- For use with clustered databases JobScheduler does not rely on vendor-specific connection continuity mechanisms but complies with JDBC standards available with all DBMS products and will always re-connect after connection loss or occurrence of a failed transaction.
- Unsupported vendor-specific mechanisms include e.g. SQL Server® multi-subnet clustering or MySQL® with Galera® JDBC fail-over that expect the client to switch the connection to some different address.
- In case of fail-over the clustered database is expected to be available with the same connection attributes, e.g. hostname, port. This can include mechanisms such as DNS switching to make a different database server the primary server in case of fail-over.
- To our knowledge clustered databases are frequently used to achieve a database availability of up to approx. 99.999%.
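To put the availability figures into perspective, they can be converted into the downtime they allow per year. The following is plain arithmetic assuming a 365-day year; the function name is chosen for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, assuming a non-leap year


def downtime_hours_per_year(availability_percent: float) -> float:
    """Maximum downtime per year that a given availability still allows."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


# 99.9% availability allows about 8.76 hours of downtime per year.
print(round(downtime_hours_per_year(99.9), 2))
# 99.999% availability allows about 5.26 minutes of downtime per year.
print(round(downtime_hours_per_year(99.999) * 60, 2))
```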
Change Management References