Scope
- JobScheduler Master and Agents check availability of the communication partner by regularly sending heartbeats.
- Heartbeats are sent via the HTTP connection that is established by the Master to the Agent. Bi-directional heartbeats make use of this connection.
- The Agent receives HTTP POST requests from the Master and responds within a short time, independently of the completion of the command that has been requested by the Master.
- The Master keeps sending HTTP POST requests and accepting acknowledgements until the Agent sends the final response, i.e. after completion of a task.
- This allows Master and Agent to check if a connection has been lost and if it can be re-established.
- FEATURE AVAILABILITY STARTING FROM RELEASE 1.10.2
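The request cycle described in the Scope can be sketched as a small simulation. This is an illustration only, not the JobScheduler implementation: the names `Response`, `fake_agent` and `run_command` are hypothetical, and each loop iteration merely stands in for one HTTP POST request from the Master.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Response:
    final: bool    # True only for the response that completes the command
    payload: str


def fake_agent(acks_before_done: int) -> Iterator[Response]:
    """Hypothetical Agent: answers quickly with acknowledgements,
    then sends the final response once the task has completed."""
    for _ in range(acks_before_done):
        yield Response(final=False, payload="heartbeat-ack")
    yield Response(final=True, payload="task completed")


def run_command(agent: Iterator[Response]) -> List[str]:
    """Master side: keep issuing requests until the final response arrives."""
    received = []
    for response in agent:  # one iteration stands for one HTTP POST
        received.append(response.payload)
        if response.final:
            break
    return received
```

Calling `run_command(fake_agent(2))` collects two acknowledgements followed by the final response.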
Related Features
- JS-1523
- JS-1524
Concepts
- Heartbeat Period:
- The period after which the Agent sends a heartbeat to the Master if no other HTTP operation is executed on behalf of the Master.
- Default: 10s
- Heartbeat Timeout:
- The overall timeout that determines if a connection is considered to be lost permanently.
- Includes the heartbeat period and the delay after which the Master will send its heartbeat.
- Default: 60s
- Heartbeat Delay:
- The delay that the JobScheduler
Use Case
Kill Tasks in case of Connection Loss
- If the Agent receives no heartbeats from the Master within 120 seconds then the Agent will
- assume the connection to be lost and
- kill any running tasks that have been requested by that Master.
- This behavior is intended to prevent simultaneous duplicate execution of tasks by an Agent.
- If the Master receives no heartbeats from the Agent within the interval between 50 and 60 seconds then it will
- consider the task to be lost, e.g. assume that its request for execution of the task has not been received by the Agent, and assign the task an error state,
- try to re-establish the connection to the Agent,
- repeat the request for task execution if the connection to the Agent can be established.
- In this situation the Agent will
- within a configurable grace period
- continue any running tasks.
- try to identify duplicate requests for task execution from the Master and drop duplicate requests if the task is running.
- kill the running tasks if the grace period is exceeded.
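The timeout checks above can be sketched as follows. This is a minimal illustration under stated assumptions, not JobScheduler code: the class `HeartbeatMonitor` and its methods are hypothetical, and timestamps are passed in explicitly so that the example is deterministic. The value of 120 seconds is the Agent-side limit named in the text.

```python
AGENT_KILL_AFTER_S = 120.0  # from the text: Agent-side limit without heartbeats


class HeartbeatMonitor:
    """Hypothetical helper that tracks the last heartbeat from the partner."""

    def __init__(self, timeout_s: float, now: float = 0.0) -> None:
        self.timeout_s = timeout_s
        self.last_seen = now

    def record_heartbeat(self, now: float) -> None:
        # Called whenever a heartbeat from the partner arrives.
        self.last_seen = now

    def connection_lost(self, now: float) -> bool:
        # The connection counts as lost once no heartbeat has been
        # seen for the full timeout.
        return now - self.last_seen >= self.timeout_s


monitor = HeartbeatMonitor(AGENT_KILL_AFTER_S, now=0.0)
monitor.record_heartbeat(now=100.0)
lost_at_200 = monitor.connection_lost(now=200.0)  # 100 s of silence
lost_at_220 = monitor.connection_lost(now=220.0)  # 120 s of silence
```

In this sketch the Agent would kill the running tasks of that Master only once `connection_lost` reports `True`.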
Continue Tasks in case of Reconciliation
- If the Master successfully re-connects to the Agent within the grace period then
- running tasks will be continued and completed by the Agent.
- the task status and execution result will be reported to the Master.
In case of reconciliation the task status, log information and execution result are available to the Master and are visible in JOC.
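The Agent's behavior during the grace period can be sketched as two small decisions. This is an assumption-laden illustration: the names `Action`, `on_timer` and `on_task_request` are hypothetical, and the grace period is passed in as a parameter because the text only states that it is configurable.

```python
from enum import Enum
from typing import Set


class Action(Enum):
    CONTINUE_TASKS = "continue running tasks"
    KILL_TASKS = "kill running tasks"
    DROP_REQUEST = "drop duplicate request"
    START_TASK = "start task"


def on_timer(seconds_since_loss: float, grace_period_s: float) -> Action:
    """Running tasks continue within the grace period and are killed after it."""
    if seconds_since_loss > grace_period_s:
        return Action.KILL_TASKS
    return Action.CONTINUE_TASKS


def on_task_request(task_id: str, running: Set[str]) -> Action:
    """A repeated request for a task that is still running is dropped."""
    if task_id in running:
        return Action.DROP_REQUEST
    return Action.START_TASK
```

With a hypothetical grace period of 30 seconds, `on_timer(10.0, 30.0)` keeps the tasks running, while `on_timer(30.5, 30.0)` kills them.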
Configuration
- The heartbeat settings can be configured with the Process Classes that specify the Agent connection.
- The configuration is located with the Master; no configuration items are stored with the Agent.
Settings
- Heartbeat Period: http_heartbeat_period
- The period after which the Agent sends a heartbeat to the Master if no other HTTP operation is executed on behalf of the Master.
- Default: 10s
- Heartbeat Timeout: http_heartbeat_timeout
- The overall timeout that determines if a connection is considered to be lost permanently.
- Includes the heartbeat period and the delay after which the Master will send its heartbeat.
- Default: 60s
Example
Heartbeat parameters
<?xml version="1.0" encoding="utf-8"?>
<process_class>
    <remote_schedulers>
        <remote_scheduler remote_scheduler="http://127.0.0.2:4445"
                          http_heartbeat_period="10"
                          http_heartbeat_timeout="60"/>
    </remote_schedulers>
</process_class>
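A Process Class like the one in this example can be read with a few lines of Python, for instance to review the heartbeat settings of several Agents. This is a sketch only: the function `read_heartbeat_settings` is hypothetical, the fallback values 10 and 60 are the defaults stated in the Settings, and the check that the period is shorter than the timeout is an assumption derived from the statement that the timeout includes the period.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Tuple

PROCESS_CLASS_XML = b"""<?xml version="1.0" encoding="utf-8"?>
<process_class>
    <remote_schedulers>
        <remote_scheduler remote_scheduler="http://127.0.0.2:4445"
                          http_heartbeat_period="10"
                          http_heartbeat_timeout="60"/>
    </remote_schedulers>
</process_class>
"""


def read_heartbeat_settings(xml_bytes: bytes) -> Dict[str, Tuple[int, int]]:
    """Return {agent_url: (period_s, timeout_s)} for each remote_scheduler."""
    root = ET.fromstring(xml_bytes)
    settings = {}
    for element in root.iter("remote_scheduler"):
        url = element.get("remote_scheduler")
        # Fall back to the documented defaults if an attribute is absent.
        period = int(element.get("http_heartbeat_period", "10"))
        timeout = int(element.get("http_heartbeat_timeout", "60"))
        # Assumption: the timeout includes the period, so it must be larger.
        if period >= timeout:
            raise ValueError(f"{url}: period {period}s is not below timeout {timeout}s")
        settings[url] = (period, timeout)
    return settings


heartbeat_settings = read_heartbeat_settings(PROCESS_CLASS_XML)
```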
Delimitation
- Connection heartbeats tend to render the use of keep-alive packets superfluous, see Connection Keep-Alive for Master and Agent.
- Connection heartbeats are used to detect a connection loss and to re-establish a connection within a short time.
- They are not intended to cover longer network outages.
- They are not intended for recovery scenarios, i.e. both Master and Agent have to be up and running. If one of the components is restarted then this is considered a recovery scenario.