Connection Heartbeats for Master and Agent

Scope

JobScheduler Master and Agents check availability of the communication partner by regularly sending heartbeats.
Heartbeats are sent via the HTTP connection that is established by the Master to the Agent. Bi-directional heartbeats make use of this connection.
- The Agent receives HTTP POST requests from the Master and responds within a short time, independently of the completion of the command that has been requested by the Master.
- The Master keeps sending HTTP POST requests and accepting acknowledgements until the Agent sends the final response, i.e. after completion of a task.
This allows Master and Agent to check if a connection has been lost and if it can be re-established.
FEATURE AVAILABILITY STARTING FROM RELEASE 1.10.2
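
The request and acknowledgement cycle described above can be pictured with a small simulation. This is only a sketch and involves no real HTTP traffic: the class `SimulatedAgent`, the strings `"ack"` and `"final"`, and the function `run_task` are hypothetical names chosen for illustration and are not part of JobScheduler.

```python
from dataclasses import dataclass


@dataclass
class SimulatedAgent:
    """Stand-in for an Agent that answers every request quickly."""
    polls_until_done: int  # hypothetical: requests answered before the task completes

    def handle_post(self) -> str:
        # Respond at once, whether or not the requested task has finished.
        if self.polls_until_done > 0:
            self.polls_until_done -= 1
            return "ack"    # acknowledgement, tells the Master the Agent is alive
        return "final"      # final response, sent after completion of the task


def run_task(agent: SimulatedAgent, max_requests: int = 100) -> int:
    """Repeat requests until the final response arrives; return the request count."""
    for count in range(1, max_requests + 1):
        if agent.handle_post() == "final":
            return count
    raise TimeoutError("no final response received")
```

With `SimulatedAgent(polls_until_done=3)` the simulated Master sends four requests: three are acknowledged and the fourth returns the final response.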

Related Features

JS-1523

JS-1524

Concepts

Heartbeat Period:
- The period after which the Agent sends a heartbeat to the Master if no other HTTP operation is executed on behalf of the Master.
- Default: 10s
Heartbeat Timeout:
- The overall timeout that determines whether a connection is considered to be lost permanently.
- Includes the heartbeat period and the delay after which the Master will send its heartbeat.
- Default: 60s
Heartbeat Delay:
- The delay that the JobScheduler

Use Case

Kill Tasks in case of Connection Loss

If the Agent receives no heartbeats from the Master within 120 seconds then the Agent will
- assume the connection to be lost and
- kill any running tasks that have been requested by that Master.
This behavior is intended to prevent simultaneous duplicate execution of tasks by an Agent.
If the Master receives no heartbeats from the Agent within the interval between 50 and 60 seconds then it will
- consider the task lost, e.g. assume that its request for execution of the task has not been received by the Agent, and assign the task an error state,
- try to re-establish the connection to the Agent,
- repeat the request for task execution if the connection to the Agent can be established.
In this situation the Agent will
- within a configurable grace period
  - continue any running tasks.
  - try to identify duplicate requests for task execution from the Master and drop duplicate requests if the task is running.
- kill the running tasks if the grace period is exceeded.
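
The decisions on both sides can be summarized in a short sketch. All function names are hypothetical. Deriving the Master's detection window as "timeout minus period, up to timeout" is an assumption that merely reproduces the 50 to 60 second interval from the default settings, and treating the 120 seconds quoted above as the Agent's grace period is likewise an assumption.

```python
def master_detection_window(period: float = 10.0, timeout: float = 60.0) -> tuple:
    """Assumed interval in which the Master notices missing Agent heartbeats."""
    return (timeout - period, timeout)


def agent_action(silence: float, grace_period: float) -> str:
    """Agent side: keep tasks running inside the grace period, kill them after it."""
    return "continue" if silence <= grace_period else "kill"


def accept_request(task_id: str, running: set) -> bool:
    """Agent side: drop a repeated request for a task that is already running."""
    if task_id in running:
        return False          # duplicate request, dropped
    running.add(task_id)      # new task, accepted and tracked as running
    return True
```

For example, `agent_action(30, grace_period=120)` yields `"continue"`, whereas `agent_action(121, grace_period=120)` yields `"kill"`. A second call to `accept_request` with the same task identifier returns `False` as long as that task is still in the `running` set.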

Continue Tasks in case of Reconciliation

If the Master successfully re-connects to the Agent within the grace period then
- running tasks will be continued and completed by the Agent.
- the task status and execution result will be reported to the Master.
In case of reconciliation the task status, log information and execution result are available to the Master and are visible in JOC.

Configuration

The heartbeat settings are configured in the Process Classes that specify the Agent connection.
The configuration is located with the Master; no configuration items are stored with the Agent.

Settings

Heartbeat Period: http_heartbeat_period
- The period after which the Agent sends a heartbeat to the Master if no other HTTP operation is executed on behalf of the Master.
- Default: 10s
Heartbeat Timeout: http_heartbeat_timeout
- The overall timeout that determines whether a connection is considered to be lost permanently.
- Includes the heartbeat period and the delay after which the Master will send its heartbeat.
- Default: 60s

Example

heartbeat parameters

<?xml version="1.0" encoding="utf-8"?>
<process_class>
    <remote_schedulers>
        <remote_scheduler remote_scheduler="http://127.0.0.2:4445" http_heartbeat_period="10" http_heartbeat_timeout="60"/>
    </remote_schedulers>
</process_class>
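
To check such a Process Class before deploying it, the heartbeat attributes can be read with Python's standard `xml.etree.ElementTree` module. This is a minimal sketch: the function name `heartbeat_settings` is hypothetical, and the rule that the period must be smaller than the timeout is an assumption derived from the statement that the timeout includes the heartbeat period. JobScheduler itself may apply different checks.

```python
import xml.etree.ElementTree as ET

# Same configuration as in the example above, without the XML declaration.
PROCESS_CLASS = """<process_class>
    <remote_schedulers>
        <remote_scheduler remote_scheduler="http://127.0.0.2:4445" http_heartbeat_period="10" http_heartbeat_timeout="60"/>
    </remote_schedulers>
</process_class>"""


def heartbeat_settings(xml_text, default_period=10, default_timeout=60):
    """Return a list of (agent_url, period, timeout) tuples, values in seconds."""
    root = ET.fromstring(xml_text)
    settings = []
    for element in root.iter("remote_scheduler"):
        # Fall back to the documented defaults if an attribute is missing.
        period = int(element.get("http_heartbeat_period", default_period))
        timeout = int(element.get("http_heartbeat_timeout", default_timeout))
        if period >= timeout:
            # Assumed sanity rule, not a documented JobScheduler constraint.
            raise ValueError("heartbeat period must be smaller than the timeout")
        settings.append((element.get("remote_scheduler"), period, timeout))
    return settings
```

Calling `heartbeat_settings(PROCESS_CLASS)` returns one entry for the Agent at `http://127.0.0.2:4445` with a period of 10 and a timeout of 60 seconds.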

Delimitation

Connection heartbeats tend to render the use of keep-alive packets superfluous, see Connection Keep-Alive for Master and Agent.
Connection heartbeats are used to detect a connection loss and to re-establish a connection within a short time.
- They are not intended to cover longer network outages.
- They are not intended for recovery scenarios, i.e. both Master and Agent have to be up and running. If one of the components is restarted then this is considered a recovery scenario.
