Connection Heartbeats for Master and Agent

Scope

JobScheduler Master and Agents check availability of the communication partner by regularly sending heartbeats.
Heartbeats are sent via the HTTP connection that is established by the Master to the Agent. Bi-directional heartbeats make use of this connection.
- The Agent receives HTTP POST requests from the Master and responds within a short time, independently of the completion of the command that the Master has requested.
- The Master keeps sending HTTP POST requests and accepting acknowledgements until the Agent sends the final response, i.e. after completion of a task.
This allows Master and Agent to check if a connection has been lost and if it can be re-established.
FEATURE AVAILABILITY STARTING FROM RELEASE 1.10.2

Related Features

- JS-1523
- JS-1524

Concepts

Heartbeat Period:
- The period after which the Agent sends a heartbeat to the Master if no other HTTP operation is being executed on behalf of the Master.
- Default: 10s
Heartbeat Timeout:
- The overall timeout that determines if a connection is considered to be lost permanently.
- Includes the heartbeat period and the delay after which the Master will send its heartbeat.
- Default: 60s
Heartbeat Delay:
- The additional time that the Master waits for the Agent's heartbeat after the Heartbeat Period has elapsed.
- Value: 2s
- This is a fixed parameter and cannot be customized.

Behavior

Assume an existing connection between a Master and an Agent. The Master and the Agent behave as follows:

If there is no connection loss:
- the Master sends an HTTP request to the Agent
- the Agent sends to the Master
  - a heartbeat after 10s if no other HTTP operation is being executed on behalf of the Master.
  - an HTTP response when an operation is executed on behalf of the Master.
In case of connection loss after the Master has sent a first HTTP request:
- the Master waits 12s for the heartbeat from the Agent to arrive
  - The Agent should answer with a heartbeat after 10s. This is the Heartbeat Period specified above.
  - The Master waits 2s more as a safety margin. This is the Heartbeat Delay specified above.
- If the heartbeat from the Agent arrives within 10s to 12s (= 10s Heartbeat Period + 2s Heartbeat Delay), any running tasks will be continued and completed by the Agent.
- If the Master has not received the heartbeat from the Agent after 12s, the Master repeats the first HTTP request, sent 12s earlier, until the Agent is able to answer
  - If the Agent is able to answer before 60s have elapsed, that is, within 48s after the first repeat of the HTTP request, any running tasks will be continued and completed by the Agent. Even though the Master has sent the HTTP request more than once, the tasks are executed only once.
  - If the Agent is not able to answer before 60s have elapsed, that is, within 48s after the first repeat of the HTTP request, the Master will kill any running tasks on the Agent.

To summarize:

- The Master may kill running tasks on the Agent as soon as the connection loss exceeds 48s. This limit case occurs if the connection is lost exactly when the Master should receive the Agent's heartbeat.
- The Master will always kill running tasks if the connection loss exceeds 60s. This is the Heartbeat Timeout specified above.
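
The timing rules above can be sketched in a few lines of Python. This is an illustration of the arithmetic only, not JobScheduler code: the names HeartbeatSettings, master_reaction and retry_window are hypothetical, and the sketch assumes that all times are measured in whole seconds from the Master's first HTTP request.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeartbeatSettings:
    # Hypothetical container; values are the defaults documented above.
    period: int = 10   # Heartbeat Period in seconds
    timeout: int = 60  # Heartbeat Timeout in seconds
    delay: int = 2     # Heartbeat Delay in seconds (fixed)


def master_reaction(answer_after, settings=HeartbeatSettings()):
    """Classify the Master's reaction.

    answer_after is the number of seconds between the Master's first
    HTTP request and the Agent's answer, or None if no answer arrives.
    """
    first_wait = settings.period + settings.delay  # 10s + 2s = 12s
    if answer_after is not None and answer_after <= first_wait:
        return "continue"  # heartbeat arrived in time
    if answer_after is not None and answer_after < settings.timeout:
        return "repeat-then-continue"  # request was repeated, tasks run once
    return "kill"  # Heartbeat Timeout reached without an answer


def retry_window(settings=HeartbeatSettings()):
    """Seconds left for repeated requests after the first wait."""
    return settings.timeout - (settings.period + settings.delay)
```

With the default settings, retry_window() yields the 48s mentioned above, and an answer after 30s falls into the "repeat-then-continue" case.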

Use Case

Kill Tasks in case of Connection Loss

If the Agent receives no heartbeats from the Master within 60 seconds then the Agent will
- assume the connection to be lost and
- kill any running tasks that have been requested by that Master.
This behavior is intended to prevent simultaneous duplicate execution of tasks by an Agent.
If the Master receives no heartbeats from the Agent within the interval between 50 and 60 seconds then it will
- consider the task to be lost, e.g. assume that its request for task execution has not been received by the Agent, and assign the task an error state,
- try to re-establish the connection to the Agent,
- repeat the request for task execution if the connection to the Agent can be established.
In this situation the Agent will
- within a configurable grace period
  - continue any running tasks.
  - try to identify duplicate requests for task execution from the Master and drop duplicate requests if the task is running.
- kill the running tasks if the grace period is exceeded.
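
The Agent-side behavior described above might be modelled as follows. This is a minimal sketch under stated assumptions, not the Agent's actual implementation: the class AgentTaskRegistry and its methods are hypothetical, and it assumes that duplicate requests can be recognized by a task identifier, which this page does not specify.

```python
class AgentTaskRegistry:
    """Hypothetical model of the Agent's bookkeeping for running tasks."""

    def __init__(self, grace_period):
        # Grace period in seconds; how it is configured is not covered here.
        self.grace_period = grace_period
        self.running = set()

    def request_start(self, task_id):
        """Return True if the task is started, False for a dropped duplicate."""
        if task_id in self.running:
            return False  # task is already running, drop the repeated request
        self.running.add(task_id)
        return True

    def on_connection_lost(self, seconds_without_master):
        """Return the task ids killed after the given time without contact."""
        if seconds_without_master <= self.grace_period:
            return []  # within the grace period: running tasks continue
        killed = sorted(self.running)
        self.running.clear()  # grace period exceeded: kill all running tasks
        return killed
```

A repeated request for a task that is still running is dropped, so the task is executed only once even if the Master sends the request several times.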

Continue Tasks in case of Reconciliation

If the Master successfully re-connects to the Agent within the grace period then
- running tasks will be continued and completed by the Agent.
- the task status and execution result will be reported to the Master.
In case of reconciliation, the task status, log information and execution result are available to the Master and are visible in JOC.

Configuration

The heartbeat settings can be configured with the Process Classes that specify the Agent connection.
The configuration is located with the Master; no configuration items are stored with the Agent.

Settings

Heartbeat Period: http_heartbeat_period
- The period after which the Agent sends a heartbeat to the Master if no other HTTP operation is being executed on behalf of the Master.
- Default: 10s
Heartbeat Timeout: http_heartbeat_timeout
- The overall timeout that determines if a connection is considered to be lost permanently.
- Includes the heartbeat period and the delay after which the Master will send its heartbeat.
- Default: 60s

Example

heartbeat settings

<?xml version="1.0" encoding="utf-8"?>
<process_class>
    <remote_schedulers>
        <remote_scheduler remote_scheduler="http://127.0.0.2:4445" http_heartbeat_period="10" http_heartbeat_timeout="60"/>
    </remote_schedulers>
</process_class>
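
To check which heartbeat values a Process Class file carries, the attributes can be read with Python's standard library. This is a sketch only: the function read_heartbeat_settings is hypothetical, and it assumes that the attribute values are whole seconds and that an omitted attribute means the documented default applies.

```python
import xml.etree.ElementTree as ET

DEFAULT_PERIOD = 10   # documented default for http_heartbeat_period
DEFAULT_TIMEOUT = 60  # documented default for http_heartbeat_timeout

PROCESS_CLASS_XML = """<?xml version="1.0" encoding="utf-8"?>
<process_class>
    <remote_schedulers>
        <remote_scheduler remote_scheduler="http://127.0.0.2:4445" http_heartbeat_period="10" http_heartbeat_timeout="60"/>
    </remote_schedulers>
</process_class>
"""


def read_heartbeat_settings(xml_text):
    """Map each Agent URL to its heartbeat period and timeout in seconds."""
    # Parse from bytes so that the encoding declaration is honored.
    root = ET.fromstring(xml_text.encode("utf-8"))
    settings = {}
    for agent in root.iter("remote_scheduler"):
        url = agent.get("remote_scheduler")
        settings[url] = {
            # Assumption: a missing attribute falls back to the default.
            "period": int(agent.get("http_heartbeat_period", DEFAULT_PERIOD)),
            "timeout": int(agent.get("http_heartbeat_timeout", DEFAULT_TIMEOUT)),
        }
    return settings
```

Calling read_heartbeat_settings(PROCESS_CLASS_XML) returns one entry for http://127.0.0.2:4445 with a period of 10 and a timeout of 60.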

Delimitation

Connection heartbeats tend to render the use of keep-alive packets superfluous, see Connection Keep-Alive for Master and Agent.
Connection heartbeats are used to detect a connection loss and to re-establish a connection within a short time.
- They are not intended to cover longer network outages.
- They are not intended for recovery scenarios, i.e. both Master and Agent have to be up and running. If one of the components is restarted then this is considered a recovery scenario.
