Scope


Master / Agent Resilience

...

  • Resilience includes support for a number of outage scenarios with automated and manual recovery (see JS-1518).
  • Outage Scenarios
    • Network Connection Loss
      • A short-term connection loss between Master and Agent. The Master will retry establishing the connection and re-sending requests for a configurable number of times.
      • When the connection is lost, neither the JobScheduler Master nor the Agent initially knows whether the network connection failed or a Master Service Failure occurred.
      • This scenario is intended for a connection failure that can be recovered by retrying the connection. It is not intended to recover from an on-going network outage.
    • Master Service Failure
      • Either a loss of the connection between Master and Agent that cannot be recovered within the number of retry attempts specified for the Network Connection Loss scenario
        • due to a server crash or
        • due to a JobScheduler Master crash.
      • Or an unplanned JobScheduler Master restart or server restart.
    • Database Connection Loss
      • A short-term connection loss between Master and database:
        • for a JobScheduler Active Cluster this scenario includes a period of less than 120s during which a cluster member retries attempts to establish the connection.
        • for a JobScheduler Passive Cluster this scenario has no restriction on duration. The Master can be configured to retry connecting to the database endlessly.
          • factory.ini max_db_errors=0
      • When the connection is lost, the JobScheduler Master does not know whether the database service failed or the connection failed.
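The retry behavior described for these scenarios can be illustrated with a minimal sketch. This is not JobScheduler code: the function name, its parameters and the convention that `max_attempts=0` means "retry endlessly" (mirroring the max_db_errors=0 setting above) are assumptions made for illustration.

```python
import time
from typing import Callable


def connect_with_retry(
    connect: Callable[[], bool],
    max_attempts: int,
    delay_s: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call connect() until it succeeds or the attempts are used up.

    max_attempts > 0: give up after that many failed attempts.
    max_attempts == 0: retry endlessly (assumed convention, see lead-in).
    """
    attempt = 0
    while max_attempts == 0 or attempt < max_attempts:
        attempt += 1
        if connect():
            return True   # short-term connection loss recovered
        sleep(delay_s)    # wait before the next attempt
    return False          # attempts exhausted: no longer a short-term loss
```

A caller would treat a `False` result as the point at which the outage is no longer handled as a short-term connection loss.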

...

  • Outage Scenario
    • Network Connection Loss
      • A short-term connection loss between Master and Agent. The Master will retry establishing the connection and re-sending requests for a configurable number of times.
  • Supported Scenario
    • Master/Agent Reconciliation addresses the Network Connection Loss scenario, not the Master Service Failure and Database Connection Loss scenarios.

...

  • Reconciliation Scenario
    • Applies after a Network Connection Loss between Master and Agent.
    • Includes 5 attempts to establish the normal relationship between Master and Agent after a connection loss. A delay of less than 1s is assumed between retry attempts.
    • If the connection can be re-established then running tasks are continued with the Agent, otherwise running tasks are killed.
  • Objectives
      Agent Behavior
      • By default an Agent will kill any running tasks if the connection to the Master gets lost, i.e. the above scenario is not supported (see JS-1523). The reasons for this include:
        • If a Master were unavailable for a longer period then the Agent could not report back the execution history and log information for tasks. As a result the Master would have no information on whether the job execution was successful.
      • The primary goal is to prevent duplicate simultaneous execution of jobs. Without further information from a Master, the respective Agent instance cannot know whether it will later be contacted for re-execution of the same job (which would allow a currently running task to continue on that Agent) or whether the Master will choose a different Agent (see Agent Cluster).
      • The secondary goal is to support re-establishing the communication between Master and Agent and to continue running tasks. Tasks that make use of the JobScheduler API cannot run independently from the Master and are delayed within the scope of this feature.
    • Master/Agent Heartbeats

      • The Master and Agent send heartbeats to each other.
        • The Agent receives HTTP POST requests from the Master and will respond within 5s, independently from the completion of the command that has been requested by the Master.
        • The Master will repeat sending further HTTP POST requests and accepting acknowledgements until the Agent sends the final response, i.e. after completion of a task.
      • If the Agent does not receive a heartbeat from the Master within twice that period (10s) then the Agent will assume the connection to be lost and will kill the task.
      • If the Master does not receive a heartbeat from the Agent then the Master will consider the task lost and will assign it an error state.
    • Master/Agent Reconciliation
      • For a Network Connection Loss scenario the Master and Agent provide reconciliation capabilities. With a Network Connection Loss setting configured with the Agent's process class, the Agent will show the following behavior (see JS-1524):
        • For the number of times specified for tolerated unsuccessful connection attempts the Agent will assume the Network Connection Loss scenario.
        • The Agent will continue any running tasks for up to the specified number of retry attempts to re-establish communication with the Master.
          • Reconciliation will take place if the connection between Master and Agent can be established within the number of retries and if the Master has not been restarted.
          • Otherwise the Agent will assume the Master Service Failure scenario and will kill any running tasks (see JS-1523).
        • This behavior applies to tasks that are executed by an Agent for a specific Master to which a connection has been lost. Tasks for other JobScheduler Master instances will be continued.
      Master/Agent Reconciliation
      • After connection loss the Master will regularly attempt to re-establish the HTTP connection to the Agent. This communication allows the Agent to report the execution status of running jobs back to the Master. After a successful re-connect within the Network Connection Loss scenario the Master will repeat its request for execution of the respective jobs. Each new request includes an identifier for the previous execution request that allows the Agent to identify repeated requests:
        • for a job that has been completed within the time required to re-establish the connection the Agent will report the execution result back to the Master and will not re-execute the job.
        • for a job that is still running the Agent will report the appropriate information back to the Master which will note the running tasks and update JOC accordingly.
    • Feature Availability
      • Available starting from release 1.10.2.
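The heartbeat timeout and the handling of repeated requests described above can be sketched as follows. This is an illustrative model only, not the Agent's implementation: the class, method and field names, the `tolerated_failures` setting and the returned strings are hypothetical. Only the 5s response period and the 10s timeout are taken from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

HEARTBEAT_PERIOD_S = 5.0             # response period named in the text
TIMEOUT_S = 2 * HEARTBEAT_PERIOD_S   # twice that period (10s)


@dataclass
class Task:
    state: str = "running"           # "running" or "completed"
    exit_code: Optional[int] = None


@dataclass
class AgentSketch:
    tolerated_failures: int = 0      # hypothetical setting; 0 = default behavior
    last_heartbeat: float = 0.0
    tasks: Dict[str, Task] = field(default_factory=dict)

    def on_heartbeat(self, now: float) -> None:
        """Record the time of the latest heartbeat from the Master."""
        self.last_heartbeat = now

    def decide(self, now: float, failed_attempts: int = 0) -> str:
        """Return "continue" or "kill" for the running tasks of one Master."""
        if now - self.last_heartbeat <= TIMEOUT_S:
            return "continue"        # heartbeat still within the timeout
        if failed_attempts < self.tolerated_failures:
            return "continue"        # Network Connection Loss assumed
        return "kill"                # Master Service Failure assumed

    def handle_request(self, request_id: str,
                       previous_id: Optional[str] = None) -> str:
        """Start a task, or report on the task of a repeated request."""
        known = self.tasks.get(previous_id) if previous_id else None
        if known is None:
            self.tasks[request_id] = Task()
            return "started"
        if known.state == "completed":
            return f"completed:{known.exit_code}"  # report result, no re-run
        return "running"                           # report status only
```

With the default `tolerated_failures=0` the sketch kills tasks as soon as the timeout expires, which corresponds to the default Agent behavior described under Objectives.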

    ...

    • This feature is intended to prevent simultaneous duplicate execution of jobs. It is not intended to prevent consecutive duplicate execution of jobs.
      • If a task is completed within the period implied by the retry attempts to establish the connection then this will lead to consecutive duplicate execution as the Master will request the task to be re-executed. However, this scenario applies only to jobs that run for less than 10s.
      • We recommend that your job scripts are designed to be aware of possible duplicate execution.
    • This feature covers the situation of a short-term Network Connection Loss, not of an on-going network outage.
      • A connection loss is recovered by repeated attempts to re-connect. 
      • An on-going network outage requires the Agent to work autonomously which is not in scope of this feature.
    • This feature is not intended to support a Master Service Failure scenario or Database Connection Loss scenario.
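One way to make a job script aware of possible consecutive duplicate execution, as recommended above, is a completion marker that is checked before the work starts. The following is a minimal sketch under stated assumptions: `run_once`, the marker directory and the run key are hypothetical and not part of JobScheduler. It does not protect against two simultaneous runs.

```python
from pathlib import Path
from typing import Callable


def run_once(marker_dir: Path, run_key: str, work: Callable[[], None]) -> bool:
    """Run work() unless a marker shows that run_key already completed.

    Returns True if work() was executed, False if it was skipped.
    """
    marker = marker_dir / f"{run_key}.done"
    if marker.exists():
        return False      # an earlier execution already completed
    work()
    marker.touch()        # written only after work() returned without error
    return True
```

The run key has to identify one logical execution, for example a job name combined with a business date. If `work()` raises an exception then no marker is written and a later execution runs the work again.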

    ...

    Jira issues (SOS JIRA): labels in (reconciliation)

    ...

    • The currently supported measures include manual checking of Agent task logs after failure. 
      • The execution history of jobs that completed on an Agent during the Master Service Failure period is not reported back to the Master.
      • The Agent will kill running tasks after expiration of the Network Connection Loss scenario. Therefore it is recommended to check the Agent task logs for successful or unsuccessful execution of jobs.
    • Automated recovery of the Master/Agent execution status after a Master Service Failure will be subject to future improvements.

        ...

          • See JS-1549 for more information.

        Change Management References

        Jira issues (SOS JIRA): labels in (master-recovery)

        ...

        Jira issues (SOS JIRA): labels in (database-recovery)

        ...