Introduction

The JS7 Controller and Agent will restart jobs in a number of situations:

  • restart of a job after it terminates with an error.
  • restart of a job after an Agent restart.
  • restart of a job from the next Subagent in an Agent Cluster if the Subagent running the job becomes unreachable.

Restart Jobs after Error

If a job terminates with an error, this implies that the Agent is available and has witnessed the job's failure.

For this situation users can apply the JS7 - Retry Instruction, which specifies the number of tries and the intervals between them.

  • For Standalone Agents it will be the same Agent that restarts the job.
  • In an Agent Cluster a Subagent will be selected based on the Subagent Cluster configuration to restart the job.

If a job fails then the order is set to the failed state. While waiting for the next try in a Retry Instruction, the order will be set to the waiting state.
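The retry behavior described above can be modeled with a short Python sketch. This is an illustration of the logic only and does not use any JS7 API: the names RetryPolicy, max_tries, delays and run_with_retry are hypothetical, and reusing the last delay for further retries is an assumption of this sketch. The delays are recorded rather than actually slept.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class RetryPolicy:
    """Hypothetical model of a retry configuration (not a JS7 class)."""
    max_tries: int      # total number of tries, including the first one
    delays: List[int]   # seconds to wait before each retry

    def delay_before(self, retry_number: int) -> int:
        """Delay before the given retry (1 = first retry).

        Assumption of this sketch: the last delay is reused once the
        list of delays is exhausted.
        """
        if not self.delays:
            return 0
        return self.delays[min(retry_number, len(self.delays)) - 1]


def run_with_retry(job: Callable[[], bool], policy: RetryPolicy) -> Tuple[str, List[str]]:
    """Run the job until it succeeds or the tries are used up.

    Returns the final state and a history of what happened. The order is
    'waiting' between tries and 'failed' when no tries are left.
    """
    history: List[str] = []
    for attempt in range(1, policy.max_tries + 1):
        history.append(f"try {attempt}")
        if job():
            history.append("finished")
            return "finished", history
        if attempt < policy.max_tries:
            # No real sleep here: the sketch only records the intended delay.
            history.append(f"waiting {policy.delay_before(attempt)}s")
    history.append("failed")
    return "failed", history
```

For example, a job that fails twice and then succeeds under a policy of three tries with delays of 30 and 60 seconds passes through the waiting state twice before it finishes.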

Restart Jobs on Restart of same Agent after Crash

If an Agent becomes unreachable while executing a job then this can indicate that

  • the Agent is not running, for example after a crash.
    • In case of Agent crash the JS7 - Agent Watchdog will terminate running jobs provided that the Watchdog is active.
  • the Agent continues to run, but no connection can be established, for example in case of network errors.

In this situation

  • for Standalone Agents the Controller does not know the execution status of the job as long as the Agent is unreachable.
  • for Subagents in an Agent Cluster the Director Agent does not know the execution status of the job as long as the Subagent is unreachable.

As long as a job's execution status is unknown, the job is not restarted. This prevents double job execution in case the Agent is unreachable but continues to run the job.

If the Agent is restarted after a crash then it will restart the jobs that were running when the Agent crashed.

  • This applies to Standalone Agents and to Subagents in an Agent Cluster.
  • Jobs that must exclude the risk of double job execution can be exempted from restart by marking them as not restartable:
    JS-2151, JOC-1891

While an Agent is unreachable, related orders are set to the blocked state. No operations are available on such orders until the Agent can be reached again.
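The rules of this section can be summarized as a small decision function. The following Python sketch is a simplified model, not JS7 code: AgentStatus and action_for_running_job are hypothetical names, and the returned strings only label the outcomes described in the text.

```python
from enum import Enum


class AgentStatus(Enum):
    """Hypothetical states of the Agent as seen by the Controller or Director Agent."""
    REACHABLE = "reachable"
    UNREACHABLE = "unreachable"
    RESTARTED_AFTER_CRASH = "restarted after crash"


def action_for_running_job(agent: AgentStatus, restartable: bool = True) -> str:
    """Return what happens to a job that was running on the given Agent."""
    if agent is AgentStatus.UNREACHABLE:
        # The execution status is unknown: restarting the job could cause
        # double execution, so the order is blocked instead.
        return "order blocked, no restart"
    if agent is AgentStatus.RESTARTED_AFTER_CRASH:
        # The Agent restarts the jobs that were running when it crashed,
        # except for jobs marked as not restartable. The resulting order
        # state for such jobs is not covered by this sketch.
        return "job restarted" if restartable else "job not restarted"
    return "job continues"
```

The function makes the key point explicit: a restart is only considered once the Agent is known to have been restarted, never while its status is unknown.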

Restart Jobs from next Subagent after Reset

If a Subagent in an Agent Cluster becomes unreachable, users can reset the Subagent. This will cause its jobs to be restarted from the next Subagent.

  • The Manage Controllers/Agents page offers the Reset operation on individual Subagents. The Director Agent will consider this information and will restart jobs from the next Subagent.
    • Note: The Reset operation has to be applied to the related, unreachable Subagent, not to the Director Agent.
    • The operation should be handled with care as it can cause double job execution if the unreachable Subagent is still running the job. Before using the Reset operation users should verify that the Subagent is not running.
  • Jobs that must exclude the risk of double job execution can be exempted from restart by marking them as not restartable:
    JS-2151, JOC-1891
  • Selection of the next Subagent is based on the type of Subagent Cluster, for example fixed-priority or round-robin.
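How the next Subagent might be chosen for the two cluster types mentioned above can be sketched in Python. This is a conceptual illustration under stated assumptions, not the Director Agent's actual algorithm: the function names, the representation of Subagents as a list of names and the "unavailable" set (for example the Subagent that was reset) are all hypothetical.

```python
from typing import List, Optional, Set


def next_fixed_priority(subagents: List[str], unavailable: Set[str]) -> Optional[str]:
    """Fixed-priority: return the first available Subagent.

    Assumes 'subagents' is ordered from highest to lowest priority.
    """
    for name in subagents:
        if name not in unavailable:
            return name
    return None


def next_round_robin(subagents: List[str], last_used: str, unavailable: Set[str]) -> Optional[str]:
    """Round-robin: return the next available Subagent after 'last_used'.

    The search wraps around the end of the list and returns None if no
    Subagent is available.
    """
    start = subagents.index(last_used)
    for offset in range(1, len(subagents) + 1):
        candidate = subagents[(start + offset) % len(subagents)]
        if candidate not in unavailable:
            return candidate
    return None
```

In both variants the Subagent that was reset is excluded from the selection, so a restarted job cannot be assigned to the Subagent that became unreachable.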

Resources

