Introduction

Use of a JS7 - Controller Cluster provides high availability and is a feature subject to the JS7 - License.

  • Fail-over is an automated operation that occurs when the Primary Controller is aborted or killed. Fail-over is applied in case of abnormal termination only.
  • Switch-over is an operation that is caused by user intervention in JOC Cockpit or by use of the JS7 - REST Web Service API. Switch-over does not require termination of the active Controller instance; instead, it shifts the active role to the second Controller instance.

For fail-over and switch-over a dedicated Standalone Agent acting as a Cluster Watch Agent is required.

...

The best advice is not to apply such clustering mechanisms. Reasons include but are not limited to the following issues:

  • A Controller Cluster guarantees high availability when used with a JS7 - Agent Cluster. Use of Standalone Agents limits high availability.
  • The cluster has to guarantee that only one of the two Controller instances is started at any point in time.
    • If this rule is not observed, both Controller instances will instruct Agents to execute the same workflows and jobs, which results in double job execution.
    • Controller journals will become inconsistent, holding the same orders in different state transitions.
    • In this situation the only solution is to drop the journals of both Controller instances, which are available from the state sub-directory, to accept that any orders are lost, and to redeploy scheduling objects.
  • There is no simple way to determine whether a Controller instance is in good enough shape to manage orders.
    • Performing PID file checks is of limited use: this can prove the unavailability of a Controller instance. However, a positive PID file check does not prove that a Controller instance works.
    • Log file analysis is pointless. Controllers make heavy use of asynchronous operations when communicating with Agents. The occurrence of error messages in log files does not rule out that the situation is recovered within the next few seconds.
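
The asymmetry of PID file checks described above can be illustrated with a minimal Python sketch. It is not part of JS7: the function name, the result labels and the PID file path are hypothetical, and the check assumes a POSIX system where signal 0 only probes for process existence. The point is that the check can report "unavailable" with confidence, whereas its positive result means only that some process with that PID exists.

```python
import os
from pathlib import Path


def pid_file_check(pid_file: Path) -> str:
    """Classify a PID file check (hypothetical helper, POSIX only).

    Returns "unavailable" if no process can be found for the PID file,
    otherwise "process-exists". The latter does NOT prove that the
    Controller instance works, only that a process with this PID exists.
    """
    try:
        pid = int(pid_file.read_text().strip())
    except (FileNotFoundError, ValueError):
        # No PID file or unreadable content: no running instance is recorded.
        return "unavailable"
    if pid <= 0:
        # Non-positive values address process groups, not a single process.
        return "unavailable"
    try:
        os.kill(pid, 0)  # signal 0 sends nothing, it only checks existence
    except ProcessLookupError:
        return "unavailable"
    except PermissionError:
        # The process exists but is owned by a different account.
        return "process-exists"
    return "process-exists"


if __name__ == "__main__":
    # Hypothetical location; the actual PID file path depends on the installation.
    print(pid_file_check(Path("/var/run/js7/controller.pid")))
```

A result of "process-exists" would still have to be treated as inconclusive by any external clustering mechanism, which is why such checks are of limited use for deciding which Controller instance to start.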

Further Resources

...