...

  • Fail-over is an automated operation that occurs when the Active Controller Instance is aborted or killed. Fail-over is applied in case of abnormal termination, see JS7 - Impact of a Controller outage.
  • Switch-over is an operation that is caused by user intervention in JOC Cockpit or by use of the JS7 - REST Web Service API. Switch-over does not require termination of the Active Controller Instance; instead, it shifts the active role to the second Controller instance.

For fail-over and switch-over a dedicated Standalone Agent acting as a Cluster Watch Agent is required.

...

Cluster operations include an automated fail-over and a manual switch-over of the Active Controller Instance.

Fail-over

Fail-over occurs when the Active Controller Instance is terminated abnormally. 

Fail-over can be invoked by the following actions:

  • The Active Controller Instance is killed, for example
    • for Unix with a SIGKILL signal corresponding to the command: kill -9
    • for Windows with the command: taskkill /F
  • The operating system crashes.
  • In the JS7 - Dashboard the user performs one of the operations: 
    • Active Controller Instance action menu: Abort -> With fail-over
    • Active Controller Instance action menu: Abort and restart -> With fail-over
  • From the command line the user performs one of the operations:
    • controller.sh | .cmd abort
    • controller.sh | .cmd kill

No fail-over occurs when

  • the Active Controller Instance is stopped normally from the command line:
    • controller.sh | .cmd stop
  • the operating system is shut down and systemd / init.d or a Windows Service is in place to stop the Controller normally.
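
The cases listed above can be condensed into a small lookup. The following is a minimal sketch for illustration only and not part of JS7: the names TERMINATION_CAUSES and triggers_fail_over are hypothetical, and the keys are merely labels for the cases described in this section.

```python
# Hypothetical summary of the cases listed above; this is not a JS7 API.
# True:  the Active Controller Instance terminates abnormally -> fail-over.
# False: the Active Controller Instance terminates normally   -> no fail-over.
TERMINATION_CAUSES = {
    "kill -9": True,                                   # Unix SIGKILL
    "taskkill /F": True,                               # Windows forced kill
    "operating system crash": True,
    "dashboard: abort with fail-over": True,
    "dashboard: abort and restart with fail-over": True,
    "controller.sh abort": True,
    "controller.sh kill": True,
    "controller.sh stop": False,                       # normal stop
    "operating system shutdown with service": False,   # systemd / init.d / Windows Service
}


def triggers_fail_over(cause: str) -> bool:
    """Return True if the given termination cause results in fail-over."""
    try:
        return TERMINATION_CAUSES[cause]
    except KeyError:
        # Causes not described in this section are rejected, not guessed.
        raise ValueError(f"unknown termination cause: {cause!r}") from None
```

The lookup deliberately raises an error for any cause that is not described above instead of assuming a default.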

Fail-over happens within a short period of time, typically within 2 to 3 seconds.

Switch-over

Switch-over occurs exclusively when invoked by user intervention.

Switch-over can be invoked by the following actions:

  • In the JS7 - Dashboard the user performs one of the operations: 
    • Active Controller Instance action menu: Abort -> With fail-over
    • Active Controller Instance action menu: Abort and restart -> With fail-over

No switch-over occurs when

  • the Active Controller Instance is stopped normally from the command line:
    • controller.sh | .cmd stop
  • the operating system is shut down and systemd / init.d or a Windows Service is in place to stop the Controller normally.

Switch-over happens within a short period of time, typically within 2 to 3 seconds.

...

  • The cluster has to guarantee that only one of the two Standalone Controller instances is started at any point in time.
    • If this rule is not observed, both Controller instances will instruct Agents to execute the same workflows and jobs, which results in double job execution.
    • Controller journals will become inconsistent, holding the same orders in different state transitions.
    • In this situation the only solution is to drop both Controller instances' journals that are available from the state sub-directory, to accept that any orders are lost and to redeploy scheduling objects.
  • There is no simple way to determine whether a Controller instance is fit to manage orders.
    • Performing PID file checks is of limited use: this can prove the unavailability of a Controller instance. However, a positive PID file check does not prove that a Controller instance works.
    • Log file analysis is pointless. Controllers make heavy use of asynchronous operations when communicating with Agents. The occurrence of error messages in log files does not prevent a situation from being recovered within the next few seconds.
  • A Controller Cluster guarantees high availability when used with a JS7 - Agent Cluster. Use of Standalone Agents limits high availability.
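
The limited value of PID file checks noted above can be illustrated with a minimal sketch. It assumes a Unix system and a PID file that holds a single process ID. The function name pid_file_status and the returned labels are hypothetical and not part of JS7.

```python
import os
from pathlib import Path


def pid_file_status(pid_file: str) -> str:
    """Classify a PID file check (Unix only; hypothetical helper).

    "unavailable"    : no matching process was found.
    "process exists" : a process was found. This does NOT prove that the
                       Controller instance is able to manage orders.
    """
    try:
        pid = int(Path(pid_file).read_text().strip())
    except (OSError, ValueError):
        return "unavailable"      # missing, unreadable or malformed PID file
    if pid <= 0:
        return "unavailable"      # not a valid process ID
    try:
        os.kill(pid, 0)           # signal 0 only checks existence, nothing is sent
    except ProcessLookupError:
        return "unavailable"      # no process with this PID
    except PermissionError:
        pass                      # process exists but is owned by another user
    return "process exists"
```

Only the "unavailable" result is conclusive. A result of "process exists" says nothing about whether the instance works and should be treated as inconclusive.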

...