...

Primary and Secondary Controller instances require a single Agent to be available that acts as an arbitrator in case of fail-over and switch-over.

Start-up of Controller Instances

  • On start-up both Primary and Secondary Controller instances connect to the Cluster Watch Agent.
    • The Cluster Watch Agent gives its vote.

Failure of the Active Controller Instance

  • The Cluster Watch Agent knows immediately when the active Controller instance is down.
  • The standby Controller instance similarly holds a connection to the active Controller instance and knows immediately when this connection is interrupted.
  • This is the point in time when the standby Controller instance and the Cluster Watch Agent check whether they find “common ground”. This works similarly to a funeral society: they determine whether they consider the active Controller instance dead and, after a very short period (1-2s of crying tears), they proceed and give their two votes on whether the standby Controller instance should now become the active one.
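The two-vote decision can be sketched as follows. This is a minimal illustration, not the JS7 implementation: the `Vote` type and the function `standby_may_take_over` are hypothetical names, and the sketch assumes that "common ground" means both the standby Controller instance and the Cluster Watch Agent must consider the active instance dead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vote:
    """One participant's view of the active Controller instance (hypothetical type)."""
    voter: str
    active_considered_dead: bool


def standby_may_take_over(standby: Vote, cluster_watch: Vote) -> bool:
    """Return True only if both voters consider the active instance dead.

    Assumption: both votes are required ("common ground"); a single vote
    is not enough to promote the standby instance.
    """
    return standby.active_considered_dead and cluster_watch.active_considered_dead


if __name__ == "__main__":
    # Both voters lost their connection to the active instance: take-over is agreed.
    print(standby_may_take_over(Vote("standby", True), Vote("cluster-watch", True)))
    # Only the standby lost its connection, for example due to a network issue
    # between the two Controller instances: no common ground, no take-over.
    print(standby_may_take_over(Vote("standby", True), Vote("cluster-watch", False)))
```

Under this assumption, requiring the arbitrator's vote is what keeps a standby instance that merely lost its own connection from promoting itself while the active instance is still running.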

Cluster Operations

Cluster operations include an automated fail-over and a manual switch-over of the Active Controller Instance.

...

Fail-over occurs when the Active Controller Instance is terminated abnormally. 

No fail-over occurs when

...

...

Fail-over can be invoked by the following actions:

  • The Active Controller Instance is killed, for example
    • for Unix with a SIGKILL signal corresponding to the command: kill -9
    • for Windows with the command: taskkill /F
  • The operating system crashes.
  • In the JS7 - Dashboard the user performs one of the operations: 
    • Active Controller Instance action menu: Abort -> With fail-over
    • Active Controller Instance action menu: Abort and restart -> With fail-over
  • From the command line the user performs one of the operations:
    • controller.sh | .cmd abort
    • controller.sh | .cmd kill
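What makes a kill "abnormal" can be illustrated without a Controller. The sketch below is not JS7 code: it starts a placeholder child process (a sleeping Python interpreter standing in for any long-running process) and sends it SIGKILL, the signal behind kill -9. It assumes a Unix system, and the helper name `kill_and_report` is made up for illustration. A process cannot catch SIGKILL, so it gets no chance to shut down in an orderly manner.

```python
import signal
import subprocess
import sys


def kill_and_report(sig: signal.Signals) -> int:
    """Start a placeholder child process, send it `sig`, and return its exit status."""
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    child.send_signal(sig)
    # On Unix, a negative return code -N means "terminated by signal N".
    return child.wait(timeout=10)


if __name__ == "__main__":
    status = kill_and_report(signal.SIGKILL)
    # True if the child was terminated by SIGKILL rather than exiting on its own.
    print(status == -signal.SIGKILL)
```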

Fail-over happens within a short period of time, typically in 2-3s.

...

No fail-over occurs when

  • the Active Controller Instance is stopped normally from the command line:
    • controller.sh | .cmd stop
  • the operating system is shut down and systemd / init.d or a Windows Service are in place to stop the Controller normally.

Switch-over

Switch-over occurs exclusively when invoked by user intervention.

Switch-over can be invoked by the following actions:

  • In the JS7 - Dashboard the user performs one of the operations: 
    • Active Controller Instance action menu: Abort -> With fail-over
    • Active Controller Instance action menu: Abort and restart -> With fail-over

No switch-over occurs when

  • the Active Controller Instance is stopped normally from the command line:
    • controller.sh | .cmd stop
  • the operating system is shut down and systemd / init.d or a Windows Service are in place to stop the Controller normally.

Switch-over happens within a short period of time, typically in 2-3s.

A Warning to Users trying to implement their own Clustering Mechanism

Users might be tempted to implement their own clustering with Standalone Controller Instances, for example

  • using tools for virtual machine management such as VMware®
  • using Microsoft® Windows Server Cluster

The best advice is not to apply such clustering mechanisms. Reasons include but are not limited to the following issues:

  • The cluster has to guarantee that only one of the two instances is running at any point in time.
    • Should this rule not be observed, both Controller instances will instruct Agents to execute the same workflows and jobs, which results in double job execution.
    • Controller journals will become inconsistent, holding the same orders in different state transitions.
    • The only solution is to drop the journals of both Controller instances, which are available from the state sub-directory, to accept that all orders are lost and to redeploy scheduling objects.
  • There is no simple way to determine whether a Controller instance is fit to manage orders.
    • Performing PID file checks is of limited use: a negative check can prove that a Controller is unavailable, but a positive PID file check does not prove that a Controller instance is working.
    • Log file analysis is pointless. Controllers make heavy use of asynchronous operations when communicating with Agents. The occurrence of error messages in log files does not rule out that the situation recovers within the next few seconds.
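The limitation of PID file checks can be made concrete with a short sketch. It is an illustration under assumptions, not part of JS7: the function name `pid_file_check` and the file name `controller.pid` are hypothetical, and the check relies on the Unix convention that sending signal 0 only tests whether a process exists.

```python
import os
import tempfile
from pathlib import Path


def pid_file_check(pid_file: Path) -> bool:
    """Return True if the PID file names a process that currently exists (Unix only).

    A True result says nothing about whether that process is responsive.
    """
    try:
        pid = int(pid_file.read_text().strip())
    except (OSError, ValueError):
        return False  # missing, unreadable or malformed PID file
    if pid <= 0:
        return False  # non-positive values would address process groups, not one process
    try:
        os.kill(pid, 0)  # signal 0: existence and permission check only
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        pid_file = Path(tmp) / "controller.pid"  # hypothetical file name
        print(pid_file_check(pid_file))  # no PID file yet
        pid_file.write_text(str(os.getpid()))
        print(pid_file_check(pid_file))  # this process exists
```

A `False` result is a usable signal that the instance is gone. A `True` result only shows that some process with that PID exists; it may be hung, or the PID may have been reused by an unrelated process.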