...
Primary and Secondary Controller instances require a single Agent to be available that acts as an arbitrator in case of fail-over and switch-over.
Start-up of Controller Instances
- On start-up both Primary and Secondary Controller instances connect to the Cluster Watch Agent.
- The Cluster Watch Agent gives its vote.
Failure of the Active Controller Instance
- The Cluster Watch Agent knows immediately when the active Controller instance is down.
- The standby Controller instance similarly has a connection to the active Controller instance and knows immediately when this connection is interrupted.
- At this point the standby Controller instance and the Cluster Watch Agent check whether they find “common ground”. This works much like a funeral society: they determine whether they both consider the active Controller instance dead and, after a very short period (1-2s of shedding tears), they cast their two votes on whether the standby Controller instance should now become the active one.
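The voting step described above can be pictured as a two-out-of-three majority decision. The following is only a conceptual sketch, not the Controller's actual implementation: the names `Observation` and `standby_may_take_over` are hypothetical, and the short waiting period before the vote is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """What one cluster member currently believes about the active Controller."""
    observer: str              # hypothetical label, e.g. "standby" or "watch"
    active_is_reachable: bool  # False if the connection to the active instance is lost


def standby_may_take_over(standby: Observation, watch: Observation) -> bool:
    """Return True only if both the standby instance and the watcher lost the active instance.

    Two matching votes out of three cluster members form a majority;
    the unreachable active instance cannot vote.
    """
    votes_for_fail_over = sum(
        1 for obs in (standby, watch) if not obs.active_is_reachable
    )
    return votes_for_fail_over == 2
```

In this model a standby instance that merely loses its own network link to the active instance does not take over, because the watcher still reaches the active instance and withholds its vote.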
Cluster Operations
Cluster operations include an automated fail-over and a manual switch-over of the Active Controller Instance.
...
Fail-over occurs when the Active Controller Instance is terminated abnormally.
No fail-over occurs when
...
...
Fail-over can be invoked by the following actions:
- The Active Controller Instance is killed, for example
- for Unix with a SIGKILL signal corresponding to the command:
kill -9
- for Windows with the command:
taskkill /F
- The operating system crashes.
- In the JS7 - Dashboard the user performs one of the operations:
- Active Controller Instance action menu: Abort -> With fail-over
- Active Controller Instance action menu: Abort and restart -> With fail-over
- From the command line the user performs one of the operations:
controller.sh | .cmd abort
controller.sh | .cmd kill
Fail-over happens within a short period of time, typically in 2-3s.
...
No fail-over occurs when
- the Active Controller Instance is stopped normally from the command line:
controller.sh | .cmd stop
- the operating system is shut down and systemd/init.d or a Windows Service is in place to stop the Controller normally.
Switch-over
Switch-over occurs exclusively when invoked by user intervention.
Switch-over can be invoked by the following actions:
- In the JS7 - Dashboard the user performs one of the operations:
- Active Controller Instance action menu: Abort -> With fail-over
- Active Controller Instance action menu: Abort and restart -> With fail-over
No switch-over occurs when
- the Active Controller Instance is stopped normally from the command line:
controller.sh | .cmd stop
- the operating system is shut down and systemd/init.d or a Windows Service is in place to stop the Controller normally.
Switch-over happens within a short period of time, typically in 2-3s.
A Warning to Users trying to implement their own Clustering Mechanism
Users might be tempted to implement their own clustering with Standalone Controller Instances, for example
- using tools for virtual machine management such as VMware®
- using Microsoft® Windows Server Cluster
The best advice is not to apply such clustering mechanisms. Reasons include but are not limited to the following issues:
- The cluster has to guarantee that only one of the two instances is running at any point in time.
- Should this rule not be observed, both Controller instances will instruct Agents to execute the same workflows and jobs, which results in double job execution.
- Controller journals will become inconsistent, holding the same orders in different state transitions.
- The only solution is to drop both Controller instances' journals that are available from the state sub-directory, to accept that any orders are lost and to redeploy scheduling objects.
- There is no simple way to determine if a Controller instance is not in perfect shape to manage orders.
- Performing PID file checks is of limited use: this can prove the unavailability of a Controller but a positive PID file check does not prove that a Controller instance is working.
- Log file analysis is pointless. Controllers make heavy use of asynchronous operations when it comes to Agents. The occurrence of error messages in log files does not mean that a situation will not be recovered within the next few seconds.
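To illustrate why a PID file check is of limited use, here is a minimal sketch of such a check for Unix systems. The function name and the PID file location are assumptions for illustration and are not taken from the product. The check can only tell whether a process with the recorded PID exists; a `True` result says nothing about whether the Controller instance is able to manage orders.

```python
import os
from pathlib import Path


def pid_file_process_exists(pid_file: Path) -> bool:
    """Return True if the PID recorded in pid_file belongs to an existing process (Unix only)."""
    try:
        pid = int(pid_file.read_text().strip())
    except (OSError, ValueError):
        return False  # missing, unreadable or malformed PID file
    if pid <= 0:
        return False  # never probe process groups or invalid PIDs
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence, no signal is delivered
    except ProcessLookupError:
        return False  # no such process: the instance is certainly not running
    except PermissionError:
        return True  # process exists but belongs to another user
    return True  # process exists; this does NOT prove the instance is working
```

A `False` result reliably indicates that the instance is unavailable, whereas a `True` result is also returned for a hung process or for an unrelated process that reuses the same PID.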