...

  • Fail-over is an automated operation that occurs when the Primary Controller is aborted or killed. Fail-over is applied in case of abnormal termination, see JS7 - Impact of a Controller outage.
  • Switch-over is an operation that is caused by user intervention in JOC Cockpit or by use of the JS7 - REST Web Service API. The switch-over procedure does not require termination of an Active Controller Instance; instead it shifts the active role to the second Controller instance.

...

Cluster Roles

Controller Cluster

The job scheduling documentation frequently contains references to Primary Controller Instances and Secondary Controller Instances. These names may be seen as implying that one Controller Instance is primarily used and one is for backup purposes.

  • The cluster terms Active Controller Instance and Standby Controller Instance are often more significant, regardless of whether the Primary or Secondary Controller Instance is active.
  • A Controller implements an active-passive cluster. However, the term passive is misleading, as the Standby Controller Instance is not passive at all but records any order state transitions occurring in the Active Controller Instance. Both Controller instances hold a journal of order state transitions that is actively synchronized. Fail-over and switch-over will occur only if both Controller instances' journals are in sync.
  • The Cluster presents itself as a single unit to the outside world, i.e. to JOC Cockpit and to Agents.
    • Any operations performed in JOC Cockpit are automatically applied to the Active Controller Instance.
    • At any point in time only one Controller instance is active and the other instance is in standby mode.
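
The distinction between the configured role (Primary, Secondary) and the runtime role (Active, Standby) can be illustrated with a small model. This is a minimal sketch and not JS7 code: the class names, attributes and the `switch_over` method are hypothetical and only mirror the rules stated above.

```python
from dataclasses import dataclass


@dataclass
class ControllerInstance:
    configured_role: str  # "primary" or "secondary": does not change at runtime
    active: bool          # runtime role: True = Active, False = Standby


class ControllerCluster:
    """Hypothetical model: exactly one instance is active at any time."""

    def __init__(self) -> None:
        self.primary = ControllerInstance("primary", active=True)
        self.secondary = ControllerInstance("secondary", active=False)
        self.journals_in_sync = True

    def active_instance(self) -> ControllerInstance:
        return self.primary if self.primary.active else self.secondary

    def switch_over(self) -> bool:
        # The active role is shifted only if both journals are in sync.
        if not self.journals_in_sync:
            return False
        self.primary.active, self.secondary.active = (
            self.secondary.active,
            self.primary.active,
        )
        return True


cluster = ControllerCluster()
cluster.switch_over()
# The Secondary Controller Instance is now the Active Controller Instance.
print(cluster.active_instance().configured_role)
```

In this model the configured role never changes, whereas the active role moves between the instances, which is why the Active and Standby terms describe the current state of the cluster more accurately.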

...

Operation of Cluster Watch Agent

The above explanations suggest that a Cluster Watch Agent should never run on the hosts that the Primary and Secondary Controller instances are operated on.

  • If the Cluster Watch Agent is terminated at the same time as a failed Active Controller Instance then no fail-over can occur.
  • If the Cluster Watch Agent is terminated at the same time as one of the Controller instances then the Controller Cluster cannot start up as this requires operational readiness of the Cluster Watch Agent.
  • A Cluster Watch Agent that is started after failure of the Active Controller Instance is disqualified from casting its vote as it has no knowledge of whether the Controller instances' journals are in sync.
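
The effect of host placement can be sketched as a simplified decision function. The function name, parameters and return values are hypothetical; the sketch only covers the loss of a single host and assumes that the Cluster Watch Agent was running before the failure occurred.

```python
def outcome_of_host_loss(
    failed_host: str,
    active_controller_host: str,
    watch_agent_host: str,
    journals_in_sync: bool = True,
) -> str:
    """Return "not needed", "fail-over" or "blocked" (simplified model)."""
    if failed_host != active_controller_host:
        # The Active Controller Instance is still running.
        return "not needed"
    if failed_host == watch_agent_host:
        # The Cluster Watch Agent terminated together with the Active
        # Controller Instance, so no fail-over can occur.
        return "blocked"
    # Fail-over additionally requires that both journals are in sync.
    return "fail-over" if journals_in_sync else "blocked"


# Cluster Watch Agent on its own host: fail-over is possible.
print(outcome_of_host_loss("host-a", "host-a", "host-c"))
# Cluster Watch Agent on the Active Controller's host: fail-over is blocked.
print(outcome_of_host_loss("host-a", "host-a", "host-a"))
```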

Cluster Operations

...

  • The Active Controller Instance is killed, for example:
    • on Unix with a SIGKILL signal corresponding to the command: kill -9
    • on Windows with the command: taskkill /F
  • The operating system crashes.
  • In the JS7 - Dashboard the user performs one of the operations: 
    • Active Controller Instance action menu: Abort -> With fail-over
    • Active Controller Instance action menu: Abort and restart -> With fail-over



  • From the command line the user performs one of the operations:
    • controller.sh | .cmd abort
    • controller.sh | .cmd kill

Fail-over will not occur when:

  • the Active Controller Instance is stopped normally from the command line:
    • controller.sh | .cmd stop
  • the operating system is shut down and systemd / init.d or a Windows Service are in place to stop the Controller normally.

...

  • In the JS7 - Dashboard the user performs one of the operations: 
    • Active Controller Instance action menu: Terminate -> With switch-over
    • Active Controller Instance action menu: Terminate and restart -> With switch-over
    • Cluster action menu: Switch-over



Switch-over will not occur when:

  • the Active Controller Instance is stopped normally from the command line:
    • controller.sh | .cmd stop
  • the operating system is shut down and systemd / init.d or a Windows Service are in place to stop the Controller normally.
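
The operations listed above can be summarized as a lookup table. This is an illustrative sketch only: the table structure and the `expected_outcome` function are hypothetical, the entries are limited to the operations named on this page, and the page may list further operations in parts not shown here.

```python
# Maps (source, operation) to the resulting cluster operation.
OUTCOMES = {
    ("os", "kill -9"): "fail-over",
    ("os", "taskkill /F"): "fail-over",
    ("dashboard", "Abort -> With fail-over"): "fail-over",
    ("dashboard", "Abort and restart -> With fail-over"): "fail-over",
    ("dashboard", "Terminate -> With switch-over"): "switch-over",
    ("dashboard", "Terminate and restart -> With switch-over"): "switch-over",
    ("dashboard", "Switch-over"): "switch-over",
    ("cli", "abort"): "fail-over",
    ("cli", "kill"): "fail-over",
    ("cli", "stop"): "none",  # normal stop: neither fail-over nor switch-over
}


def expected_outcome(source: str, operation: str) -> str:
    """Return the outcome for a listed operation, or "unknown" otherwise."""
    return OUTCOMES.get((source, operation), "unknown")


print(expected_outcome("cli", "abort"))  # fail-over
print(expected_outcome("cli", "stop"))   # none
```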

...

Users might be tempted to implement their own clustering with Standalone Controller Instances, for example:

  • using tools for virtual machine management such as VMware®,
  • using Microsoft® Windows Server Cluster or similar cluster solutions.

...

  • The cluster has to guarantee that only one of the two Standalone Controller instances is started at any point in time.
    • If this rule is not observed then both Controller instances will instruct Agents to execute the same workflows and jobs, which will result in double job execution.
    • Controller journals will be messed up with the same orders in different state transitions.
    • In this situation the only solution is to drop both Controller instances' journals that are available from the state sub-directory, to accept that any orders are lost and to redeploy scheduling objects.
  • There is no simple way to determine if a Controller instance is not in perfect condition to manage orders.
    • Performing PID file checks is of limited use: this can prove the unavailability of a Controller instance. However, a positive PID file check does not prove that a Controller instance is working.
    • Log file analysis is pointless. Controllers make heavy use of asynchronous operations when it comes to Agents. Error messages in log files can refer to situations that are recovered within the next few seconds.
  • A Controller Cluster guarantees high availability when used with a JS7 - Agent Cluster. Use of Standalone Agents limits high availability.
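
The limits of a PID file check can be seen in a minimal sketch. The function name and the PID file path in the usage example are hypothetical, and the check is Unix-only, as it relies on signal 0, which tests for the existence of a process without delivering a signal.

```python
import os
from pathlib import Path


def pid_file_check(pid_file: Path) -> bool:
    """Return True if the PID file names an existing process (Unix only)."""
    try:
        pid = int(pid_file.read_text().strip())
    except (FileNotFoundError, ValueError):
        return False  # no usable PID file: the instance is not available
    if pid <= 0:
        return False  # non-positive values would address process groups
    try:
        os.kill(pid, 0)  # signal 0: existence check, no signal is delivered
    except ProcessLookupError:
        return False  # no process with this PID
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


# Hypothetical path. A True result only shows that a process with this PID
# exists, not that the Controller instance is able to manage orders.
print(pid_file_check(Path("/tmp/controller.pid")))
```

A negative result reliably indicates that the instance is unavailable. A positive result says nothing about the instance's ability to process orders, and the PID might even have been reused by an unrelated process.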

...