Introduction

Use of a JS7 - Controller Cluster provides high availability and is a feature subject to the JS7 - License.

  • Fail-over is an automated operation that occurs when the Primary Controller is aborted or killed. Fail-over is applied in case of abnormal termination only, see JS7 - Impact of a Controller outage.
  • Switch-over is an operation that is caused by user intervention in JOC Cockpit or by use of the JS7 - REST Web Service API. Switch-over does not require termination of the Active Controller Instance; instead it shifts the active role to the Standby Controller Instance.

Fail-over and switch-over require the Cluster Watch role, which acts as an arbitrator in situations when the Controller Cluster cannot determine the active instance. Either JOC Cockpit or an Agent can be assigned the Cluster Watch role.

For command line references see the JS7 - Controller - Command Line Operation article.

Cluster Roles

Controller Cluster

The documentation frequently contains references to Primary Controller Instances and Secondary Controller Instances. These names may be seen as implying that one Controller Instance is primarily used and one is for backup purposes.

  • In cluster terms the Active Controller Instance and the Standby Controller Instance are often more significant, regardless of whether it is the Primary or the Secondary Controller Instance which is active.
  • A Controller implements an active-passive cluster. However, the term passive is misleading as the Standby Controller Instance is not passive at all but records any order state transitions occurring in the Active Controller Instance. Both Controller instances hold a journal of order state transitions that is actively synchronized. Fail-over and switch-over will occur only if both Controller instances' journals are in sync.
  • The Cluster presents itself as a single unit to the outside world, i.e. to JOC Cockpit and to Agents.
    • Any operations performed in JOC Cockpit are automatically applied to the Active Controller Instance.
    • At any point in time only one Controller instance is active and the other instance is in standby mode.

Cluster Watch

...

Role

Primary and Secondary Controller instances require JOC Cockpit to act as Cluster Watch, i.e. as an arbitrator in case of fail-over and switch-over.

Start-up of Controller Cluster

Connections

  • If JOC Cockpit is acting as Cluster Watch then JOC Cockpit will establish a connection on start-up of the Primary and Secondary Controller instances.
  • If an Agent is acting as Cluster Watch then on start-up both Primary and Secondary Controller instances will establish a connection to the Cluster Watch Agent.

Proceeding

  • The Cluster Watch casts its vote on which Controller instance owns the leading journal.
  • Start-up of a Controller Cluster is not possible with the Cluster Watch being unavailable.
  • Start-up of the previously Active Controller Instance without the Standby Controller Instance being available is possible if the Cluster Watch is active.

Failure of the Active Controller Instance

In case of failure of the Active Controller Instance the Cluster Watch plays its role as an arbitrator:

  • The Cluster Watch knows immediately when the Active Controller Instance is down due to a connection loss from this instance.
  • The Standby Controller Instance holds a connection to the Active Controller Instance and knows immediately when this connection is lost.
  • Failure of the Active Controller Instance is the point in time when the Standby Controller Instance and the Cluster Watch check to find common ground about a cluster fail-over operation: they determine if they should declare the Active Controller Instance inoperable and, after a very short period of 1-2s, they cast their two votes on whether the Standby Controller Instance should now become the Active Controller Instance.
  • As a prerequisite for fail-over both the Cluster Watch and the Standby Controller Instance have to confirm that the Standby Controller Instance's journal was in sync with the Active Controller Instance at the point in time of failure.
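The voting rule described above can be modelled in a few lines. This is a minimal sketch for illustration only and not the JS7 implementation: the `Vote` class, its field names and the `should_fail_over` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vote:
    """Hypothetical view of one voter (Cluster Watch or Standby Controller Instance)."""
    lost_connection_to_active: bool  # the voter noticed the connection loss
    journals_were_in_sync: bool      # the voter confirms the journals were in sync at failure


def should_fail_over(cluster_watch: Vote, standby: Vote) -> bool:
    """Fail over only if both voters see the failure and confirm in-sync journals."""
    voters = (cluster_watch, standby)
    return all(v.lost_connection_to_active and v.journals_were_in_sync for v in voters)
```

In this model a single dissenting vote, for example a Cluster Watch that cannot confirm the journals were in sync, is sufficient to prevent fail-over.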

Operation of Cluster Watch

Operation

  • If JOC Cockpit is assigned the Cluster Watch role then fail-over capabilities of JOC Cockpit apply.
  • If an Agent is assigned the Cluster Watch role (available for earlier releases of JS7 until branch 2.5) then the above explanations suggest that the Agent should never be run on the hosts that the Primary and Secondary Controller instances are operated on.

Proceeding

  • If the Cluster Watch is terminated at the same time as a failed Active Controller Instance then no fail-over can occur.
  • If the Cluster Watch is terminated at the same time as one of the Controller instances then the Controller Cluster cannot start up as this requires operational readiness of the Cluster Watch.
  • A Cluster Watch that is started after failure of the Active Controller Instance is disqualified from casting its vote as it has no knowledge of whether the Controller instances' journals are in sync.
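The eligibility rule for the Cluster Watch can be sketched as a simple comparison of points in time. The function name, its parameters and the use of plain numeric timestamps are assumptions made for illustration; JS7 does not expose such a function.

```python
from typing import Optional


def watch_may_vote(watch_started_at: Optional[float], active_failed_at: float) -> bool:
    """Return True if the Cluster Watch was already running when the active instance failed.

    watch_started_at is None when no Cluster Watch is running.
    Timestamps are hypothetical seconds since an arbitrary epoch.
    """
    if watch_started_at is None:
        return False  # no Cluster Watch available: no fail-over can occur
    # A Cluster Watch started at or after the failure has no knowledge of whether
    # the journals were in sync and is disqualified from casting its vote.
    return watch_started_at < active_failed_at
```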

High Availability Setup

For high availability setup with two server nodes the following distribution of active and standby JS7 products should be applied:

Server 1 | Server 2
Active JOC Cockpit Instance | Standby JOC Cockpit Instance
Standby Controller Instance | Active Controller Instance

Cluster Operations

Cluster operations include an automated fail-over and a manual switch-over of the Active Controller Instance.

Fail-over

Fail-over occurs when the Active Controller Instance is terminated abnormally. 

Fail-over can be invoked by the following actions:

  • The Active Controller Instance is killed, for example:
    • on Unix with a SIGKILL signal corresponding to the command: kill -9
    • on Windows with the command: taskkill /F
  • The operating system crashes.
  • In the JS7 - Dashboard the user performs one of the operations: 
    • Active Controller Instance action menu: Abort -> With fail-over
    • Active Controller Instance action menu: Abort and restart -> With fail-over


  • From the command line the user performs one of the operations:
    • controller_instance.sh | .cmd kill
    • controller_instance.sh | .cmd abort

Fail-over will not occur when:

  • the Active Controller Instance is stopped normally from the command line:
    • controller_instance.sh | .cmd stop
  • the Active Controller Instance is restarted normally from the command line:
    • controller_instance.sh | .cmd restart
  • the operating system is shut down normally and systemd / init.d or a Windows Service are in place to stop the Controller normally.
  • the Active JOC Cockpit Instance is not running as it holds the Cluster Watch role that is required for fail-over.

Fail-over happens within a short period of time, typically in 2-3s.

Switch-over

Switch-over occurs exclusively when invoked by user intervention.

Switch-over can be invoked by the following actions:

  • In the JS7 - Dashboard the user performs one of the operations: 
    • Active Controller Instance action menu: Terminate -> With switch-over
    • Active Controller Instance action menu: Terminate and restart -> With switch-over
    • Cluster action menu: Switch-over
  • From the command line the user performs the operation:
    • controller_instance.sh | .cmd switch-over



Switch-over will not occur when:

  • the Active Controller Instance is stopped normally from the command line:
    • controller_instance.sh | .cmd stop
  • the Active Controller Instance is restarted normally from the command line:
    • controller_instance.sh | .cmd restart
  • the operating system is shut down normally and systemd / init.d or a Windows Service are in place to stop the Controller normally.

Switch-over happens within a short period of time, typically in 2-3s.
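The command line operations listed in the Fail-over and Switch-over sections can be summarized as a lookup table. The sketch below restates this article's lists only. The dictionary and the function name are illustrative, and the outcome "fail-over" still depends on an available Cluster Watch and on journals that are in sync.

```python
# Outcome of controller_instance.sh | .cmd operations applied to the
# Active Controller Instance, as listed in this article (illustrative only).
OUTCOME_BY_OPERATION = {
    "kill": "fail-over",
    "abort": "fail-over",
    "switch-over": "switch-over",
    "stop": "none",
    "restart": "none",
}


def cluster_outcome(operation: str) -> str:
    """Return the cluster operation that a command line operation triggers."""
    try:
        return OUTCOME_BY_OPERATION[operation]
    except KeyError:
        raise ValueError(f"operation not covered by this article: {operation!r}") from None
```

For example, `cluster_outcome("stop")` returns `"none"` as a normal stop triggers neither fail-over nor switch-over.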

...

Users might be tempted to implement their own clustering with Standalone Controller Instances, for example:

  • using tools for virtual machine management such as VMware®,
  • using Microsoft® Windows Server Cluster or similar cluster solutions.

The best advice is not to apply such automated clustering mechanisms, but to perform manual switch-over. Reasons include but are not limited to the following issues:

  • The cluster has to guarantee that only one of the two Standalone Controller instances is started at any point in time.
    • If this rule is not observed then both Controller instances will instruct Agents to execute the same workflows and jobs which can result in double job execution.
    • Controller journals will become inconsistent, holding the same orders in different state transitions.
    • In this situation the only solution is to drop both Controller instances' journals that are available from the state sub-directory, to accept that any orders are lost and to redeploy scheduling objects.
  • There is no simple way to determine if a Controller instance is not in perfect condition to manage orders.
    • Performing PID file checks is of limited use: this can prove the unavailability of a Controller instance. However, a positive PID file check does not prove that a Controller instance is working.
    • Log file analysis is pointless. Controllers make use of asynchronous operations when it comes to Agents. The occurrence of error messages in log files does not rule out that a situation is recovered within the next few milliseconds.
  • A Controller Cluster guarantees high availability when used with a JS7 - Agent Cluster. Use of Standalone Agents limits high availability.
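The limited value of PID file checks mentioned above can be illustrated with a short sketch. It assumes a Unix system and a PID file that holds a single process ID; the file location and the function name are hypothetical and not taken from JS7.

```python
import os
from pathlib import Path


def pid_file_check(pid_file: Path) -> str:
    """Classify a PID file check as 'unavailable' or 'process exists' (Unix only).

    'process exists' only means that a process with this PID is present. It does
    not prove that the Controller instance is able to manage orders.
    """
    try:
        pid = int(pid_file.read_text().strip())
    except (OSError, ValueError):
        return "unavailable"      # no readable PID file
    if pid <= 0:
        return "unavailable"      # not a valid process ID
    try:
        os.kill(pid, 0)           # signal 0 checks for existence, nothing is sent
    except ProcessLookupError:
        return "unavailable"      # no process with this PID
    except PermissionError:
        return "process exists"   # process exists but is owned by another user
    return "process exists"
```

The check deliberately returns "process exists" rather than "working": a hung Controller instance passes it just as a healthy one does, which is why such a check is not a sound basis for automated clustering.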

Further Resources