...

Primary and Secondary Controller instances require a single Agent to be available that acts as an arbitrator in case of fail-over and switch-over.

Start-up of Controller

...

Cluster

On start-up both Primary and Secondary Controller instances connect to the Cluster Watch Agent.

  • The Cluster Watch Agent casts its vote on which Controller instance owns the leading journal.
  • Start-up of a Controller Cluster is not possible while the Cluster Watch Agent is unavailable.
  • Start-up of the previously Active Controller Instance without the Standby Controller Instance being available is possible if the Cluster Watch Agent is active.
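The start-up rules above can be sketched as a small predicate. This is an illustration only and not part of the product: `ClusterView` and `can_start_previously_active` are hypothetical names, and the sketch covers only the case that the text describes, namely start-up of the previously Active Controller Instance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterView:
    """Hypothetical snapshot of which cluster members are reachable."""
    watch_agent_available: bool
    standby_available: bool


def can_start_previously_active(view: ClusterView) -> bool:
    """Return True if the previously Active Controller Instance may start.

    Per the rules above, the Cluster Watch Agent is mandatory for start-up,
    whereas the Standby Controller Instance is not.
    """
    return view.watch_agent_available


# Standby missing, Cluster Watch Agent present: start-up is possible.
print(can_start_previously_active(ClusterView(True, False)))
# Cluster Watch Agent missing: start-up is not possible.
print(can_start_previously_active(ClusterView(False, True)))
```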

Failure of the Active Controller Instance

In case of failure of the Active Controller Instance the Cluster Watch Agent plays its role as an arbitrator:

  • The Cluster Watch Agent knows immediately when the Active Controller Instance is down due to a connection loss from this instance.
  • The Standby Controller Instance holds a connection to the Active Controller Instance and knows immediately when this connection is lost.
  • Failure of the Active Controller Instance is the point in time when the Standby Controller Instance and the Cluster Watch Agent check to find common ground about a cluster fail-over operation: they determine whether they should declare the Active Controller Instance inoperable, and after a very short period of 2-3s they proceed and cast their two votes on whether the Standby Controller Instance should now become the Active Controller Instance.
  • As a prerequisite for fail-over, both the Cluster Watch Agent and the Standby Controller Instance have to confirm that the Standby Controller Instance's journal was in sync with the Active Controller Instance at the point in time of failure.
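The voting logic described above might be modelled roughly as follows. This is a minimal sketch under stated assumptions and not the Controller's actual implementation: `Vote`, `should_fail_over` and the `grace_period_s` parameter are hypothetical names, and the default of 2 seconds is taken from the lower bound of the 2-3s period mentioned in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vote:
    """Hypothetical view of one voter: Cluster Watch Agent or Standby Controller Instance."""
    lost_connection_to_active: bool
    confirms_journal_in_sync: bool


def should_fail_over(watch_agent: Vote, standby: Vote,
                     elapsed_s: float, grace_period_s: float = 2.0) -> bool:
    """Return True only if both voters agree on a fail-over.

    Both voters must have lost their connection to the Active Controller
    Instance, both must confirm that the Standby journal was in sync, and
    the short waiting period (an assumed 2s here) must have passed.
    """
    if elapsed_s < grace_period_s:
        return False
    return all(
        voter.lost_connection_to_active and voter.confirms_journal_in_sync
        for voter in (watch_agent, standby)
    )


agreeing = Vote(lost_connection_to_active=True, confirms_journal_in_sync=True)
out_of_sync = Vote(lost_connection_to_active=True, confirms_journal_in_sync=False)

print(should_fail_over(agreeing, agreeing, elapsed_s=3.0))     # both votes cast
print(should_fail_over(agreeing, out_of_sync, elapsed_s=3.0))  # journal not confirmed
```

Requiring both votes means that a single voter that merely lost its own network connection cannot trigger a fail-over on its own.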

Operation of Cluster Watch Agent

The above explanations imply that a Cluster Watch Agent must never run on the hosts on which the Primary and Secondary Controller instances are operated.

  • If the Cluster Watch Agent is terminated at the same time as one of the Controller instances then the Controller Cluster cannot start up as this requires operational readiness of the Cluster Watch Agent.
  • A Cluster Watch Agent that is started after failure of the Active Controller Instance is disqualified from casting its vote as it has no knowledge of whether the Controller instances' journals are in sync.

Cluster Operations

Cluster operations include an automated fail-over and a manual switch-over of the Active Controller Instance.

...

  • The cluster has to guarantee that only one of the two Controller instances is started at any point in time.
    • Should this rule not be observed, both Controller instances will instruct Agents to execute the same workflows and jobs, which will result in double job execution.
    • Controller journals will become inconsistent, holding the same orders in different state transitions.
    • The only solution is to drop both Controller instances' journals that are available from the state sub-directory, to accept that any orders are lost and to redeploy scheduling objects.
  • There is no simple way to determine whether a Controller instance is in good enough shape to manage orders.
    • Performing PID file checks is of limited use: this can prove the unavailability of a Controller instance. However, a positive PID file check does not prove that a Controller instance works.
    • Log file analysis is pointless. Controllers make heavy use of asynchronous operations when communicating with Agents. The occurrence of error messages in log files does not rule out that the situation is recovered within the next few seconds.
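The asymmetry of PID file checks can be illustrated with a short sketch. It assumes a POSIX system, where signal 0 only tests for the existence of a process, and a PID file that holds a single process ID. The function name and the file name `controller.pid` are hypothetical.

```python
import os
from pathlib import Path


def pid_file_check(pid_file: Path) -> str:
    """Classify a PID file as 'unavailable' or 'process exists' (POSIX only).

    'unavailable' is a reliable finding. 'process exists' only states that
    some process with this PID is running; it does not prove that the
    Controller instance is able to manage orders.
    """
    try:
        pid = int(pid_file.read_text().strip())
    except (FileNotFoundError, ValueError):
        return "unavailable"  # no PID file or unreadable content
    try:
        os.kill(pid, 0)  # signal 0: existence check only, nothing is sent
    except ProcessLookupError:
        return "unavailable"  # no process with this PID
    except PermissionError:
        pass  # the process exists but belongs to another user
    return "process exists"


# Hypothetical file name: a missing PID file proves unavailability.
print(pid_file_check(Path("controller.pid")))
```

A result of "process exists" says nothing about a hung instance or an instance that has lost its connections, which is why the check cannot be used to decide whether a Controller instance works.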

...