...

Functioning of the Watchdog (Heart_beat_watchdog_thread)

A Watchdog is started automatically each time a JobScheduler is started as part of a cluster. Each Watchdog runs as a separate thread alongside its respective JobScheduler and monitors that JobScheduler's heartbeat. The Watchdog stops its JobScheduler if the JobScheduler's heartbeat is missing for a predefined length of time.

The JobScheduler heartbeat is an entry written in the database every 60 seconds. In addition to the monitoring by a JobScheduler's own Watchdog, all JobSchedulers in the cluster also monitor each other's heartbeats. One of these JobSchedulers will issue a warning if the heartbeat of a JobScheduler is more than 3 seconds late. This warning has no further consequences.

The other cluster members will recognise that a JobScheduler must have failed if two heartbeats in series are missing. They will then take over its work.
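The way one cluster member might judge another member's heartbeat can be sketched as follows. This is a minimal illustration and not JobScheduler's actual implementation: the function name peer_view and the constants are hypothetical, and it assumes that "two heartbeats in series are missing" corresponds to 120 seconds having passed since the last heartbeat entry.

```python
HEARTBEAT_INTERVAL = 60   # seconds between heartbeat entries (from the text)
WARNING_LATENESS = 3      # a warning is issued when a heartbeat is more than 3 s late
MISSED_FOR_FAILURE = 2    # two missing heartbeats in series indicate a failure


def peer_view(seconds_since_last_heartbeat):
    """Classify a cluster member as seen by another member (hypothetical helper)."""
    # Assumption: two missed heartbeats means two full intervals have elapsed.
    if seconds_since_last_heartbeat >= MISSED_FOR_FAILURE * HEARTBEAT_INTERVAL:
        return "failed"    # the other members take over its work
    lateness = seconds_since_last_heartbeat - HEARTBEAT_INTERVAL
    if lateness > WARNING_LATENESS:
        return "warning"   # logged only, no further consequences
    return "ok"


if __name__ == "__main__":
    for elapsed in (30, 64, 120):
        print(elapsed, peer_view(elapsed))
```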

The Watchdog ensures that the missing heartbeats are not a temporary phenomenon and starts the termination process for the JobScheduler in question shortly before the two minute deadline of the other cluster members is reached (after 115 seconds). This is done on the assumption that the remaining cluster members are in the process of taking over the tasks of the terminated JobScheduler, and it ensures that tasks are not carried out by two JobSchedulers at the same time.
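The timing relationship described above can be illustrated with a small sketch. It is not the Heart_beat_watchdog_thread code itself: the names run_watchdog, read_last_heartbeat and terminate are hypothetical, and the sketch assumes that the 115 seconds are counted from the last heartbeat entry. The clock and sleep functions are injectable so that the loop can be exercised without waiting in real time.

```python
import time

WATCHDOG_DEADLINE = 115   # seconds after which the Watchdog starts termination
PEER_DEADLINE = 120       # two minute deadline of the other cluster members


def watchdog_should_terminate(last_heartbeat, now):
    """True once the heartbeat has been missing for the Watchdog deadline."""
    return now - last_heartbeat >= WATCHDOG_DEADLINE


def run_watchdog(read_last_heartbeat, terminate,
                 clock=time.monotonic, sleep=time.sleep, poll=1.0):
    """Poll the heartbeat and terminate the JobScheduler before peers take over.

    read_last_heartbeat and terminate are hypothetical callbacks: the first
    returns the time of the last heartbeat, the second stops the JobScheduler.
    """
    while True:
        if watchdog_should_terminate(read_last_heartbeat(), clock()):
            terminate()
            return
        sleep(poll)
```

Because WATCHDOG_DEADLINE is smaller than PEER_DEADLINE, the failed JobScheduler in this sketch is already being stopped when the other members begin to take over its tasks, which is the property the text relies on to avoid tasks running twice.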

...

A JobScheduler determines that its own heartbeat is missing 31 seconds after it was due. The warning is issued after a further delay of 3 seconds. The maximum delay that is tolerated is 55 seconds.

...