

Functioning of the Watchdog (heart_beat_watchdog_thread)

A watchdog is started automatically each time a JobScheduler is started as part of a cluster. Each watchdog runs as a separate thread alongside its respective JobScheduler and monitors that JobScheduler's heartbeat. The watchdog stops its JobScheduler after the JobScheduler's heartbeat has been missing for a predefined length of time.
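The general pattern of a watchdog thread monitoring a heartbeat can be illustrated with a minimal Python sketch. It is not the JobScheduler implementation: the class name HeartbeatWatchdog, its parameters and the in-memory timestamp are assumptions made for illustration, whereas the real heartbeat is a database entry and the real watchdog terminates the JobScheduler process.

```python
import threading
import time


class HeartbeatWatchdog(threading.Thread):
    """Hypothetical watchdog: calls on_timeout once the heartbeat is too old."""

    def __init__(self, max_age_s, on_timeout, poll_s=0.01):
        super().__init__(daemon=True)
        self._max_age_s = max_age_s      # tolerated age of the last heartbeat
        self._on_timeout = on_timeout    # callback that stands in for "abort"
        self._poll_s = poll_s            # how often the watchdog checks
        self._lock = threading.Lock()
        self._last_beat = time.monotonic()
        self._stopped = threading.Event()

    def beat(self):
        """Called by the monitored component to record a heartbeat."""
        with self._lock:
            self._last_beat = time.monotonic()

    def stop(self):
        """Ends the watchdog without triggering the timeout callback."""
        self._stopped.set()

    def run(self):
        # wait() returns False on timeout, so the loop polls until stop() is called.
        while not self._stopped.wait(self._poll_s):
            with self._lock:
                age = time.monotonic() - self._last_beat
            if age > self._max_age_s:
                self._on_timeout(age)
                return
```

In this sketch the monitored component would call beat() regularly. If it stops doing so, the watchdog thread notices independently of the component's own main loop, which is the reason for running it as a separate thread.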

The JobScheduler heartbeat is an entry written to the database every 60 seconds. In addition to the monitoring by a JobScheduler's own watchdog, all JobSchedulers in the cluster also monitor each other's heartbeats. One of these JobSchedulers will issue a warning if the heartbeat of a JobScheduler is more than 3 seconds late. This warning has no further consequences.

The other cluster members will recognise that a JobScheduler must have failed if two consecutive heartbeats are missing. They will then take over its work.

The watchdog ensures that the missing heartbeats are not a temporary phenomenon and starts the termination process for the JobScheduler in question shortly before the two-minute deadline of the other cluster members is reached (after 115 seconds). This is done on the assumption that the remaining cluster members are in the process of taking over the tasks of the terminated JobScheduler, and it ensures that tasks are not carried out by two JobSchedulers at the same time.

This behaviour cannot be configured as it is an "emergency" procedure to ensure the reliable functioning of the cluster.
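The timings described above can be summarised in a short Python sketch. The constant and function names are hypothetical and are not JobScheduler settings (the behaviour cannot be configured). The exact boundary comparisons (">" versus ">=") are also assumptions, as the text only gives the values 3, 115 and 120 seconds.

```python
HEARTBEAT_INTERVAL_S = 60                  # a heartbeat is written every 60 seconds
WARN_LATE_S = 3                            # warning once a heartbeat is more than 3 s late
WATCHDOG_ABORT_AGE_S = 115                 # watchdog terminates its own JobScheduler
TAKEOVER_AGE_S = 2 * HEARTBEAT_INTERVAL_S  # two missed heartbeats: other members take over


def heartbeat_state(age_s: float) -> str:
    """Classify the situation by the age of the last heartbeat in seconds."""
    late_s = age_s - HEARTBEAT_INTERVAL_S  # how overdue the next heartbeat is
    if age_s >= TAKEOVER_AGE_S:
        return "takeover"
    if age_s >= WATCHDOG_ABORT_AGE_S:
        return "watchdog_abort"
    if late_s > WARN_LATE_S:
        return "warning"
    return "ok"


print(heartbeat_state(60))   # ok
print(heartbeat_state(91))   # warning
print(heartbeat_state(115))  # watchdog_abort
print(heartbeat_state(120))  # takeover
```

The 5 second gap between the watchdog limit (115 s) and the takeover deadline (120 s) is what keeps two JobSchedulers from working on the same tasks at once.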

Possible reasons for a missing heartbeat

  • Database problems
  • Problems with the SMTP mail server
  • DNS problems
  • A heavily overloaded computer (e.g. lack of memory)
  • A change in system time

...

Output to the log file scheduler.log

JobScheduler determines that its own heartbeat is missing 31 seconds after it was due. The warning is issued after a further delay of 3 seconds. The maximum delay that is tolerated is 55 seconds.


 2013-09-12 12:26:18.230 [WARN]   (Cluster) 
 SCHEDULER-827  Own heart beat is late: next_heart_beat has been announced for 2013-09-12 12:25:47 
 (this is 31 seconds late)

The following message comes from the watchdog. It has realised that the last heartbeat from its JobScheduler was 136 seconds ago. The tolerance of 55 seconds delay (115 seconds since the last heartbeat) has been exceeded.
The watchdog terminates the JobScheduler before another cluster member can take over operation. This ensures that the JobScheduler taking over does not end up, possibly later, operating in parallel with the original JobScheduler.


 2013-09-12 12:28:19.393 [ERROR]  (Heart_beat_watchdog_thread) 
 SCHEDULER-386  Last heart beat was 2013-09-12 12:26:03, 136 seconds ago. Something is delaying
    Scheduler execution, the Scheduler is aborted immediately

The following message will be issued if a JobScheduler should realise it is late and manage to send out a heartbeat during the termination process:


 2013-09-12 12:28:20.546 [WARN]   (Cluster) 
 SCHEDULER-827  Own heart beat is late: next_heart_beat has been announced for 2013-09-12 12:27:03 
 (this is 77 seconds late)
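The second counts in the three log messages above can be checked against their timestamps. The following Python sketch is only an illustration of that arithmetic. The helper name seconds_between is hypothetical, and dropping the milliseconds of the log timestamp is an assumption that happens to match the values shown.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def seconds_between(earlier: str, later: str) -> int:
    """Whole seconds from `earlier` to `later`; milliseconds in `later` are dropped."""
    start = datetime.strptime(earlier, FMT)
    end = datetime.strptime(later[:19], FMT)  # keep "YYYY-MM-DD HH:MM:SS" only
    return int((end - start).total_seconds())


# SCHEDULER-827: heartbeat announced for 12:25:47, warning logged at 12:26:18
print(seconds_between("2013-09-12 12:25:47", "2013-09-12 12:26:18.230"))  # 31
# SCHEDULER-386: last heartbeat at 12:26:03, abort logged at 12:28:19
print(seconds_between("2013-09-12 12:26:03", "2013-09-12 12:28:19.393"))  # 136
# SCHEDULER-827: heartbeat announced for 12:27:03, warning logged at 12:28:20
print(seconds_between("2013-09-12 12:27:03", "2013-09-12 12:28:20.546"))  # 77
```

The announced time 12:27:03 in the last message is exactly 60 seconds after the last heartbeat at 12:26:03, which agrees with the 60 second heartbeat interval.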

See also

...