Functioning of the Watchdog (heart_beat_watchdog_thread)

A watchdog is started automatically each time a JobScheduler is started as part of a cluster. Each watchdog runs as a separate thread alongside its respective JobScheduler and monitors that JobScheduler's heartbeat. The watchdog stops its JobScheduler after the JobScheduler's heartbeat has been missing for a predefined length of time.

The JobScheduler heartbeat is a sign of life that the JobScheduler writes to the database every 60 seconds. In addition to the monitoring by the JobScheduler's own watchdog, all JobSchedulers in the cluster monitor each other's heartbeats. A warning is issued if the heartbeat of a JobScheduler is 3 seconds or more late. This warning has no further consequences.

If two consecutive heartbeats are missing (that is, no heartbeat for two minutes), the other cluster members conclude that the JobScheduler has failed and take over its work.

The watchdog, in turn, makes sure that a missing heartbeat cannot turn out to be just a temporary phenomenon: it starts the termination process for its JobScheduler shortly before the two-minute deadline of the other cluster members is reached (after 115 seconds). It has to assume that the remaining cluster members are in the process of taking over the tasks of its JobScheduler. This prevents two JobSchedulers from carrying out the same tasks at the same time and avoids the problems caused by jobs running twice.
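The thresholds described above can be summarised in a short Python sketch. It is not JobScheduler code: the constant and function names are hypothetical, and the handling of the exact boundary values (`>=` rather than `>`) is an assumption, since the text only gives the figures of 60, 3, 115 and 120 seconds.

```python
# Illustrative sketch only; names are hypothetical, not JobScheduler identifiers.
HEARTBEAT_PERIOD = 60                    # a heartbeat is written every 60 seconds
WARN_DELAY = 3                           # a delay of 3 seconds triggers a warning
WATCHDOG_ABORT_AFTER = 115               # watchdog aborts its own JobScheduler
TAKEOVER_AFTER = 2 * HEARTBEAT_PERIOD    # other members take over (two missed beats)


def heartbeat_state(seconds_since_last_heartbeat: float) -> str:
    """Classify the age of the last heartbeat against the documented thresholds."""
    age = seconds_since_last_heartbeat
    if age >= TAKEOVER_AFTER:
        return "takeover"   # other cluster members consider the JobScheduler failed
    if age >= WATCHDOG_ABORT_AFTER:
        return "abort"      # the watchdog starts terminating its JobScheduler
    if age >= HEARTBEAT_PERIOD + WARN_DELAY:
        return "warning"    # late heartbeat, warning without consequences
    return "ok"


if __name__ == "__main__":
    for age in (61, 91, 116, 136):
        print(age, heartbeat_state(age))
```

The 5-second gap between `WATCHDOG_ABORT_AFTER` and `TAKEOVER_AFTER` is what gives the watchdog time to stop its JobScheduler before another cluster member starts the same work.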

This behaviour cannot be configured, as it is an "emergency shutdown" procedure that ensures the reliable functioning of the cluster.

Possible reasons for a missing heartbeat

Database problems
Problems with the SMTP mail server
DNS problems
A heavily overloaded computer (e.g. lack of memory)
A change in system time

Output to the log file scheduler.log

The JobScheduler notices that its own heartbeat did not occur at the announced time but 31 seconds later. The warning is issued once the delay reaches 3 seconds. Delays of up to 55 seconds are tolerated.

Code Block
2013-09-12 12:26:18.230 [WARN] (Cluster) SCHEDULER-827 Own heart beat is late: next_heart_beat has been announced for 2013-09-12 12:25:47 (this is 31 seconds late)
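When checking a scheduler.log for late heartbeats, the lateness reported in a SCHEDULER-827 message can be cross-checked against the two timestamps in the same line. The following Python sketch does this for the message above. The regular expression and the function name are assumptions derived from this single sample line, not from a documented log format.

```python
import re
from datetime import datetime

# Pattern derived from the sample message only; the real format may vary.
LATE_HEARTBEAT = re.compile(
    r"^(?P<logged>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+ .*SCHEDULER-827 .*"
    r"announced for (?P<announced>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\(this is (?P<late>\d+) seconds late\)"
)
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"
TOLERATED_DELAY = 55  # seconds, as stated in the text above


def parse_late_heartbeat(line: str):
    """Return reported and recomputed lateness in seconds, or None if no match."""
    match = LATE_HEARTBEAT.search(line)
    if match is None:
        return None
    logged = datetime.strptime(match["logged"], TIMESTAMP_FORMAT)
    announced = datetime.strptime(match["announced"], TIMESTAMP_FORMAT)
    reported = int(match["late"])
    return {
        "reported_late": reported,
        "computed_late": int((logged - announced).total_seconds()),
        "within_tolerance": reported <= TOLERATED_DELAY,
    }


if __name__ == "__main__":
    sample = (
        "2013-09-12 12:26:18.230 [WARN] (Cluster) SCHEDULER-827 Own heart beat is late: "
        "next_heart_beat has been announced for 2013-09-12 12:25:47 (this is 31 seconds late)"
    )
    print(parse_late_heartbeat(sample))
```

For the sample line, the recomputed value (12:26:18 minus 12:25:47) agrees with the 31 seconds reported in the message, and the delay is still within the 55-second tolerance.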

The next message comes from the watchdog. It has detected that the last heartbeat from its JobScheduler was 136 seconds ago. The tolerance of 55 seconds delay (115 seconds since the last heartbeat) has been exceeded.
The watchdog aborts the JobScheduler before another cluster member takes over operation, so that the original JobScheduler cannot resume its work at some later point in parallel with the member that has taken over.

Code Block
2013-09-12 12:28:19.393 [ERROR] (Heart_beat_watchdog_thread) SCHEDULER-386 Last heart beat was 2013-09-12 12:26:03, 136 seconds ago. Something is delaying Scheduler execution, the Scheduler is aborted immediately

If the JobScheduler still manages to send one more heartbeat during the termination process, it notices its own delay and issues the following message:

Code Block
2013-09-12 12:28:20.546 [WARN] (Cluster) SCHEDULER-827 Own heart beat is late: next_heart_beat has been announced for 2013-09-12 12:27:03 (this is 77 seconds late)
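The timestamps in the watchdog message and in the final warning fit together on one timeline. The Python sketch below recomputes it from the values in the log excerpts, using the 60, 115 and 120 second intervals described earlier. Variable names are illustrative only, and the timestamps are truncated to whole seconds.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps copied from the log excerpts (fractions of a second dropped).
last_heartbeat = datetime.strptime("2013-09-12 12:26:03", FMT)
watchdog_message = datetime.strptime("2013-09-12 12:28:19", FMT)
final_warning = datetime.strptime("2013-09-12 12:28:20", FMT)

# Deadlines derived from the intervals given in the text.
next_announced = last_heartbeat + timedelta(seconds=60)      # next heartbeat due
watchdog_deadline = last_heartbeat + timedelta(seconds=115)  # watchdog tolerance ends
takeover_deadline = last_heartbeat + timedelta(seconds=120)  # two beats missed

age_at_abort = int((watchdog_message - last_heartbeat).total_seconds())
lateness_of_final_beat = int((final_warning - next_announced).total_seconds())

if __name__ == "__main__":
    print("next heartbeat announced for:", next_announced.strftime(FMT))
    print("watchdog deadline:           ", watchdog_deadline.strftime(FMT))
    print("takeover deadline:           ", takeover_deadline.strftime(FMT))
    print("heartbeat age at abort:      ", age_at_abort, "seconds")
    print("lateness of final heartbeat: ", lateness_of_final_beat, "seconds")
```

The recomputed values match the log: the next heartbeat was announced for 12:27:03, the watchdog reported a heartbeat age of 136 seconds, and the final heartbeat was 77 seconds late.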

See also

JobScheduler Backup Cluster, which describes the monitoring carried out by other JobSchedulers in more detail.