...

  • Resilience includes support for different outage scenarios with automated and manual fail-over.
  • Outage Scenarios
    • Network Connection Loss
      • A recoverable, temporary connection loss between Master and Agent for a configurable period of time, e.g. 20s.
      • When the connection is lost, neither the JobScheduler Master nor the Agent initially knows whether only the connection failed or whether a Master Service Failure occurred.
    • Master Service Failure
      • Either an unrecoverable connection loss between Master and Agent that lasts longer than the period specified for the Network Connection Loss scenario
        • due to a server crash or
        • due to a JobScheduler Master crash.
      • Or an unplanned JobScheduler Master restart or server restart.
    • Database Connection Loss
      • A recoverable, temporary connection loss between Master and database:
        • for a JobScheduler Active Cluster this scenario includes a period of not more than 50s.
        • for a JobScheduler Passive Cluster this scenario includes no restriction of duration.
      • When the connection is lost, the JobScheduler Master does not know whether the database service failed or whether only the connection failed.

...

  • Reconciliation Scenario
    • applies after a Network Connection Loss between Master and Agent.
    • includes re-establishing the normal relationship between Master and Agent after a connection outage.
  • Agent Behavior
    • By default an Agent will kill any running tasks immediately if the connection to the Master gets lost, i.e. none of the above scenarios is supported (JS-1523). The reasons for this are:
      • If a Master were unavailable for a longer period, the Agent could not report the execution history and log information for its tasks. The Master would then have no information on whether a job execution was successful.
      • The primary goal is to prevent duplicate execution of jobs. Without further information from a Master, the Agent instance cannot know whether it will later be contacted for re-execution of the same job (in which case a task currently running on the Agent could be continued) or whether the Master will choose a different Agent (see AvailabilityAgent Bundle).
    • With a Network Connection Loss setting configured with the Agent's process class the Agent will show the following behavior (JS-1524):
      • During the period specified for the tolerated connection loss duration the Agent will assume the Network Connection Loss scenario.
      • The Agent will continue any running tasks up to the end of the tolerated connection loss period.
        • Reconciliation will take place if the connection between Master and Agent can be re-established during the connection loss period and if the Master has not been restarted.
        • Otherwise the Agent will assume the Master Service Failure scenario and will kill any running tasks.
      • This behavior applies to tasks that are executed for a specific Master for which a connection has been lost. Tasks for other JobScheduler Master instances will be continued.
  • Master/Agent Reconciliation
    • After connection loss the Master will regularly attempt to re-establish the HTTP connection to the Agent. This communication includes a "tunnel" that allows the Agent to report the execution status of running jobs back to the Master.
    • After a successful re-connect within the Network Connection Loss scenario the Master will repeat its request for execution of the respective jobs. Each new request includes an identifier for the previous execution request that allows the Agent to identify repeated requests:
      • for a job that has been completed within the tolerated connection loss period the Agent will report the execution result back to the Master and will not re-execute the job.
      • for a job that is still running the Agent will report the appropriate information back to the Master which will note the running tasks and update JOC accordingly.
  • Delimitation
    • This feature is not intended to support a Master Service Failure scenario or Database Connection Loss scenario.
  • Feature Availability
    • Available starting from release 1.10.2.
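
The Agent behavior and reconciliation rules above can be sketched as a small model. This is a minimal illustration under stated assumptions, not the JobScheduler implementation: the class `AgentSketch`, the field `tolerated_loss_seconds` and all method names are hypothetical, time is passed in explicitly, and the "Master has not been restarted" condition is left out for brevity.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional, Tuple


class TaskState(Enum):
    RUNNING = "running"
    COMPLETED = "completed"
    KILLED = "killed"


@dataclass
class AgentSketch:
    """Hypothetical model of one Agent's tasks for a single Master."""
    tolerated_loss_seconds: float  # assumed name; 0 models the default (kill immediately)
    tasks: Dict[str, TaskState] = field(default_factory=dict)  # request id -> state
    results: Dict[str, int] = field(default_factory=dict)      # request id -> exit code
    lost_at: Optional[float] = None                            # time the connection was lost

    def start(self, request_id: str) -> None:
        self.tasks[request_id] = TaskState.RUNNING

    def complete(self, request_id: str, exit_code: int) -> None:
        self.tasks[request_id] = TaskState.COMPLETED
        self.results[request_id] = exit_code

    def connection_lost(self, now: float) -> None:
        self.lost_at = now
        self.tick(now)  # with a tolerance of 0 this kills running tasks at once

    def tick(self, now: float) -> None:
        """Kill running tasks once the tolerated period has expired."""
        if self.lost_at is None:
            return
        if now - self.lost_at >= self.tolerated_loss_seconds:
            for request_id, state in list(self.tasks.items()):
                if state is TaskState.RUNNING:
                    self.tasks[request_id] = TaskState.KILLED

    def reconnect(self, now: float, previous_request_id: str) -> Tuple[str, Optional[int]]:
        """Answer a repeated execution request without re-executing the job."""
        self.tick(now)
        self.lost_at = None
        state = self.tasks.get(previous_request_id)
        if state is None:
            return ("unknown", None)
        return (state.value, self.results.get(previous_request_id))
```

In this sketch the identifier of the previous execution request is what lets the Agent answer with the stored result or the running state instead of starting the job a second time.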

...

  • After a Master Service Failure the JobScheduler Master can be started in paused mode
    • This start mode prevents all jobs from being started. This applies to 
      • jobs that have previously been requested for execution with Agents, 
      • jobs that have been enqueued and
      • jobs that are scheduled for execution using start time events.
    • All job starts that are delayed due to paused mode will be executed after the JobScheduler Master is continued.
      • This also applies to jobs that are enqueued while paused mode is active.
      • The operation to continue JobScheduler is available with JOC.
    • Paused mode allows users to manually check the job history and optionally remove enqueued tasks if Agent Reconciliation has not taken place.
      • The Agent stores log files of jobs during execution. If an execution result cannot be reported to the Master then the log file will be retained, otherwise it will be removed (JS-1521).
      • Paused mode can be configured to be applied automatically in case of restart of a JobScheduler Master after failure (JS-1522).
  • Delimitation
    • The currently supported measures include manual checking of Agent task logs after failure. 
    • Automated recovery of the Master/Agent execution status after a Master Service Failure is subject to future improvements.
  • Feature Availability
    • Available starting from release 1.10.2.
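
The effect of paused mode on job starts can be illustrated with a simple queue. This is a sketch only: `PausedMasterSketch` and its methods are hypothetical names and do not correspond to JobScheduler or JOC operations.

```python
from collections import deque


class PausedMasterSketch:
    """Hypothetical model: job starts are held while the Master is paused."""

    def __init__(self, paused: bool = True) -> None:
        self.paused = paused
        self.pending = deque()  # job starts delayed by paused mode
        self.started = []       # job starts actually executed

    def request_start(self, job: str) -> None:
        # Applies alike to Agent jobs, enqueued jobs and start time events.
        if self.paused:
            self.pending.append(job)
        else:
            self.started.append(job)

    def remove_pending(self, job: str) -> None:
        # Manual clean-up, e.g. after checking the job history.
        self.pending.remove(job)

    def resume(self) -> None:
        # Corresponds to continuing the Master: delayed starts run in order.
        self.paused = False
        while self.pending:
            self.started.append(self.pending.popleft())
```

A user would inspect `pending` while the Master is paused, remove any start that would duplicate a job already completed on an Agent, and then call `resume()`.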

...

  • Outage Scenario
    • Database Service Failure
      • a temporary connection loss between JobScheduler Master and its database.
  • Supported Scenario
    • Database Service Recovery addresses the Database Service Failure scenario, not any scenario for temporary Network Connection Loss or Master Service Failure.

Feature

tbd

Implementation

  • If the connection to a database gets lost or if a database failure occurs the JobScheduler Master will try to re-connect.
    • A Master single instance can be configured to repeat an unlimited number of connection attempts.
    • A Master Active Cluster member requires the database connection to become available within 120s. Otherwise the cluster member terminates in order to prevent duplicate execution of jobs.
  • For use with replicated databases keep in mind that the delay that is caused by replication might result in data loss. 
    • Depending on the DBMS this delay might be small, however, it might result in duplicate execution of jobs if the information about a previous job run is not available with the replicated database in case of fail-over.
    • Replicated databases are frequently used to achieve a database availability of up to approx. 99.9%.
  • For use with clustered databases JobScheduler does not rely on vendor-specific connection continuity mechanisms but complies with JDBC standards and will simply re-connect after connection loss.
    • Unsupported vendor-specific mechanisms include e.g. SQL Server multi-subnet clustering or MySQL Galera JDBC fail-over that expect the client to switch the connection to some different address.
    • Instead, in case of fail-over the clustered database is expected to be available with the same connection attributes. This could include e.g. DNS switching to make a different database server the primary server in case of fail-over.
    • Clustered databases are frequently used to achieve a database availability of up to approx. 99.999%.
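
The re-connect behavior described above can be sketched as a retry loop with an optional deadline. This is an illustration under assumptions, not the JobScheduler code: the function `reconnect`, its parameters and the use of `ConnectionError` are hypothetical. `max_wait_seconds=None` models a single instance with unlimited attempts, and `max_wait_seconds=120` models an Active Cluster member as stated in the text.

```python
import time
from typing import Callable, Optional


def reconnect(
    connect: Callable[[], None],
    max_wait_seconds: Optional[float] = None,
    retry_interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Retry `connect` with the same connection attributes until it succeeds.

    Returns the number of attempts made. Raises TimeoutError if a deadline
    is set and passes; a cluster member would terminate at that point to
    prevent duplicate execution of jobs.
    """
    start = clock()
    attempts = 0
    while True:
        attempts += 1
        try:
            connect()  # assumed to raise ConnectionError on failure
            return attempts
        except ConnectionError:
            expired = (
                max_wait_seconds is not None
                and clock() - start >= max_wait_seconds
            )
            if expired:
                raise TimeoutError("database not available within the deadline")
            sleep(retry_interval)
```

Because the loop always calls the same `connect` function, it reflects the expectation that a clustered database stays reachable under unchanged connection attributes after fail-over.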

Implementation

Jira issues: JS-1283, JS-951, JS-1032, JS-1082, JS-1157
tbd