Introduction
The scenario described here is that users patch the hosts used by a JS7 environment. This is not related to JS7 - Patch Management of JS7 products, but to patching a host at OS level.
In many situations patching includes rebooting the host. Users would like to know in advance to what extent a reboot will affect JS7 scheduling operation. This implies use of clustering for JOC Cockpit, Controller and Agents, see JS7 - Cluster Architecture: in a JS7 cluster, an outage of one or two hosts allows operation to continue, whereas outages of more hosts can make the cluster non-functional and can require manual intervention for fail-over and restart.
Examples of fatal outages in a cluster:
- If both Primary and Secondary JOC Cockpit instances are shut down, then the Controller Cluster will continue to work. However, fail-over and restart of Controller instances will require user intervention.
- If both Primary and Secondary Controller instances are shut down, then an Agent Cluster will continue to work. However, fail-over and restart of Director Agent instances will require user intervention.
Impact Check Script
The script makes use of the JS7 - Unix Shell CLI for JOC Cockpit Status Operations that offers the health-check command with the --whatif-shutdown option, see Examples for Health Checks.
The script is a stub that can be adjusted and applied for frequently used operations:
- The script is available for Linux and macOS® using the bash shell.
- The script terminates with exit code 0 to signal that the host shutdown scenario will not have a fatal impact. Other exit codes signal a fatal impact on JS7 scheduling operation.
- The script is intended as a baseline example for customization by JS7 users and by SOS within the scope of professional services. Examples make use of JS7 Release 2.7.2, bash 4.2.
The script below checks hosts from a list, one after the next, for impact in case of shutdown.
- Users can limit health checks to clustered JS7 products. Shutdown of a Standalone Agent's host always results in unavailability. Limiting health checks to clustered Agents using the --agent-cluster switch is recommended.
- Users can improve performance
  - by checking (and later patching) more than one host at the same time, using for example: --whatif-shutdown=joc-2-0-primary,joc-2-0-secondary
  - by executing health checks for hosts in parallel.
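The two performance options above can be sketched in bash as follows. This is a minimal, non-authoritative sketch: the function names join_hosts and check_hosts_parallel and the variable health_check_cmd are assumptions made for illustration and are not part of the JS7 CLI. Only the health-check command and the --whatif-shutdown option are taken from this article. The sketch expects request_options to be set as in the script below, and the output of checks running in parallel can be interleaved.

```shell
#!/bin/bash

# command used for health checks; this variable is an assumption of the sketch
health_check_cmd=${health_check_cmd:-./bin/operate-joc.sh}

# join host names with commas, for example for --whatif-shutdown=host1,host2
join_hosts() {
    local IFS=,
    echo "$*"
}

# start one health check per host in the background, then collect the exit codes;
# returns 0 only if every health check returned 0
check_hosts_parallel() {
    local host rc i=0 failed=0
    local pids=()

    for host in "$@"; do
        "$health_check_cmd" health-check "${request_options[@]}" --whatif-shutdown="$host" &
        pids+=("$!")
    done

    for host in "$@"; do
        # wait for the background job that belongs to this host
        wait "${pids[$i]}"
        rc=$?
        echo "$host: exit code $rc"
        [ "$rc" -eq 0 ] || failed=1
        i=$((i+1))
    done

    return "$failed"
}
```

A combined check could then use --whatif-shutdown="$(join_hosts joc-2-0-primary joc-2-0-secondary)", while check_hosts_parallel "${hosts[@]}" runs one check per host at the same time. Note that parallel checks evaluate each host on its own; they do not tell whether several hosts can be shut down together, which is what the comma-separated form is for.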
#!/bin/bash

# set common options for connection to the JS7 REST Web Service
request_options=(--url=http://joc-2-0-primary.sos:7446 --user=root --password=root --ca-cert=./root-ca.crt --controller-id=controller --agent-cluster)

# hosts to be patched
hosts=(joc-2-0-primary joc-2-0-secondary controller-2-0-primary controller-2-0-secondary diragent-2-0-primary diragent-2-0-secondary)

# max. number of tries in case of non-fatal problems
tries=3

# delay in seconds between retries after non-fatal problems
delay=15

for host in "${hosts[@]}"; do
    echo "--------------------------------------------------------"
    echo "CHECKING IMPACT OF HOST SHUTDOWN: $host"
    echo "--------------------------------------------------------"

    try=1
    while [ "$try" -le "$tries" ]; do
        echo ""
        echo "TRY $try/$tries: ./bin/operate-joc.sh health-check "${request_options[@]}" --whatif-shutdown=$host"
        echo ""
        ./bin/operate-joc.sh health-check "${request_options[@]}" --whatif-shutdown="$host"
        rc=$?
        echo -n ""

        case "$rc" in
            0) break;
               ;;
            3) sleep "$delay"
               ;;
            *) exit "$rc"
               ;;
        esac

        try=$((try+1))
    done

    if [ "$rc" -eq 0 ]
    then
        echo "PATCH CAN BE APPLIED TO HOST: $host"
        # add your code for patching
    else
        echo "PATCH CANNOT BE APPLIED TO HOST: $host, Exit Code: $rc"
        # add your code for error handling
    fi

    echo ""
done
Explanations:
- Line 7: specifies the list of hostnames used by clustered JS7 products
- Line 10: specifies the maximum number of tries to perform the health-check. After reboot of a host it can take a few seconds until a cluster is re-established.
- Line 13: specifies the delay between tries. The value should be adjusted if it takes the cluster more time to recouple.
- Lines 29-36: evaluate the health check result:
- exit code 0 signals an operational cluster,
- exit code 3 signals that the cluster is not (yet) functional,
- other exit codes signal component status errors, for example an unavailable Agent.
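The exit code handling explained above can also be wrapped in a function that hands every result back to the caller instead of terminating the script. The following is a sketch of one possible variation, not the method used by the script: the function name check_with_retries and the variable health_check_cmd are assumptions made for illustration, and the sketch skips the delay after the last try.

```shell
#!/bin/bash

# command used for health checks; this variable is an assumption of the sketch
health_check_cmd=${health_check_cmd:-./bin/operate-joc.sh}

# run the health check for one host with retries
# arguments: host, max. number of tries (default 3), delay in seconds (default 15)
# returns 0 if the cluster is operational, 3 if it is still not functional
# after all tries, or any other exit code as soon as it occurs
check_with_retries() {
    local host=$1 tries=${2:-3} delay=${3:-15}
    local try=1 rc=0

    while [ "$try" -le "$tries" ]; do
        "$health_check_cmd" health-check "${request_options[@]}" --whatif-shutdown="$host"
        rc=$?

        case "$rc" in
            0) return 0                                   # cluster is operational
               ;;
            3) [ "$try" -lt "$tries" ] && sleep "$delay"  # not (yet) functional: wait before the next try
               ;;
            *) return "$rc"                               # component status error: do not retry
               ;;
        esac

        try=$((try+1))
    done

    return "$rc"
}
```

Inside the loop over hosts, a call such as check_with_retries "$host" "$tries" "$delay" followed by rc=$? would let the error handling branch see exit codes other than 3 as well, at the cost of continuing with the next host unless the caller decides to stop.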