
Re: HPE Service Guard 15.00.01 on SLES 15 SP4 + SAP HANA scale-up

 
Alex116
Senior Member

HPE Service Guard 15.00.01 on SLES 15 SP4 + SAP HANA scale-up

Dear experts, please help me solve this problem.
There are two nodes, s4hanadba001p and s4hanadba01pr, located at different sites, and one SAP HANA database with SID SHP.
I need to build an SAP HANA scale-up, active/standby configuration.
I created the cluster configuration and generated the packages with the command: deploysappkgs -i 192.168.65.41 -i 10.8.23.41 -o smart_quorum multi SHP,
specifying the virtual addresses on which the primary package hdbpSHP will be up. This produced the primary package hdbpSHP.conf and the secondary package hdbsSHP.conf.
The cluster currently runs the hdbpSHP package on node s4hanadba001p and the hdbsSHP package on node s4hanadba01pr.
The two packages cannot run on the same node; this matters because one database is the source (active) and the other is the target (standby).
Both packages have AUTO_RUN enabled. My test: reboot s4hanadba001p. The cluster does not stop the hdbsSHP package and does not start hdbpSHP on node s4hanadba01pr; hdbpSHP remains halted.
When node s4hanadba001p comes back up, the hdbpSHP package simply starts on that node again!
How can I make the cluster stop the hdbsSHP package on node s4hanadba01pr and start the hdbpSHP package on s4hanadba01pr instead, so that the replication direction is reversed?
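For clarity, the end state I want is what a manual swap of the two packages would produce. A rough sketch with standard Serviceguard commands (the SGeSAP-supported procedure may differ from this):

# as root on any cluster node (sketch only)
cmhaltpkg hdbsSHP                    # stop the secondary package on s4hanadba01pr
cmrunpkg -n s4hanadba01pr hdbpSHP    # start the primary package there (the HANA takeover should happen at this point)
cmmodpkg -e hdbpSHP                  # re-enable automatic switching afterwards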

cluster config:

 

CLUSTER_NAME HANA_SHP

 

# Definition of cluster sites.
#
# SITE_NAME entries define a site to which a NODE_NAME may be subsequently
# assigned using the SITE attribute.

# SITE_NAME

SITE_NAME SHPsiteA
SITE_NAME SHPsiteB

# The HOSTNAME_ADDRESS_FAMILY parameter specifies the Internet Protocol address
# family to which Serviceguard will attempt to resolve cluster node names and
# quorum server host names.
# If the parameter is set to IPV4, Serviceguard will attempt to resolve the names
# to IPv4 addresses only. This is the default value.
# If the parameter is set to IPV6, Serviceguard will attempt to resolve the names
# to IPv6 addresses only. No IPv4 addresses need be configured on the system or
# listed in the /etc/hosts file except for IPv4 loopback address.
# If the parameter is set to ANY, Serviceguard will attempt to resolve the names
# to both IPv4 and IPv6 addresses. The /etc/hosts file on each node must contain
# entries for all IPv4 and IPv6 addresses used throughout the cluster including
# all STATIONARY_IP and HEARTBEAT_IP addresses as well as any other addresses

HOSTNAME_ADDRESS_FAMILY IPV4

# Cluster Lock Parameters
# The cluster lock is used as a tie-breaker for situations
# in which a running cluster fails, and then two equal-sized
# sub-clusters are both trying to form a new cluster. The
# cluster lock may be configured using only one of the
# following alternatives on a cluster:
# the lock LUN
# the quorum server
#
#
# Consider the following when configuring a cluster.
# For a two-node cluster, you must use a cluster lock. For
# a cluster of three or four nodes, a cluster lock is strongly
# recommended. For a cluster of more than four nodes, a
# cluster lock is recommended. If you decide to configure
# a lock for a cluster of more than four nodes, it must be
# a quorum server.

# LUN lock disk parameters. Use the CLUSTER_LOCK_LUN parameter
# to define the device on a per node basis. The device may only
# be used for this purpose and by only a single cluster.
#
# Example for a smart array cluster disk
# CLUSTER_LOCK_LUN /dev/cciss/c0d0p1
# Example for a non smart array cluster disk
# CLUSTER_LOCK_LUN /dev/sda1
# NOTE: Ensure that the disk used for CLUSTER_LOCK_LUN is not an iSCSI disk.

# Quorum Server Parameters. Use the QS_HOST, QS_ADDR, QS_POLLING_INTERVAL,
# and QS_TIMEOUT_EXTENSION parameters to define a quorum server. The QS_HOST
# and QS_ADDR are either the host name or IP address of the system that is
# running the quorum server process. More than one IP address can be
# configured for the quorum server. When one subnet fails, Serviceguard
# uses the next available subnet to communicate with the quorum server.
# QS_HOST is used to specify the quorum server and QS_ADDR can be used to
# specify additional IP addresses for the quorum server. The QS_HOST entry
# must be specified (only once) before any other QS parameters. Only
# one QS_ADDR entry is used to specify the additional IP address.
# Both QS_HOST and QS_ADDR should not resolve to the same IP address.
# Otherwise cluster configuration will fail. All subnets must be up
# when you use cmapplyconf and cmquerycl to configure the cluster.
# The QS_POLLING_INTERVAL is the interval (in microseconds) at which
# Serviceguard checks to make sure the quorum server is running.
#
# Heartbeat health based quorum mechanism allows the quorum server to consider
# the heartbeat health status of cluster node(s) while resolving split-brain situation.
# When enabled, the node with healthy heartbeat network will survive over the
# other node, and heartbeat health status gets the highest preference in arbitration.
# The feature may be enabled for two nodes cluster deployment only and
# to enable this feature, set QS_HB_HEALTH to ON.
#
# Smart quorum server feature will allow the quorum server to take smart
# decision of granting a vote to a subgroup when the split-brain happens in
# the cluster. This feature arbitrates to allow the subgroup running the
# critical workload to survive, thereby eliminating unnecessary failover
# and restart of the workload. To enable this feature set QS_SMART_QUORUM
# parameter to ON.
#
# Legal values for QS_HB_HEALTH are: OFF(the default value), ON.
# Legal values for QS_SMART_QUORUM are: OFF(the default value), ON.
#
# If both of these QS features are enabled, cluster heartbeat health gets
# preference over critical workload. For example NodeB with non critical workload
# and good heartbeat health, gets preference over nodeA with critical workload
# and bad heartbeat health.
#
# If either (or both) of these features are enabled, QS_ARBITRATION_WAIT becomes a
# mandatory parameter with a minimum value of 3000000(in microseconds).

# The optional QS_TIMEOUT_EXTENSION (in microseconds) is used to increase
# the time allocated for quorum server response. The default quorum
# server timeout is calculated primarily from MEMBER_TIMEOUT parameter.
# For cluster of up to 4 nodes it is 0.2*MEMBER_TIMEOUT. It increases
# as number of nodes increases and reaches to 0.5*MEMBER_TIMEOUT for
# 16 nodes
#
# If quorum server is configured on busy network or if quorum server
# polling is experiencing timeouts (syslog messages) or if quorum server
# is used for large number of clusters, such default time (as mentioned
# above) might not be sufficient. In such cases this parameter should be
# used to provide more time for the quorum server.
# Also this parameter deserves more consideration if small values
# for MEMBER_TIMEOUT are used.
#
# The value of QS_TIMEOUT_EXTENSION will directly affect the amount of
# time it takes for cluster reformation in the event of node failure. For
# example, if QS_TIMEOUT_EXTENSION is set to 10 seconds, the cluster
# reformation will take 10 seconds longer than if the QS_TIMEOUT_EXTENSION
# was set to 0. This delay applies even if there is no delay in contacting
# the Quorum Server.
#
# The recommended value for QS_TIMEOUT_EXTENSION is 0 (the default value),
# and the maximum supported value is 300000000 (5 minutes).
#
# QS_ARBITRATION_WAIT specifies how long the QS will wait for the subgroup with
# the critical work load (ACTIVE Site) to approach the QS. If the request is
# first received from the ACTIVE site then the QS will immediately grant quorum
# to the ACTIVE site. If the request is first received from the NON-ACTIVE site
# (site not running the critical workload) then the QS will wait for a request
# from the ACTIVE site. If the request is received from the ACTIVE site before
# the QS_ARBITRATION_WAIT expires then QS will grant quorum to the ACTIVE site
# else it will grant quorum to the NON-ACTIVE site.
# This parameter will be used only when the smart quorum feature is enabled.
# Also it is mandatory to configure this when using the smart quorum feature.
# The value of QS_ARBITRATION_WAIT will also directly affect the cluster
# reformation time in case of node failure. For example if this value is set to
# 5 seconds, the cluster reformation will take 5 seconds longer than normal
# circumstances. The recommended value for QS_ARBITRATION_WAIT is 3000000
# (3 seconds, the default value), and the maximum supported value is 300000000
# (5 minutes).
#
# For example, to configure a quorum server running on node "qs_host"
# with the additional IP address "qs_addr" and with 120 seconds for the
# QS_POLLING_INTERVAL and to add 2 seconds to the system assigned value
# for the quorum server timeout, enter
#
# QS_HOST qs_host
# QS_ADDR qs_addr
# QS_POLLING_INTERVAL 120000000
# QS_TIMEOUT_EXTENSION 2000000


QS_HOST sgqshana
QS_POLLING_INTERVAL 30000000

# In addition to above example, if quorum server supports smart quorum
# or heartbeat health based quorum features, then turn ON the QS_SMART_QUORUM
# and QS_HB_HEALTH parameters respectively, with required QS_ARBITRATION_WAIT value.
#
#
#QS_HB_HEALTH OFF
QS_SMART_QUORUM ON
QS_ARBITRATION_WAIT 3000000

# VCENTER_SERVER is an optional parameter. This must be configured when VMware
# Virtual Machine File System(VMFS) disks are to be used in packages. This
# specifies the alias or IP address of the vCenter server which manages the guest nodes.
# For using VMFS disks in packages, alternatively one can also configure the
# ESX_HOST parameter. Please note only one of the options can be used.
# Provide either ESX_HOST for all cluster nodes that are VMware guests or provide
# one VCENTER_SERVER that manages all the VMware guests, that will be configured
# as cluster nodes.
# VCENTER_SERVER


# Definition of nodes in the cluster.
# Repeat node definitions as necessary for additional nodes.
# NODE_NAME is the specified nodename in the cluster.
# It must match the hostname and both cannot contain full domain name.
# SITE is an optional attribute which is specified after the NODE_NAME.
# SITE attribute is useful to define a node in a particular site.
# Note: Each NODE_NAME can have only one SITE.
# SITE can be defined if SITE_NAME has been defined.
# SITE entry must contain one of the SITE_NAME entries previously defined.
# The below example shows that the node TEST1 belongs to site TEST_SITE1
# NODE_NAME TEST1
# SITE TEST_SITE1

# ESX_HOST is an optional parameter. This must be configured when VMware
# Virtual Machine File System(VMFS) disks are to be used in packages. This
# parameter should be entered after the NODE_NAME parameter and is used to
# specify the alias or IP address of the ESX Host on which the guest specified
# by the preceding NODE_NAME is configured. Each NODE_NAME can have only one
# ESX_HOST. When using the ESX_HOST parameter ensure that the same is similarly
# populated for all the nodes in the cluster which are VMware guests.
# For using VMFS disks in packages, alternatively one can also configure the
# VCENTER_SERVER parameter. Please note only one of the options can be used.
# Provide either ESX_HOST for all cluster nodes that are VMware guests or provide
# one VCENTER_SERVER that manages all the VMware guests, that will be configured
# as cluster nodes.
# The below example shows that the node XYZ belongs to ESX Host ESX_XYZ
# NODE_NAME XYZ
# ESX_HOST ESX_XYZ
# Each NETWORK_INTERFACE, if configured with IPv4 address,
# must have ONLY one IPv4 address entry with it which could
# be either HEARTBEAT_IP or STATIONARY_IP.
# Each NETWORK_INTERFACE, if configured with IPv6 address(es)
# can have multiple IPv6 address entries (up to a maximum of 2,
# only one IPv6 address entry belonging to site-local scope
# and only one belonging to global scope) which could be
# either HEARTBEAT_IP or STATIONARY_IP.

# Note: This configuration contains IPv4 STATIONARY_IP or HEARTBEAT_IP
# addresses. To obtain an IPv6-only cluster on supported platforms,
# comment out any IPv4 STATIONARY_IPs or HEARTBEAT_IPs.
# If this leaves any NETWORK_INTERFACE without any STATIONARY_IP or
# HEARTBEAT_IP, comment out the NETWORK_INTERFACE as well.
# Modify the resulting configuration as necessary to meet the
# heartbeat requirements and recommendations for a Serviceguard
# configuration. These are spelled out in chapter 4 of the Managing
# Serviceguard manual.
#
# Node capacity parameters. Use the CAPACITY_NAME and CAPACITY_VALUE
# parameters to define a capacity for the node. Node capacities correspond to
# package weights; node capacity is checked against the corresponding
# package weight to determine if the package can run on that node.
#
# CAPACITY_NAME specifies a name for the capacity.
# The capacity name can be any string that starts and ends with an
# alphanumeric character, and otherwise contains only alphanumeric characters,
# dot (.), dash (-), or underscore (_). Maximum string
# length is 39 characters. Duplicate capacity names are not allowed.
#
# CAPACITY_VALUE specifies a value for the CAPACITY_NAME that precedes
# it. This is a floating point value between 0 and 1000000. Capacity values
# are arbitrary as far as Serviceguard is concerned; they have meaning only in
# relation to the corresponding package weights.

# Node capacity definition is optional, but if CAPACITY_NAME is specified,
# CAPACITY_VALUE must also be specified; CAPACITY_NAME must come first.
# To specify more than one capacity, repeat this process for each capacity.

# NOTE: If a given capacity is not defined for a node, Serviceguard assumes
# that capacity is infinite on that node. For example, if pkgA, pkgB, and pkgC
# each specify a weight of 1000000 for WEIGHT_NAME "memory", and CAPACITY_NAME
# "memory" is not defined for node1, then all three packages are eligible
# to run at the same time on node1, assuming all other requirements are met.
#
# Cmapplyconf will fail if any node defines a capacity and
# any package has min_package_node as the failover policy or
# has automatic as the failback policy.

# You can define a maximum of 8 capacities.
#
# NOTE: Serviceguard supports a capacity with the reserved name
# "package_limit". This can be used to limit the number of packages
# that can run on a node. If you use "package_limit", you cannot
# define any other capacities for this cluster, and the default
# weight for all packages is 1.
#
# Example:
# CAPACITY_NAME package_limit
# CAPACITY_VALUE 4
#
# This allows a maximum of four packages to run on this node,
# assuming each has the default weight of one.
#
# For all capacities other than "package_limit", the default weight for
# all packages is zero
#

#NODE_NAME s4hanaapp001p
#SITE SHPsiteA
## Cloud compute name provided for the node
##CLOUD_COMPUTE_RESOURCE_NAME
## Cloud compute resource group for the node
##CLOUD_COMPUTE_RESOURCE_GROUP
# NETWORK_INTERFACE eth0
# HEARTBEAT_IP 192.168.65.15
# NETWORK_INTERFACE eth2
# HEARTBEAT_IP 10.8.22.70
## CLUSTER_LOCK_LUN


# Route information
# route id 1: 192.168.65.15
# route id 2: 10.8.22.70
# CAPACITY_NAME
# CAPACITY_VALUE

# Warning: There are no standby network interfaces for eth0.
# Warning: There are no standby network interfaces for eth2.

#NODE_NAME s4hanaapp01pr
#SITE SHPsiteB
## Cloud compute name provided for the node
##CLOUD_COMPUTE_RESOURCE_NAME
## Cloud compute resource group for the node
##CLOUD_COMPUTE_RESOURCE_GROUP
# NETWORK_INTERFACE eth0
# HEARTBEAT_IP 10.8.23.15
# NETWORK_INTERFACE eth2
# HEARTBEAT_IP 10.8.22.102
## CLUSTER_LOCK_LUN


# Route information
# route id 3: 10.8.23.15
# route id 4: 10.8.22.102
# CAPACITY_NAME
# CAPACITY_VALUE

# Warning: There are no standby network interfaces for eth0.
# Warning: There are no standby network interfaces for eth2.

NODE_NAME s4hanadba001p
SITE SHPsiteA
# Cloud compute name provided for the node
#CLOUD_COMPUTE_RESOURCE_NAME
# Cloud compute resource group for the node
#CLOUD_COMPUTE_RESOURCE_GROUP
NETWORK_INTERFACE bond0
HEARTBEAT_IP 192.168.65.20
NETWORK_INTERFACE bond1
STATIONARY_IP 10.8.24.3
NETWORK_INTERFACE bond2
STATIONARY_IP 10.8.24.11
NETWORK_INTERFACE bond3
HEARTBEAT_IP 10.8.22.67
# CLUSTER_LOCK_LUN


# Route information
# route id 1: 192.168.65.20
# route id 5: 10.8.24.3
# route id 6: 10.8.24.11
# route id 2: 10.8.22.67
# CAPACITY_NAME
# CAPACITY_VALUE

# Warning: There are no standby network interfaces for bond0.
# Warning: There are no standby network interfaces for bond1.
# Warning: There are no standby network interfaces for bond2.
# Warning: There are no standby network interfaces for bond3.

NODE_NAME s4hanadba01pr
SITE SHPsiteB
# Cloud compute name provided for the node
#CLOUD_COMPUTE_RESOURCE_NAME
# Cloud compute resource group for the node
#CLOUD_COMPUTE_RESOURCE_GROUP
NETWORK_INTERFACE bond0
HEARTBEAT_IP 10.8.23.20
NETWORK_INTERFACE bond1
STATIONARY_IP 10.8.24.19
NETWORK_INTERFACE bond2
STATIONARY_IP 10.8.24.27
NETWORK_INTERFACE bond3
HEARTBEAT_IP 10.8.22.99
# CLUSTER_LOCK_LUN


# Route information
# route id 3: 10.8.23.20
# route id 4: 10.8.22.99
# route id 7: 10.8.24.19
# route id 8: 10.8.24.27
# CAPACITY_NAME
# CAPACITY_VALUE

# Warning: There are no standby network interfaces for bond0.
# Warning: There are no standby network interfaces for bond3.
# Warning: There are no standby network interfaces for bond1.
# Warning: There are no standby network interfaces for bond2.


# Cluster Timing Parameters (microseconds).

# The MEMBER_TIMEOUT parameter defaults to 14000000 (14 seconds).
# If a heartbeat is not received from a node within this time, it is
# declared dead and the cluster reforms without that node.
# A value of 10 to 25 seconds is appropriate for most installations.
# For installations in which the highest priority is to reform the cluster
# as fast as possible, a setting of as low as 3 seconds is possible.
# When a single heartbeat network with standby interfaces is configured,
# MEMBER_TIMEOUT cannot be set below 14 seconds if the network interface
# type is Ethernet, or 22 seconds if the network interface type is
# InfiniBand (HP-UX only).
# Note that a system hang or network load spike whose duration exceeds
# MEMBER_TIMEOUT will result in one or more node failures.
# The maximum value recommended for MEMBER_TIMEOUT is 60000000
# (60 seconds).

MEMBER_TIMEOUT 10000000


# Configuration/Reconfiguration Timing Parameters (microseconds).

AUTO_START_TIMEOUT 600000000
NETWORK_POLLING_INTERVAL 2000000

# You can use the optional CONFIGURED_IO_TIMEOUT_EXTENSION parameter
# to increase the amount of time (in microseconds) that Serviceguard
# will wait to ensure that all pending I/O on a failed node has ceased.
# To ensure data integrity, you must set this parameter in the following
# cases: for an extended-distance cluster using software mirroring across
# data centers over links between iFCP switches; and for any cluster in
# which packages use NFS mounts. See the section on cluster configuration
# parameters in the 'Managing Serviceguard' manual for more information.
# The default value of CONFIGURED_IO_TIMEOUT_EXTENSION parameter is 0.
# Serviceguard supports the CONFIGURED_IO_TIMEOUT_EXTENSION parameter values
# in the range 0 to (2^31)-1 [2147483647].

# CONFIGURED_IO_TIMEOUT_EXTENSION 0

# IP Monitor Configuration Parameters.
# The following set of three parameters can be repeated as necessary.
# SUBNET is the subnet to be configured whether or not to be monitored
# at IP layer.
# IP_MONITOR is set to ON if the subnet is to be monitored at IP layer.
# IP_MONITOR is set to OFF if the subnet is not to be monitored at IP layer.
# POLLING_TARGET is the IP address to which polling messages are sent
# from each network interface in the subnet.
# Each SUBNET can have multiple polling targets, so multiple
# POLLING_TARGET entries can be specified. If no POLLING_TARGET is
# specified, peer interfaces in the subnet will be polling targets for each other.
# Only subnets with a gateway that is configured to accept
# ICMP Echo Request messages will be included by default with IP_MONITOR
# set to ON, and with its gateway listed as a POLLING_TARGET.
SUBNET 192.168.65.0
IP_MONITOR OFF
#POLLING_TARGET 192.168.65.1

SUBNET 10.8.22.64
IP_MONITOR OFF
# POLLING_TARGET 10.8.22.65

SUBNET 10.8.23.0
IP_MONITOR OFF
#POLLING_TARGET 10.8.23.1

SUBNET 10.8.22.96
IP_MONITOR OFF
# POLLING_TARGET 10.8.22.97

SUBNET 10.8.24.0
IP_MONITOR OFF
# POLLING_TARGET 10.8.24.1

SUBNET 10.8.24.8
IP_MONITOR OFF
# POLLING_TARGET 10.8.24.9

SUBNET 10.8.24.16
IP_MONITOR OFF
# POLLING_TARGET 10.8.24.17

SUBNET 10.8.24.24
IP_MONITOR OFF
# POLLING_TARGET 10.8.24.25

 

# Package Configuration Parameters.
# Enter the maximum number of packages which will be configured in the cluster.
# You can not add packages beyond this limit.
# This parameter is required.
MAX_CONFIGURED_PACKAGES 300


# Load Balancing feature places a package during failover, on such nodes of the
# cluster, so as to balance the load in the cluster, according to weights
# configured. Only one type of capacity can be defined when load balancing
# is turned on and all nodes must specify this capacity as infinite. To turn
# on load balancing in the cluster set it to ON.
# Legal values for LOAD_BALANCING : OFF, ON.
#LOAD_BALANCING OFF

# Root Disk Monitoring allows to monitor root disks of cluster nodes.
# Legal values for ROOT_DISK_MONITOR : OFF, ON.
# Default value is OFF.

# The interval at which Serviceguard validates if root disk of cluster node(s) is healthy.
# Legal values for ROOT_DISK_MONITOR_INTERVAL in microseconds
# between 1000000 (1 second) and 1800000000 (30 minutes).
# Default value is 30000000 microseconds (30 seconds).

# This optional attribute should include Space separated list of
# nodes or host names to be excluded from Root Disk Monitoring.
# Nodes with Serviceguard that does not support Root Disk
# Monitoring will be automatically excluded.
# ROOT_DISK_MONITOR_EXCLUDE_NODES NODE1 NODE2 ...
ROOT_DISK_MONITOR OFF
#ROOT_DISK_MONITOR_INTERVAL 30000000
#ROOT_DISK_MONITOR_EXCLUDE_NODES


# Optional package default weight parameters. Use WEIGHT_NAME and
# WEIGHT_DEFAULT parameters to define a default value for this weight
# for all packages except system multi-node packages.
# Package weights correspond to node capacities; node capacity
# is checked against the corresponding package weight to determine
# if the package can run on that node.
#
# WEIGHT_NAME
# specifies a name for a weight that corresponds to a
# capacity specified earlier in this file. Weight is defined for
# a package, whereas capacity is defined for a node. For any given
# weight/capacity pair, WEIGHT_NAME, CAPACITY_NAME (and weight_name
# in the package configuration file) must be the same. The rules for
# forming all three are the same. See the discussion of the capacity
# parameters earlier in this file.

# NOTE: A weight (WEIGHT_NAME/WEIGHT_DEFAULT) has no meaning on a node
# unless a corresponding capacity (CAPACITY_NAME/CAPACITY_VALUE) is
# defined for that node.
# For example, if CAPACITY_NAME "memory" is not defined for
# node1, then node1's "memory" capacity is assumed to be infinite.
# Now even if pkgA, pkgB, and pkgC each specify the maximum weight
# of 1000000 for WEIGHT_NAME "memory", all three packages are eligible
# to run at the same time on node1, assuming all other requirements are met.
#
# WEIGHT_DEFAULT specifies a default weight for this WEIGHT_NAME.
# This is a floating point value between 0 and 1000000.
# Package weight default values are arbitrary as far as Serviceguard is
# concerned; they have meaning only in relation to the corresponding node
# capacities.
#
# The package weight default parameters are optional. If they are not
# specified, a default value of zero will be assumed. If defined,
# WEIGHT_DEFAULT must follow WEIGHT_NAME. To specify more than one package
# weight, repeat this process for each weight.
# Note: for the reserved weight "package_limit", the default weight is
# always one. This default cannot be changed in the cluster configuration file,
# but it can be overridden in the package configuration file.
#
# For any given package and WEIGHT_NAME, you can override the WEIGHT_DEFAULT
# set here by setting weight_value to a different value for the corresponding
# weight_name in the package configuration file.
#
# Cmapplyconf will fail if you define a default for a weight and no node
# in the cluster specifies a capacity of the same name.
# You can define a maximum of 8 weight defaults
#
# Example: The following example defines a default for "processor" weight
# of 0.1 for the package:
#
# WEIGHT_NAME processor
# WEIGHT_DEFAULT 0.1
#
# WEIGHT_NAME
# WEIGHT_DEFAULT


# Access Control Policy Parameters.
#
# Three entries set the access control policy for the cluster:
# First line must be USER_NAME, second USER_HOST, and third USER_ROLE.
# Enter a value after each.
#
# 1. USER_NAME can either be ANY_USER, or a maximum of
# 8 login names from the /etc/passwd file on user host.
# The following special characters are NOT supported for USER_NAME
# ' ', '/', '\', '*'
# 2. USER_HOST is where the user can issue Serviceguard commands.
# If using Serviceguard Manager, it is the COM server.
# Choose one of these three values: ANY_SERVICEGUARD_NODE, or
# (any) CLUSTER_MEMBER_NODE, or a specific node. For node,
# use the official hostname from domain name server, and not
# an IP addresses or fully qualified name.
# 3. USER_ROLE must be one of these three values:
# * MONITOR: read-only capabilities for the cluster and packages
# * PACKAGE_ADMIN: MONITOR, plus administrative commands for packages
# in the cluster
# * FULL_ADMIN: MONITOR and PACKAGE_ADMIN plus the administrative
# commands for the cluster.
#
# Access control policy does not set a role for configuration
# capability. To configure, a user must log on to one of the
# cluster's nodes as root (UID=0). Access control
# policy cannot limit root users' access.
#
# MONITOR and FULL_ADMIN can only be set in the cluster configuration file,
# and they apply to the entire cluster. PACKAGE_ADMIN can be set in the
# cluster or a package configuration file. If set in the cluster
# configuration file, PACKAGE_ADMIN applies to all configured packages.
# If set in a package configuration file, PACKAGE_ADMIN applies to that
# package only.
#
# MONITOR is set by default in a new cluster configuration as of Serviceguard
# release A.11.19.00. This is to support cluster discovery from other HPE
# Administration products such as Systems Insight Manager (HPESIM) and
# Distributed Systems Administration (DSAU) tools. Removing MONITOR is allowed
# as an online configuration change within Serviceguard. However removing MONITOR
# will break cluster management for HPESIM and HPEVSE products
#
# Conflicting or redundant policies will cause an error while applying
# the configuration, and stop the process. The maximum number of access
# policies that can be configured in the cluster is 200.
#
# Example: to configure a role for user john from node noir to
# administer a cluster and all its packages, enter:
# USER_NAME john
# USER_HOST noir
# USER_ROLE FULL_ADMIN

USER_NAME ANY_USER
USER_HOST ANY_SERVICEGUARD_NODE
USER_ROLE MONITOR


# Cluster Generic Resource(s)
#
# Cluster generic resource is specified with the following
# parameters: "GENERIC_RESOURCE_NAME", "GENERIC_RESOURCE_TYPE",
# "GENERIC_RESOURCE_SCOPE", "GENERIC_RESOURCE_CMD",
# "GENERIC_RESOURCE_RESTART" and GENERIC_RESOURCE_HALT_TIMEOUT.
#
# To define a cluster generic resource, a "GENERIC_RESOURCE_NAME"
# line is required.
#
# "GENERIC_RESOURCE_TYPE" can be SIMPLE or EXTENDED.
# SIMPLE generic resource values can be UP, DOWN or UNKNOWN.
# EXTENDED generic resource valid values are positive integers
# ranging from 1 to 2147483647.
#
# "GENERIC_RESOURCE_CMD" is the command line to be executed to
# start and stop the monitoring of a cluster generic resource
#
# "GENERIC_RESOURCE_SCOPE" can be set to NODE, SITE, or CLUSTER.
# NODE Scope generic resource values are unique across all
# nodes in a cluster.
# SITE Scope generic resource values / status remains same
# across all nodes of the Site in the cluster.
# CLUSTER Scope generic resource values / status remains same
# across all nodes of the cluster.
# Default Value is NODE.
#
# The value for "GENERIC_RESOURCE_RESTART" can be "unlimited",
# "none" or any positive integer value ranging from 1 to 2147483646.
# If the value is "unlimited" the resource command will be restarted
# an infinite number of times. If the value is "none", the resource
# will not be restarted. If the value is a positive integer, the resource
# command will be restarted the specified number of times before failing.
# If "GENERIC_RESOURCE_RESTART" is not specified, the default will
# be "none".
#
# "GENERIC_RESOURCE_HALT_TIMEOUT" is the time in micro seconds
# used to determine the duration Serviceguard will wait for the
# command specified in "GENERIC_RESOURCE_CMD" to halt before
# a SIGKILL signal is sent to force the termination of the resource
# command. In the event of a halt, Serviceguard will first send
# a SIGTERM signal to terminate the command. If the command does not halt,
# Serviceguard will wait for the specified "GENERIC_RESOURCE_HALT_TIMEOUT",
# then send the SIGKILL signal to force the resource command to terminate.
# This timeout value should be large enough to allow all cleanup processes
# associated with the command to complete.
# If the "GENERIC_RESOURCE_HALT_TIMEOUT" is not specified, a zero
# timeout will be assumed, meaning the cluster software will not wait at all
# before sending the SIGKILL signal to halt the command.
# The maximum value supported for GENERIC_RESOURCE_HALT_TIMEOUT is
# 120000000 microseconds (2 minutes).
#
# The attribute "GENERIC_RESOURCE_NOTIFY_FLAG" allows to subscribe for
# notifications when;
# * Node status in the cluster changes.
# * Cluster generic value for which it is configured changes.
#
# When specified, the cluster generic resource for which this
# attribute is configured will be notified. This is an optional
# attribute whose legal values are 00, 01, 10, 11.
# The table illustrates the valid values of GENERIC_RESOURCE_NOTIFY_FLAG
# based on the GENERIC_RESOURCE_SCOPE configured.
#
# |---------------------------------------------------------------|
# |SL No|VALUE|SCOPE |Details |
# |-----|-----|-------------------|-------------------------------|
# | 1 |00 |NA |No subscription. Default value |
# |-----|-----|-------------------|-------------------------------|
# | 2 |01 |NODE, SITE, CLUSTER|Node status change |
# |-----|-----|-------------------|-------------------------------|
# | 3 |10 |SITE, CLUSTER |Cluster generic resource change|
# |-----|-----|-------------------|-------------------------------|
# | 4 |11 |SITE, CLUSTER |For both 2 & 3 |
# |-----|-----|-------------------|-------------------------------|
#
# Linux OS signals are used to deliver notification to the cluster
# generic resource for which this attribute is configured for. To receive the
# notification the GENERIC_RESOURCE_CMD of the specified cluster generic
# resource should be configured with signal handler. The below table explains
# the signals used based upon the GENERIC_RESOURCE_NOTIFY_FLAG value configured.
# |------------------------------------------------------------------|
# |SL No|VALUE|SIGNAL |Details |
# |-----|-----|----------------------|-------------------------------|
# | 1 |00 |None |NA |
# |-----|-----|----------------------|-------------------------------|
# | 2 |01 |SIGRTMIN+1 |Node status change |
# |-----|-----|----------------------|-------------------------------|
# | 3 |10 |SIGRTMIN+3 |Cluster generic resource change|
# |-----|-----|----------------------|-------------------------------|
# | 4 |11 |SIGRTMIN+1, SIGRTMIN+3|Both Node and Cluster generic |
# | | | |resource change |
# |-----|-----|----------------------|-------------------------------|
#
# For example,
# When GENERIC_RESOURCE_NOTIFY_FLAG is set with value "01", the
# notification for change in node status will be delivered to "app_mon"
# GENERIC_RESOURCE_NAME app_mon
# GENERIC_RESOURCE_TYPE simple
# GENERIC_RESOURCE_CMD /usr/bin/app_monitor.sh
# GENERIC_RESOURCE_SCOPE CLUSTER
# GENERIC_RESOURCE_RESTART none
# GENERIC_RESOURCE_HALT_TIMEOUT 60000000
# GENERIC_RESOURCE_NOTIFY_FLAG 01
#
# When GENERIC_RESOURCE_NOTIFY_FLAG is set with value "10", the
# notification for change in the status or value of "app_mon" will
# be delivered to "app_mon"
# GENERIC_RESOURCE_NAME app_mon
# GENERIC_RESOURCE_TYPE simple
# GENERIC_RESOURCE_CMD /usr/bin/app_monitor.sh
# GENERIC_RESOURCE_SCOPE CLUSTER
# GENERIC_RESOURCE_RESTART none
# GENERIC_RESOURCE_HALT_TIMEOUT 60000000
# GENERIC_RESOURCE_NOTIFY_FLAG 10
#
#
# The attribute "GENERIC_RESOURCES_TO_NOTIFY" allows to specify one or more
# space separated cluster generic resources configured in the cluster. These
# will be notified when the cluster generic resource for which this attribute
# is configured for changes. The value cannot contain the name of the cluster
# generic resource under which it is being configured. This parameter is
# optional.
# Linux OS signal is used to deliver the notification to one or more space
# separated cluster generic resource(s) specified, when the cluster generic
# resource for which this attribute is configured for changes. To receive
# notification the GENERIC_RESOURCE_CMD of the cluster generic resource
# specified should be configured with signal handler. Signal SIGRTMIN+4 will be
# delivered as notification.
# For example,
# Notification will be delivered to cluster generic resource "app_mon"
# when the status or value of "disk_mon" changes.
# Below example explains the same.
# GENERIC_RESOURCE_NAME app_mon
# GENERIC_RESOURCE_TYPE simple
# GENERIC_RESOURCE_CMD /usr/bin/app_monitor.sh
# GENERIC_RESOURCE_SCOPE CLUSTER
# GENERIC_RESOURCE_RESTART none
# GENERIC_RESOURCE_HALT_TIMEOUT 60000000
#
# GENERIC_RESOURCE_NAME disk_mon
# GENERIC_RESOURCE_TYPE simple
# GENERIC_RESOURCE_CMD /usr/bin/disk_monitor.sh
# GENERIC_RESOURCE_SCOPE NODE
# GENERIC_RESOURCE_RESTART none
# GENERIC_RESOURCE_HALT_TIMEOUT 60000000
# GENERIC_RESOURCES_TO_NOTIFY app_mon
#
#
# For more information refer to Managing Serviceguard manual
# Example 1: Monitoring of generic resource at cluster level
#
# GENERIC_RESOURCE_NAME app_mon
# GENERIC_RESOURCE_TYPE simple
# GENERIC_RESOURCE_CMD /usr/bin/app_monitor.sh
# GENERIC_RESOURCE_SCOPE NODE
# GENERIC_RESOURCE_RESTART none
# GENERIC_RESOURCE_HALT_TIMEOUT 60000000
# GENERIC_RESOURCE_NOTIFY_FLAG 01
# GENERIC_RESOURCES_TO_NOTIFY disk_mon
#
# GENERIC_RESOURCE_NAME disk_mon
# GENERIC_RESOURCE_TYPE simple
# GENERIC_RESOURCE_CMD /usr/bin/disk_monitor.sh
# GENERIC_RESOURCE_SCOPE CLUSTER
# GENERIC_RESOURCE_RESTART none
# GENERIC_RESOURCE_HALT_TIMEOUT 60000000
# GENERIC_RESOURCE_NOTIFY_FLAG 10
#
# GENERIC_RESOURCE_NAME
# GENERIC_RESOURCE_TYPE
# GENERIC_RESOURCE_CMD
# GENERIC_RESOURCE_SCOPE
# GENERIC_RESOURCE_RESTART
# GENERIC_RESOURCE_HALT_TIMEOUT
# GENERIC_RESOURCE_NOTIFY_FLAG
# GENERIC_RESOURCES_TO_NOTIFY

# For cluster generic resource "sitecontroller_genres", only GENERIC_RESOURCE_NAME,
# GENERIC_RESOURCE_TYPE and GENERIC_RESOURCE_SCOPE can be configured. Other cluster
# generic resource parameters can not be configured for this resource.
# For example,


GENERIC_RESOURCE_NAME sitecontroller_genres
GENERIC_RESOURCE_TYPE extended
GENERIC_RESOURCE_SCOPE SITE


# Subscription identifier for the cloud compute
#AZURE_SUBSCRIPTION_ID

# Enter an email address to receive email alerts on expiring
# Serviceguard licenses/certificates applied on cluster node(s)
# EMAIL_ADDRESS alert_example@hpe.com
#
# EMAIL_ADDRESS

# Set LICENSE_ALERT to ON to start receiving email alerts on
# expiry dates of Serviceguard licenses applied on cluster
# node(s). Set LICENSE_ALERT to OFF to disable license alerts.
# LICENSE_ALERT is enabled by default when EMAIL_ADDRESS is provided,
# and when enabled you will receive email alerts as follows:
# 180 days to expire - Notify once every 30 days
# 90 days to expire - Notify once every 15 days
# 45 days to expire - Notify once every 7 days
# 15 days to expire or post expiry - Notify once every day
#
#LICENSE_ALERT ON
#

# Set CERTIFICATE_ALERT to ON to start receiving email alerts on
# expiry dates of Serviceguard certificates applied on cluster
# node(s). Set CERTIFICATE_ALERT to OFF to disable certificate alerts.
# CERTIFICATE_ALERT is enabled by default when EMAIL_ADDRESS is provided,
# CERTIFICATE_ALERT ON
#

# Specify valid date and timestamp format for Serviceguard package logs.
# If "SG_DATE_TIME_FORMAT" is not specified then default format,
# +%b %e %H:%M:%S, will be used.
# The format string should be specified without double quotes.
# Example 1: SG_DATE_TIME_FORMAT +%b %e %Y %H:%M:%S
# Example 2: SG_DATE_TIME_FORMAT +%FT%H:%M:%S
#
# NOTE: Make sure that the format specified should be valid arguments to
# "date" command in linux. For invalid instance default format will be used.
#
# SG_DATE_TIME_FORMAT

 

 

package hdbpSHP.conf

# -- /opt/cmcluster/bin/deploysappkgs
# -- SGeSAP - A.15.30.01
# -- s4hanadba001p - Wed Aug 28 18:21:46 2024

package_name hdbpSHP
package_description "SGeSAP,HDB SHP HDB00"
module_name sg/basic
module_version 1
module_name sg/service
module_version 1
module_name sg/priority
module_version 1
module_name sg/dependency
module_version 1
module_name sg/pr_cntl
module_version 2
module_name sg/package_ip
module_version 1
module_name sgesap/hdbprimary
module_version 1
module_name sg/generic_resource
module_version 1
module_name sgesap/hdbinstance
module_version 3
module_name sgesap/hdb_global
module_version 1
module_name sgesap/hdbinstance_global
module_version 1
module_name sg/failover
module_version 1
package_type failover
# -- deploysappkgs new attribute --
node_name s4hanadba001p
# -- deploysappkgs new attribute --
node_name s4hanadba01pr
auto_run yes
node_fail_fast_enabled no
run_script_timeout no_timeout
halt_script_timeout no_timeout
successor_halt_timeout no_timeout
# -- deploysappkgs new attribute --
script_log_file $SGRUN/log/hdbSHP.log
operation_sequence $SGCONF/scripts/sg/pr_cntl.sh
operation_sequence $SGCONF/scripts/sgesap/hdbprimary.sh
operation_sequence $SGCONF/scripts/sgesap/hdbinstance.sh
operation_sequence $SGCONF/scripts/sg/package_ip.sh
operation_sequence $SGCONF/scripts/sg/service.sh
failover_policy configured_node
failback_policy manual
# -- deploysappkgs new attribute --
priority 990
sgesap/hdbprimary/hdb_quiesce_timeout 60
# -- deploysappkgs new attribute --
sgesap/hdb_global/hdb_system SHP
# -- deploysappkgs new attribute --
sgesap/hdbinstance_global/hdb_instance HDB00
# -- deploysappkgs new attribute --
sgesap/hdb_global/hdb_sto_sample_size 20
# -- deploysappkgs new attribute --
sgesap/hdb_global/hdb_sto_safety_factor 5
# -- deploysappkgs new attribute --
sgesap/hdb_global/hdb_sto_minimum_threshold 10
sgesap/hdb_global/hdb_retry_count 25
sgesap/hdb_global/hdb_retry_count_hdbsql 25
# -- deploysappkgs new attribute --
dependency_name hdbsSHP_rep
# -- deploysappkgs new attribute --
dependency_location same_node
# -- deploysappkgs new attribute --
dependency_condition hdbsSHP = down
# -- deploysappkgs new attribute --
sgesap/hdbprimary/hdb_sync_time_tolerance 40
# -- deploysappkgs new attribute --
standard_workload_name HANA_SHP^~SHP_db
# -- deploysappkgs new attribute --
standard_workload_type sap_hsr_su_database
# -- deploysappkgs new attribute --
ip_subnet 192.168.65.0
# -- deploysappkgs new attribute --
ip_address 192.168.65.41
# -- deploysappkgs new attribute --
ip_subnet_node s4hanadba001p
# -- deploysappkgs new attribute --
ip_subnet 10.8.23.0
# -- deploysappkgs new attribute --
ip_address 10.8.23.41
# -- deploysappkgs new attribute --
ip_subnet_node s4hanadba01pr
# -- deploysappkgs new attribute --
generic_resource_name sitecontroller_genres
# -- deploysappkgs new attribute --
generic_resource_evaluation_type during_package_start
# -- deploysappkgs new attribute --
generic_resource_up_criteria >1
# -- deploysappkgs new attribute --
service_name hdbpSHPhdbsys
# -- deploysappkgs new attribute --
service_cmd $SGCONF/monitors/sgesap/saphdbsys.mon
# -- deploysappkgs new attribute --
service_restart none
# -- deploysappkgs new attribute --
service_fail_fast_enabled no
# -- deploysappkgs new attribute --
service_halt_timeout 5

 

 

 

 

 

package hdbsSHP.conf

 

# -- /opt/cmcluster/bin/deploysappkgs
# -- SGeSAP - A.15.30.01
# -- s4hanadba01pr - Wed Aug 28 18:21:46 2024

package_name hdbsSHP
package_description "SGeSAP,HDB SHP HDB00"
module_name sg/basic
module_version 1
module_name sg/failover
module_version 1
module_name sg/priority
module_version 1
module_name sg/pr_cntl
module_version 2
module_name sgesap/hdbinstance
module_version 3
module_name sgesap/hdb_global
module_version 1
module_name sgesap/hdbinstance_global
module_version 1
module_name sg/service
module_version 1
module_name sg/dependency
module_version 1
module_name sg/package_ip
module_version 1
module_name sg/generic_resource
module_version 1
package_type failover
# -- deploysappkgs new attribute --
node_name s4hanadba01pr
# -- deploysappkgs new attribute --
node_name s4hanadba001p
auto_run yes
node_fail_fast_enabled no
run_script_timeout no_timeout
halt_script_timeout no_timeout
successor_halt_timeout no_timeout
# -- deploysappkgs new attribute --
script_log_file $SGRUN/log/hdbSHP.log
operation_sequence $SGCONF/scripts/sg/pr_cntl.sh
operation_sequence $SGCONF/scripts/sgesap/hdbinstance.sh
operation_sequence $SGCONF/scripts/sg/package_ip.sh
operation_sequence $SGCONF/scripts/sg/service.sh
failover_policy configured_node
failback_policy manual
# -- deploysappkgs new attribute --
priority 1000
# -- deploysappkgs new attribute --
sgesap/hdb_global/hdb_system SHP
# -- deploysappkgs new attribute --
sgesap/hdbinstance_global/hdb_instance HDB00
# -- deploysappkgs new attribute --
sgesap/hdb_global/hdb_sto_sample_size 20
# -- deploysappkgs new attribute --
sgesap/hdb_global/hdb_sto_safety_factor 5
# -- deploysappkgs new attribute --
sgesap/hdb_global/hdb_sto_minimum_threshold 10
sgesap/hdb_global/hdb_retry_count 25
sgesap/hdb_global/hdb_retry_count_hdbsql 25
# -- deploysappkgs new attribute --
dependency_name hdbpSHP_rep
# -- deploysappkgs new attribute --
dependency_location same_node
# -- deploysappkgs new attribute --
dependency_condition hdbpSHP = down
# -- deploysappkgs new attribute --
standard_workload_name HANA_SHP^~SHP_db
# -- deploysappkgs new attribute --
standard_workload_type sap_hsr_su_database
# -- deploysappkgs new attribute --
generic_resource_name sitecontroller_genres
# -- deploysappkgs new attribute --
generic_resource_evaluation_type during_package_start
# -- deploysappkgs new attribute --
generic_resource_up_criteria >1
# -- deploysappkgs new attribute --
service_name hdbsSHPhdbsys
# -- deploysappkgs new attribute --
service_cmd $SGCONF/monitors/sgesap/saphdbsys.mon
# -- deploysappkgs new attribute --
service_restart none
# -- deploysappkgs new attribute --
service_fail_fast_enabled no
# -- deploysappkgs new attribute --
service_halt_timeout 5

 

 

 

 

 

Mr_Techie
Trusted Contributor

Re: HPE Service Guard 15.00.01 on SLES 15 SP4 + SAP HANA scale-up

@Alex116 

It sounds like you're dealing with a failover scenario that is not behaving as expected. In an SAP HANA active/standby setup with HPE ServiceGuard, the key is ensuring proper synchronization between the primary and standby nodes and configuring the packages and failover behavior correctly.


Check that the quorum (smart_quorum) is correctly set up, as this helps in determining which node should take over. It appears you have set it up using multiple interfaces, which is good, but make sure the quorum works as expected in a failover scenario.

 

The failure to start the hdbpSHP package on 's4hanadba01pr' might be due to the replication direction not switching. In a scale-up active/standby scenario, the replication direction must change before the standby node can become active. Check the SAP HANA system replication settings to ensure the automatic failover to the standby node is functioning as expected.
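
For example, the replication view from both sides can be checked as the shpadm user like this (a sketch; the systemReplicationStatus.py call assumes a default HDB00 layout):

s4hanadba001p:shpadm> hdbnsutil -sr_state
s4hanadba001p:shpadm> /usr/sap/SHP/HDB00/HDBSettings.sh systemReplicationStatus.py
s4hanadba01pr:shpadm> hdbnsutil -sr_state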

Let me know.

Alex116
Senior Member

Re: HPE Service Guard 15.00.01 on SLES 15 SP4 + SAP HANA scale-up

@Mr_Techie 

Hi. Thanks for the answer, but I don't quite understand. Isn't it Serviceguard that tells SAP HANA, via its scripts, that the standby node s4hanadba01pr should become the primary node? What is the SAP HANA command for this, or in which file should I configure it?

s4hanadba001p:shpadm> sapcontrol -nr 00 -function HACheckFailoverConfig

29.08.2024 22:16:35
HACheckFailoverConfig
OK
state, category, description, comment
SUCCESS, HA CONFIGURATION, Serviceguard config, Instance maps to cluster package(s) hdbpSHP,hdbsSHP

s4hanadba001p:shpadm> hdbnsutil -sr_state

System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~

online: true

mode: primary
operation mode: primary
site id: 1
site name: SHPsiteA

is source system: true
is secondary/consumer system: false
has secondaries/consumers attached: true
is a takeover active: false
is primary suspended: false

Host Mappings:
~~~~~~~~~~~~~~

s4hanareppri -> [SHPsiteB] s4hanarepsec
s4hanareppri -> [SHPsiteA] s4hanareppri


Site Mappings:
~~~~~~~~~~~~~~
SHPsiteA (primary/primary)
|---SHPsiteB (async/logreplay)

Tier of SHPsiteA: 1
Tier of SHPsiteB: 2

Replication mode of SHPsiteA: primary
Replication mode of SHPsiteB: async

Operation mode of SHPsiteA: primary
Operation mode of SHPsiteB: logreplay

Mapping: SHPsiteA -> SHPsiteB

Hint based routing site:
done.

s4hanadba01pr:shpadm> hdbnsutil -sr_state

System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~

online: false

mode: async
operation mode: unknown
site id: 2
site name: SHPsiteB

is source system: unknown
is secondary/consumer system: true
has secondaries/consumers attached: unknown
is a takeover active: false
is primary suspended: false
is timetravel enabled: false
replay mode: auto
active primary site: 1

primary masters: s4hanareppri
done.

Mr_Techie
Trusted Contributor

Re: HPE Service Guard 15.00.01 on SLES 15 SP4 + SAP HANA scale-up

@Alex116 

To make the 's4hanadba01pr' node (SHPsiteB) become the primary node during a failover, ServiceGuard uses scripts to trigger the SAP HANA system takeover. The command needed is:

hdbnsutil -sr_takeover

This command triggers SAP HANA to take over as the primary system on the standby node. It needs to be part of the ServiceGuard failover process, ideally incorporated into the failover scripts so that when the 'hdbpSHP' package is moved to the 's4hanadba01pr' node, the takeover command is executed automatically.
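
For illustration, the HANA-level part of this role swap, run as the shpadm user outside of cluster control, would look roughly like the sketch below; the -sr_register values for the old primary are assumptions based on the async/logreplay setup and site names shown earlier in the thread:

# on the standby, take over the primary role
s4hanadba01pr:shpadm> hdbnsutil -sr_takeover

# later, on the old primary, stop the local instance and re-register it as the new secondary
s4hanadba001p:shpadm> HDB stop
s4hanadba001p:shpadm> hdbnsutil -sr_register --remoteHost=s4hanarepsec --remoteInstance=00 --replicationMode=async --operationMode=logreplay --name=SHPsiteA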

 

Alex116
Senior Member

Re: HPE Service Guard 15.00.01 on SLES 15 SP4 + SAP HANA scale-up

I think ERS is needed for automatic failover. What do you think?

setset
Established Member

Re: HPE Service Guard 15.00.01 on SLES 15 SP4 + SAP HANA scale-up

Is the issue resolved?
I am having the same issue.
My environment is RHEL 9.2 with HPE Serviceguard 15.30.00.
My cluster.conf and package configs are the same as Alex's.
I also cannot fail over properly, whether I use the command to fail over manually or force a reboot.
When I reboot node 1, "hdbnsutil -sr_takeover" is executed on node 2, but the watchdog kicks in and the primary package fails after the 2800-second timeout.

setset
Established Member

Re: HPE Service Guard 15.00.01 on SLES 15 SP4 + SAP HANA scale-up

@Mr_Techie The manual failover procedure with cmpushpkg, as guided by HPE Serviceguard, is as follows (SID: DA1, instance number: 00):

[root@sapnode1 ~]# cmpushpkg -dfFt -n sapnode2 hdbpDA1
Package hdbsDA1 occupies the target node
Temporary automatic failover disablement for hdbpDA1 - skipped
Enable HANA-halt-detach on all nodes - skipped
Halting package hdbpDA1 - skipped
Additional capacity needed
Temporary automatic failover disablement for hdbsDA1 - skipped
Halting package hdbsDA1 to free capacity - skipped
Assign double primary production token on sapnode2 - skipped
Enabling sapnode2 for package hdbpDA1 - skipped
Running package hdbpDA1 on node sapnode2 - skipped
Disable HANA-halt-detach on all nodes - skipped
(Re-)enable automatic failover for hdbpDA1 - skipped
(Re-)enable automatic failover for hdbsDA1 - skipped

I tried the failover using the command below:
[root@sapnode1 ~]# cmpushpkg -f -n sapnode2 hdbpDA1

During this process, the package start does not complete at the "Running package hdbpDA1 on node sapnode2" step.
As mentioned above, it stops at:
(takeover_hdb): running cmd: /usr/sap/DA1/HDB00/exe/hdbnsutil -sr_takeover
(watchdog): Watchdog initiated (PIDs: 36367 for 36357/36357 - Timeout: 2800 secs)
After the timeout, the package start fails.

The generated log is as follows.

Sep 3 09:25:43 root@sapnode2 master_control_script.sh[32893]: ###### Starting package hdbpDA1 ######
Sep 3 09:25:43 root@sapnode2 hdbprimary.sh[33070]: Entering SGeSAP hdbprimary.sh 'start' runtime steps ...
Sep 3 09:25:43 root@sapnode2 hdbprimary.sh[33070]: hdbprimary.sh - #"@(#) HPE Serviceguard SAP Add-On - A.15.30.00" - 2122999194 4814
Sep 3 09:25:43 root@sapnode2 hdbprimary.sh[33070]: Not used: /usr/local/cmcluster/conf/scripts/ext/SAP_customer_functions.sh
Sep 3 09:25:43 root@sapnode2 hdbprimary.sh[33070]: Found: /usr/local/cmcluster/conf/scripts/sgesap/sap_functions.sh
Sep 3 09:25:43 root@sapnode2 hdbprimary.sh[33070]: sap_functions.sh - #"@(#) HPE Serviceguard SAP Add-On - A.15.30.00" - 353416140 992146
Sep 3 09:25:43 root@sapnode2 hdbprimary.sh[33070]: Found: /usr/local/cmcluster/conf/scripts/sgesap/customer_functions.sh
Sep 3 09:25:43 root@sapnode2 hdbprimary.sh[33070]: customer_functions.sh - #"@(#) HPE Serviceguard SAP Add-On - A.15.30.00" - 2922905124 4699
Sep 3 09:25:43 root@sapnode2 hdbprimary.sh[33070]: Not used: /usr/local/cmcluster/conf/scripts/ext/hdbpDA1_customer_functions.sh
Sep 3 09:25:43 root@sapnode2 hdbprimary.sh[33070]: (check_versions): TRACE POINT
Sep 3 09:25:43 root@sapnode2 hdbprimary.sh[33070]: (check_lxversion 9.2 8.0): TRACE POINT
Sep 3 09:25:43 root@sapnode2 hdbprimary.sh[33070]: (check_lxversion): LXVERSION=[9.2] MINVERSION=[8.0]
Sep 3 09:25:43 root@sapnode2 hdbprimary.sh[33070]: (check_versions): Found HW platform=(Linux) Linux release=(RHEL)
Sep 3 09:25:43 root@sapnode2 hdbprimary.sh[33070]: (check_versions): Version check passed (LX:9.2, SG:15.30.00)
Sep 3 09:25:43 root@sapnode2 hdbprimary.sh[33070]: (check_versions): Log Level is 5
Sep 3 09:25:43 root@sapnode2 hdbprimary.sh[33070]: (check_debug): TRACE POINT
Sep 3 09:25:43 root@sapnode2 hdbprimary.sh[33070]: (check_debug): HA APIDIR: /usr/local/cmcluster/run/.sgesap/api
Sep 3 09:25:43 root@sapnode2 hdbprimary.sh[33070]: (clean_package_procs hdbprimary): TRACE POINT
Sep 3 09:25:44 root@sapnode2 hdbprimary.sh[33070]: (init_params_hdb_inst): TRACE POINT
Sep 3 09:25:44 root@sapnode2 hdbprimary.sh[33070]: (init_params_hdb_inst): Configured HANA System ID: DA1
Sep 3 09:25:44 root@sapnode2 hdbprimary.sh[33070]: (init_params_hdb_inst): Configured HANA Instance: HDB00
Sep 3 09:25:44 root@sapnode2 hdbprimary.sh[33070]: (watchdog): Watchdog initiated (PIDs: 33229 for 33223/33223 - Timeout: 30 secs)
Sep 3 09:25:44 root@sapnode2 hdbprimary.sh[33070]: (watchdog): wait for PID 33223 returns 0
Sep 3 09:25:45 root@sapnode2 hdbprimary.sh[33070]: (init_params_hdb_inst): HDB version: 2.00.076.00
Sep 3 09:25:45 root@sapnode2 hdbprimary.sh[33070]: (set_environment HDB 00 sapnode2 DA1): TRACE POINT
Sep 3 09:25:45 root@sapnode2 hdbprimary.sh[33070]: (set_environment HDB00): Instance administrator : 'da1adm'
Sep 3 09:25:45 root@sapnode2 hdbprimary.sh[33070]: (set_environment HDB00): Instance home directory : '/usr/sap/DA1/home'
Sep 3 09:25:45 root@sapnode2 hdbprimary.sh[33070]: (set_environment HDB00): Startprofile : '/usr/sap/DA1/SYS/profile/DA1_HDB00_sapnode2'
Sep 3 09:25:45 root@sapnode2 hdbprimary.sh[33070]: (set_environment HDB00): Instance profile : '/usr/sap/DA1/SYS/profile/DA1_HDB00_sapnode2'
Sep 3 09:25:45 root@sapnode2 hdbprimary.sh[33070]: (set_environment HDB00): Instance work directory : '/usr/sap/DA1/HDB00/sapnode2/trace'
Sep 3 09:25:45 root@sapnode2 hdbprimary.sh[33070]: (init_params_hdb_inst): Check Instance Agent setup...
Sep 3 09:25:45 root@sapnode2 hdbprimary.sh[33070]: (prepare_sapserviceconfig): TRACE POINT
Sep 3 09:25:45 root@sapnode2 hdbprimary.sh[33070]: (init_params_hdb_inst): /usr/sap/sapservices readable
Sep 3 09:25:45 root@sapnode2 hdbprimary.sh[33070]: (set_environment HDB 00 sapnode2 DA1): TRACE POINT
Sep 3 09:25:45 root@sapnode2 hdbprimary.sh[33070]: (init_params_hdb_inst): [systemctl --no-ask-password start SAPDA1_00 # sapstartsrv pf=/usr/sap/DA1/SYS/profile/DA1_HDB00_sapnode2]
Sep 3 09:25:45 root@sapnode2 hdbprimary.sh[33070]: (init_params_hdb_inst): Package will handle Instance Agent of HDB00
Sep 3 09:25:45 root@sapnode2 hdbprimary.sh[33070]: (find_sap_binary): found location: "/usr/sap/DA1/HDB00/exe/sapstartsrv" owner: "da1adm"
Sep 3 09:25:45 root@sapnode2 hdbprimary.sh[33070]: (set_sapstartsrv_systemd /usr/sap/DA1/HDB00/exe/sapstartsrv): TRACE POINT
Sep 3 09:25:45 root@sapnode2 hdbprimary.sh[33070]: (set_sapstartsrv_systemd): SAPSTARTSRV_NOREGISTER=[1], SAPSTARTSRV_NOSYSTEMD=[]
Sep 3 09:25:45 root@sapnode2 hdbprimary.sh[33070]: (init_params_hdb_inst): Self-tuning TimeOuts configured
Sep 3 09:25:45 root@sapnode2 hdbprimary.sh[33070]: (collect_eps_updparams): TRACE POINT
Sep 3 09:25:45 root@sapnode2 hdbprimary.sh[33070]: (detect_own_tier): TRACE POINT
Sep 3 09:25:45 root@sapnode2 hdbprimary.sh[33070]: (detect_hdb_scaleout): TRACE POINT
Sep 3 09:25:45 root@sapnode2 hdbprimary.sh[33070]: (detect_hdbmultidb): TRACE POINT
Sep 3 09:25:45 root@sapnode2 hdbprimary.sh[33070]: (detect_hdbmultidb): HDB is a multi-tenant DB
Sep 3 09:25:45 root@sapnode2 hdbprimary.sh[33070]: (detect_hdbmultidb): Found multi-tenant HDBMDC_SIDS=[ DA1 ]
Sep 3 09:25:45 root@sapnode2 hdbprimary.sh[33070]: (detect_peer_packages): TRACE POINT
Sep 3 09:25:45 root@sapnode2 hdbprimary.sh[33070]: (detect_peer_packages): Corresponding package found: hdbsDA1
Sep 3 09:25:45 root@sapnode2 hdbprimary.sh[33070]: (detect_peer_packages): hdbsDA1 has replication services
Sep 3 09:25:45 root@sapnode2 hdbprimary.sh[33070]: (detect_peer_packages): Corresponding package found: hdbpDA1
Sep 3 09:25:45 root@sapnode2 hdbprimary.sh[33070]: (detect_peer_packages): hdbpDA1 has primary services
Sep 3 09:25:45 root@sapnode2 hdbprimary.sh[33070]: (prepare_sapserviceconfig): TRACE POINT
Sep 3 09:25:45 root@sapnode2 hdbprimary.sh[33070]: (set_environment HDB 00 sapnode2 DA1): TRACE POINT
Sep 3 09:25:45 root@sapnode2 hdbprimary.sh[33070]: (check_provider): TRACE POINT
Sep 3 09:25:45 root@sapnode2 hdbprimary.sh[33070]: (check_provider): HANA instance is configured to use safesync
Sep 3 09:25:46 root@sapnode2 hdbprimary.sh[33070]: (check_provider): HANA generic resources used in package mode
Sep 3 09:25:46 root@sapnode2 hdbprimary.sh[33070]: (safesync_init): TRACE POINT
Sep 3 09:25:46 root@sapnode2 hdbprimary.sh[33070]: (safesync_init): Check access to remote safesync persistence on sapnode1 ...
Sep 3 09:25:46 root@sapnode2 hdbprimary.sh[33070]: (safesync_init): Success...
Sep 3 09:25:46 root@sapnode2 hdbprimary.sh[33070]: (safesync_init): Performing full safesync check
Sep 3 09:25:47 root@sapnode2 hdbprimary.sh[33070]: (safesync_init): WARNING: Autorun is disabled and will not become enabled automatically after secondary syncup
Sep 3 09:25:47 root@sapnode2 hdbprimary.sh[33070]: (hadr_listener on): TRACE POINT
Sep 3 09:25:47 root@sapnode2 hdbprimary.sh[33070]: (hadr_listener): Recreate HADR provider pipe
Sep 3 09:25:47 root@sapnode2 hdbprimary.sh[33070]: (hadr_listener): Forked hadr listener (PID: 34605)
Sep 3 09:25:47 root@sapnode2 hdbprimary.sh[33070]: (find_sap_binary): used location: "/usr/sap/DA1/HDB00/exe/sappfpar" owner: "da1adm"
Sep 3 09:25:47 root@sapnode2 hdbprimary.sh[33070]: (sap_functions): WARNING: Instance Autostart via Instance Agent detected - delay start of Instance Agent
Sep 3 09:25:47 root@sapnode2 hdbprimary.sh[33070]: (takeover_hdb HDB 00 DA1): TRACE POINT
Sep 3 09:25:47 root@sapnode2 hdbprimary.sh[33070]: (check_replication HDB 00 DA1): TRACE POINT
Sep 3 09:25:47 root@sapnode2 hdbprimary.sh[33070]: (find_sap_binary): used location: "/usr/sap/DA1/HDB00/exe/hdbnsutil" owner: "da1adm"
Sep 3 09:25:47 root@sapnode2 hdbprimary.sh[33070]: (watchdog): Watchdog initiated (PIDs: 34825 for 34821/ 34820 - Timeout: 25 secs)
Sep 3 09:25:48 root@sapnode2 hdbprimary.sh[33070]: (watchdog): wait for PID 34821 returns 0
Sep 3 09:25:48 root@sapnode2 hdbprimary.sh[33070]: (check_replication): Current replication state is secondary
Sep 3 09:25:48 root@sapnode2 hdbprimary.sh[33070]: (check_replication): Check whether takeover is pending
Sep 3 09:25:48 root@sapnode2 hdbprimary.sh[33070]: (watchdog): Watchdog initiated (PIDs: 35106 for 35102/ 35101 - Timeout: 15 secs)
Sep 3 09:25:50 root@sapnode2 hdbprimary.sh[33070]: (watchdog): wait for PID 35102 returns 0
Sep 3 09:25:50 root@sapnode2 hdbprimary.sh[33070]: (takeover_hdb): Check whether takeover is feasible...
Sep 3 09:25:50 root@sapnode2 hdbprimary.sh[33070]: (find_sap_binary): used location: "/usr/sap/DA1/HDB00/exe/sapcontrol" owner: "da1adm"
Sep 3 09:25:50 root@sapnode2 hdbprimary.sh[33070]: (takeover_hdb): The local replication state does not match to a primary package - execute scale-up safety checks...
Sep 3 09:25:50 root@sapnode2 hdbprimary.sh[33070]: (takeover_hdb): Primary package start was triggered by explicit attempt to start on a server with secondary instance.
Sep 3 09:25:50 root@sapnode2 hdbprimary.sh[33070]: (remote_check_replication HDB 00 DA1): TRACE POINT
Sep 3 09:25:50 root@sapnode2 hdbprimary.sh[33070]: (remote_check_replication): This package start is the first one triggered after cluster start
Sep 3 09:25:50 root@sapnode2 hdbprimary.sh[33070]: (remote_sr_state HDB 00 DA1 sapnode1): TRACE POINT
Sep 3 09:25:50 root@sapnode2 hdbprimary.sh[33070]: (find_sap_binary): used location: "/usr/sap/DA1/HDB00/exe/hdbnsutil" owner: "da1adm"
Sep 3 09:25:50 root@sapnode2 hdbprimary.sh[33070]: (remote_sr_state): Running [cmexec sapnode1 -t 15 su - da1adm -c '/usr/sap/DA1/HDB00/exe/hdbnsutil -sr_stateConfiguration']
Sep 3 09:25:52 root@sapnode2 hdbprimary.sh[33070]: (remote_sr_state): Replication state is primary for node=[sapnode1]
Sep 3 09:25:52 root@sapnode2 hdbprimary.sh[33070]: (takeover_hdb): No safesync block - proceeding
Sep 3 09:25:52 root@sapnode2 hdbprimary.sh[33070]: (STO_calc_timeout HDB 00 DA1 takeover_hdb_time): TRACE POINT
Sep 3 09:25:52 root@sapnode2 hdbprimary.sh[33070]: (STO_calc_timeout): Calculating new timeouts from timings in file=[/usr/local/cmcluster/run/.sgesap/STO_SAVED_TIMES/hdbpDA1]
Sep 3 09:25:52 root@sapnode2 hdbprimary.sh[33070]: (STO_calc_timeout): Found a tmax=[0] for action=[takeover_hdb_time] from previous run(s)
Sep 3 09:25:52 root@sapnode2 hdbprimary.sh[33070]: (STO_calc_timeout): instance=[HDB00] Using HDB_STO_TAKEOVER_TIMEOUT=[2800]
Sep 3 09:25:52 root@sapnode2 hdbprimary.sh[33070]: (update_globalini): TRACE POINT
Sep 3 09:25:52 root@sapnode2 hdbprimary.sh[33070]: (takeover_hdb): Trigger takeover operation...
Sep 3 09:25:52 root@sapnode2 hdbprimary.sh[33070]: (find_sap_binary): used location: "/usr/sap/DA1/HDB00/exe/hdbnsutil" owner: "da1adm"
Sep 3 09:25:52 root@sapnode2 hdbprimary.sh[33070]: (takeover_hdb): running cmd: /usr/sap/DA1/HDB00/exe/hdbnsutil -sr_takeover
Sep 3 09:25:52 root@sapnode2 hdbprimary.sh[33070]: (watchdog): Watchdog initiated (PIDs: 36367 for 36357/36357 - Timeout: 2800 secs)
Sep 3 09:25:54 root@sapnode2 hdbprimary.sh[33070]: (runTrigger): HADR provider trigger received
Sep 3 09:25:54 root@sapnode2 hdbprimary.sh[33070]: (runTrigger): Activate persistence block for failover to hdbsDA1...
Sep 3 09:25:55 root@sapnode2 hdbprimary.sh[33070]: (runTrigger): Trigger succeeded locally - try to propagate remotely
Sep 3 09:25:55 root@sapnode2 hdbprimary.sh[33070]: (watchdog): Watchdog initiated (PIDs: 36528 for 36523/ 34605 - Timeout: 15 secs)
Sep 3 09:25:55 root@sapnode2 hdbprimary.sh[33070]: (watchdog): wait for PID 36523 returns 0
Sep 3 09:25:55 root@sapnode2 hdbprimary.sh[33070]: (runTrigger): Remote propagation via node [sapnode1]
Sep 3 09:25:55 root@sapnode2 hdbprimary.sh[33070]: (runTrigger): Trigger succeeded
Sep 3 09:25:56 root@sapnode2 hdbprimary.sh[33070]: (runTrigger): HADR provider trigger received
Sep 3 09:25:56 root@sapnode2 hdbprimary.sh[33070]: (runTrigger): Activate eligibility block for failover to hdbsDA1...
Sep 3 09:25:56 root@sapnode2 hdbprimary.sh[33070]: (runTrigger): Eligibility propagation to node [sapnode1]
Sep 3 09:25:56 root@sapnode2 hdbprimary.sh[33070]: (runTrigger): Trigger succeeded
Sep 3 09:25:57 root@sapnode2 hdbprimary.sh[33070]: (runTrigger): HADR provider trigger received
Sep 3 09:25:57 root@sapnode2 hdbprimary.sh[33070]: (runTrigger): Avoid disablement block...
Sep 3 09:25:58 root@sapnode2 hdbprimary.sh[33070]: (runTrigger): Trigger succeeded locally - try to propagate remotely
Sep 3 09:25:58 root@sapnode2 hdbprimary.sh[33070]: (watchdog): Watchdog initiated (PIDs: 36611 for 36606/ 34605 - Timeout: 15 secs)
Sep 3 09:25:58 root@sapnode2 hdbprimary.sh[33070]: (watchdog): wait for PID 36606 returns 0
Sep 3 09:25:58 root@sapnode2 hdbprimary.sh[33070]: (runTrigger): Remote propagation via node [sapnode1]
Sep 3 09:25:58 root@sapnode2 hdbprimary.sh[33070]: (runTrigger): Trigger succeeded
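
A first round of checks while the takeover is still hanging might look like the sketch below (paths are taken from the log above; the package log file name is an assumption by analogy with the SHP packages earlier in the thread):

# on sapnode2, while hdbnsutil -sr_takeover is still running
cmviewcl -v -p hdbpDA1                                        # package state as Serviceguard sees it
tail -f /usr/local/cmcluster/run/log/hdbDA1.log               # SGeSAP package log (script_log_file)
su - da1adm -c "hdbnsutil -sr_state"                          # replication state on the takeover node
ls -ltr /usr/sap/DA1/HDB00/sapnode2/trace/nameserver_*.trc    # HANA nameserver traces for takeover errors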