
Re: Serviceguard shutting down packages during upgrades

 
Jeff_
Occasional Advisor

Serviceguard shutting down packages during upgrades

Hi, I inherited an old Serviceguard cluster that sits on a single Cisco switch with two virtual networks, so when the network team does firmware updates, both sides go down long enough that Serviceguard shuts down all of the packages on my cluster. I do have a heartbeat switch configured, but I am not sure what value to set so that I don't lose all the packages, and the whole cluster doesn't shut down, during switch upgrades. I am thinking I need a 10 minute timeout, as long as the heartbeat is OK on the cheapo heartbeat switch. Not perfect, but better than the packages dropping dead.
16 REPLIES
Kapil Jha
Honored Contributor

Re: Serviceguard shutting down packages during upgrades

I could not understand your requirement. All I can gather is that you want to increase the heartbeat interval to 10 minutes, which is not possible: the maximum supported value is 30 seconds.

http://docs.hp.com/en/T2767-90067/ch11s05.html

Need more info.

BR,
Kapil+
I am in this small bowl, I wanna see the real world...
Turgay Cavdar
Honored Contributor

Re: Serviceguard shutting down packages during upgrades

There are two parameters that can help you. One is MEMBER_TIMEOUT:

# Cluster Timing Parameters (microseconds).
# The MEMBER_TIMEOUT parameter defaults to 14000000 (14 seconds).
# If a heartbeat is not received from a node within this time, it is
# declared dead and the cluster reforms without that node.
# A value of 10 to 25 seconds is appropriate for most installations.
# For installations in which the highest priority is to reform the cluster
# as fast as possible, a setting of as low as 3 seconds is possible.
# When a single heartbeat network with standby interfaces is configured,
# MEMBER_TIMEOUT cannot be set below 14 seconds if the network interface
# type is Ethernet, or 22 seconds if the network interface type is
# InfiniBand (HP-UX only).
# Note that a system hang or network load spike whose duration exceeds
# MEMBER_TIMEOUT will result in one or more node failures.
# The maximum value recommended for MEMBER_TIMEOUT is 60000000
# (60 seconds).
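
To make the ranges in the template comment above concrete, here is a minimal shell sketch that classifies a candidate MEMBER_TIMEOUT value. The function name classify_member_timeout is hypothetical (it is not a Serviceguard tool), the thresholds are only the ones quoted in the comment (3 seconds as the lowest possible setting, 10 to 25 seconds typical, 60 seconds recommended maximum), and the sketch ignores the 14 second / 22 second floor that applies to a single heartbeat network with standby interfaces.

```shell
#!/bin/sh
# Hypothetical helper, not part of Serviceguard: classify a MEMBER_TIMEOUT
# value (in microseconds) against the ranges quoted in the template comment.
classify_member_timeout() {
    us=$1
    if [ "$us" -lt 3000000 ]; then
        echo "below the 3 second minimum mentioned in the template"
    elif [ "$us" -gt 60000000 ]; then
        echo "above the recommended 60 second maximum"
    elif [ "$us" -ge 10000000 ] && [ "$us" -le 25000000 ]; then
        echo "within the typical 10 to 25 second range"
    else
        echo "allowed, but outside the typical 10 to 25 second range"
    fi
}

classify_member_timeout 14000000    # the default
classify_member_timeout 600000000   # 10 minutes
```

A 10 minute value lands far above the recommended maximum, which suggests this parameter is not meant to ride out an outage of that length.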


The other is in the package config file: SUBNET or MONITORED_SUBNET:
# "MONITORED_SUBNET" specifies the addresses of subnets that are to be
# monitored for this package.
#
# Enter the network subnet name that is to be monitored for this package.
# Repeat this line as necessary for additional subnet names. If any of
# the subnets defined goes down, the package will be switched to another
# node that is configured for this package and has all the defined subnets
# available.
#
# "MONITORED_SUBNET" replaces "SUBNET".

I think the real solution to the switch-related issues is to use a redundant switch configuration on the network.

Matti_Kurkela
Honored Contributor

Re: Serviceguard shutting down packages during upgrades

If your production network is down but heartbeat is OK on your heartbeat switch, the cluster nodes can still communicate with each other, so MEMBER_TIMEOUT is not an issue.

MEMBER_TIMEOUT is used only when node(s) become totally unreachable. Even a single functioning heartbeat network is enough to avoid MEMBER_TIMEOUT.

But if you have SUBNET/MONITORED_SUBNET declarations in your package configuration, they might cause the packages to move to a node that has good connectivity to the monitored subnet.

I seem to recall that Serviceguard is smart enough to check whether the subnet is reachable or not on the potential failover nodes *before* starting to move packages around: if all nodes have lost connectivity to the monitored subnet, moving the packages won't help, so Serviceguard should do nothing. Unfortunately I cannot find the documentation to confirm my dim memory just now...

If you wish to make sure the packages won't move around during a planned maintenance, temporarily disable the AUTO_RUN flag with

cmmodpkg -d package_name

for the duration of the maintenance. After the maintenance is complete, re-enable the automatic failover with:

cmmodpkg -e package_name
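
A minimal sketch of how those two commands might be wrapped for a maintenance window. The package names pkg1 and pkg2 and the function switching_cmds are placeholders, and the script only prints the commands so they can be reviewed before being run on a cluster node. It toggles package switching only; as discussed elsewhere in this thread, a package with a SUBNET dependency may still be halted when its subnet goes down, so treat this as one part of a maintenance plan rather than a complete fix.

```shell
#!/bin/sh
# Print (not run) the cmmodpkg command for each package.
# $1 is "-d" (disable switching) or "-e" (enable switching);
# the remaining arguments are package names (placeholders here).
switching_cmds() {
    flag=$1
    shift
    for pkg in "$@"; do
        echo "cmmodpkg $flag $pkg"
    done
}

echo "# before the switch maintenance:"
switching_cmds -d pkg1 pkg2
echo "# after the maintenance is complete:"
switching_cmds -e pkg1 pkg2
```

Running cmviewcl -v before and after is a reasonable way to confirm the switching state of each package.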

MK
Stephen Doud
Honored Contributor

Re: Serviceguard shutting down packages during upgrades

Serviceguard's default configuration is to send heartbeat every second. NODE_TIMEOUT (cluster ASCII file) can also be configured to disregard loss of heartbeat for up to 30 seconds. After that, SG performs a cluster reformation to avoid split brain syndrome - the result is one node in a 2-node cluster rebooting itself. The node that reboots cannot be predicted.

If you expect a total network outage longer than the NODE_TIMEOUT value, shut down the cluster until repairs are completed.

If only one HB network switch is repaired at a time and the package networks are configured with redundancy in mind, Serviceguard can move the package networks to a standby LAN (on a different switch) without interruption of service. If the network is not configured with redundancy in mind, SG will halt the packages during the subnet outage and will restart them automatically upon detection of functioning networks.

Jeff_
Occasional Advisor

Re: Serviceguard shutting down packages during upgrades

The problem I am having in my 3-node cluster is that when the public network goes down for all 3 nodes, my packages all shut down and do not come back up by themselves. I don't want the packages to shut down if the public network goes down on all 3 nodes while the secondary heartbeat network stays up. Every time there is work or a problem on this switch, it is a single point of failure, even though it has redundancy for the main public network. I would never have designed it this way, but it is what I inherited and budgets are tight. So what I am looking for is that the cluster does no failover or total package shutdown if all nodes lose their public network but not heartbeat, while a single node loss would still trigger normal package failover. I would like the cluster to be able to endure a 10 to 15 minute total network failure on all 3 nodes without shutting down all the packages and making a bigger mess of things. I would hope there is some value I could set to get this behavior, but I'm not sure how.

I have the following parameters set.

HEARTBEAT_INTERVAL 3000000
NODE_TIMEOUT 8000000
AUTO_START_TIMEOUT 600000000
NETWORK_POLLING_INTERVAL 3000000
NETWORK_FAILURE_DETECTION INOUT
MAX_CONFIGURED_PACKAGES 20
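
For reference, the timing values above are in microseconds. A small shell sketch converts them to seconds and compares NODE_TIMEOUT with the 10 to 15 minute outage described above; us_to_s is a hypothetical helper, and the 600 second figure is simply the lower end of that 10 to 15 minute range.

```shell
#!/bin/sh
# Hypothetical helper: convert a Serviceguard timing value from
# microseconds to whole seconds (integer division).
us_to_s() {
    echo $(( $1 / 1000000 ))
}

echo "HEARTBEAT_INTERVAL        $(us_to_s 3000000) s"
echo "NODE_TIMEOUT              $(us_to_s 8000000) s"
echo "AUTO_START_TIMEOUT        $(us_to_s 600000000) s"
echo "NETWORK_POLLING_INTERVAL  $(us_to_s 3000000) s"

# A 10 minute outage (600 s) expressed as a multiple of NODE_TIMEOUT.
outage_s=600
echo "outage / NODE_TIMEOUT = $(( outage_s / $(us_to_s 8000000) ))"
```

So the configured NODE_TIMEOUT is 8 seconds and AUTO_START_TIMEOUT is 10 minutes; a 10 minute outage is 75 times longer than the current NODE_TIMEOUT.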
Jeff_
Occasional Advisor

Re: Serviceguard shutting down packages during upgrades

Also, none of the nodes rebooted themselves, Stephen. Maybe that is because all 3 lost the public network for 15 minutes, so it must have surpassed a stop-trying threshold as well and left all packages shut down even though AUTO_RUN was on?
Turgay Cavdar
Honored Contributor

Re: Serviceguard shutting down packages during upgrades

If you don't want the packages to go down when the public network is down, you should comment out the SUBNET/MONITORED_SUBNET fields in the package config file and reapply your new config.

You can try setting NODE_TIMEOUT to a higher value, but that also means your cluster reformation and package failover times will increase.

If you have a chance to use a hub (you know they are very cheap) for the heartbeat network, then it will solve your problems.
Jeff_
Occasional Advisor

Re: Serviceguard shutting down packages during upgrades

I do not see anything about subnets in the cluster ASCII file. Is it monitored by default, so that you have to add something explicit to exclude it, or did the parameter name change over the years? What you said makes sense, but I do not see a subnet parameter anywhere in the current running cluster config.
Turgay Cavdar
Honored Contributor

Re: Serviceguard shutting down packages during upgrades

Not in the cluster config file; it is in the package config file, probably in /etc/cmcluster/PKG_xxx .
Jeff_
Occasional Advisor

Re: Serviceguard shutting down packages during upgrades

There is a subnet value defined in the package config files. Is there a setting for the poll interval and for how many subnet polling failures are tolerated, so I can set it to, say, 10 minutes of downtime before failover?
Turgay Cavdar
Honored Contributor

Re: Serviceguard shutting down packages during upgrades

There is no "poll interval" parameter for SUBNET. You either set it or you don't. If you set it, then when the network card serving your subnet goes down (and there is no standby card), the package will be switched to another node. If you don't set it, the package continues to run even if your network card goes down.
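
The rule described above can be written out as a small decision function. This is only a sketch of the logic as stated in this reply, not of Serviceguard's actual implementation; the function name and its yes/no arguments are hypothetical.

```shell
#!/bin/sh
# Sketch of the rule as described in the reply above (not Serviceguard code).
# Arguments are "yes" or "no":
#   $1  package has a SUBNET entry in its configuration file
#   $2  the network card serving that subnet is up
#   $3  a standby card is available
package_outcome() {
    subnet_set=$1
    nic_up=$2
    standby=$3
    if [ "$subnet_set" = "no" ]; then
        echo "keeps running"          # subnet is not monitored for this package
    elif [ "$nic_up" = "yes" ] || [ "$standby" = "yes" ]; then
        echo "keeps running"          # subnet still reachable on this node
    else
        echo "switched to another node"
    fi
}

package_outcome yes no no    # monitored subnet lost, no standby
package_outcome no  no no    # same failure, but no SUBNET entry
```

Note that there is no timing argument anywhere in this rule, which matches the point that SUBNET is either set or not set.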
Jeff_
Occasional Advisor

Re: Serviceguard shutting down packages during upgrades

Is it required to have a subnet defined in Serviceguard if you don't want it monitored? If you do want it monitored, is there a way to increase the number of failures and poll intervals before it fails the package over?
Turgay Cavdar
Honored Contributor

Re: Serviceguard shutting down packages during upgrades

No, a SUBNET value is not required; it depends entirely on the needs of your environment.
Jeff_
Occasional Advisor

Re: Serviceguard shutting down packages during upgrades

Is there a way to increase the number of failures and the timeout interval on the subnet?
Turgay Cavdar
Honored Contributor

Re: Serviceguard shutting down packages during upgrades

As far as I know, there is no way to set the number of failures or the timeout interval for the SUBNET value.
Stephen Doud
Honored Contributor

Re: Serviceguard shutting down packages during upgrades

A subnet failure is declared when SG finds that a NIC has failed and its IP cannot be transferred to a standby NIC; SG then marks the subnet down. If any packages have a dependency on that subnet (via the SUBNET parameter in the package configuration file), it halts those packages.

If a package is configured such that it does -not- have a SUBNET dependency (commented out, and cmapplyconf'd), then Serviceguard will not halt that package on a subnet failure. There is no requirement to make a package dependent on a subnet.

There are no package-specific subnet failure timing parameters available.

So, to avoid a package halt on a NIC/network failure, don't use the SUBNET parameter in the package configuration file and cmapplyconf it.
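
A minimal sketch of that change, assuming a legacy-style package configuration file with upper-case SUBNET or MONITORED_SUBNET lines. The file name pkg1.conf and the function comment_out_subnet are hypothetical; the script edits a copy and only prints the cmcheckconf and cmapplyconf commands so they can be reviewed before being run.

```shell
#!/bin/sh
# Hypothetical helper: write a copy of a package configuration file with
# every active SUBNET / MONITORED_SUBNET line commented out.
# Lines that are already comments do not match and are left unchanged.
comment_out_subnet() {
    sed -e 's/^[[:space:]]*SUBNET[[:space:]]/# &/' \
        -e 's/^[[:space:]]*MONITORED_SUBNET[[:space:]]/# &/' "$1" > "$2"
}

# Example with a placeholder file name; only runs if the file exists.
if [ -f pkg1.conf ]; then
    comment_out_subnet pkg1.conf pkg1.conf.new
    echo "review pkg1.conf.new, then run:"
    echo "cmcheckconf -P pkg1.conf.new"
    echo "cmapplyconf -P pkg1.conf.new"
fi
```

Whether the package has to be halted before the new configuration can be applied may depend on the Serviceguard version, so it is worth checking the manual for your release first.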