cluster reforming time

 
???_185
Regular Advisor

Hi all,

I'm running SG 11.16 on HP-UX 11i and I have a 2-node cluster.

I want to reduce the cluster reformation time after a server failure.

I would like it to finish within 30 seconds.

Is that possible?

Thanks in advance

Rgds
zungwon



May 11 18:30:05 apollobk cmclconfd[5251]: Updated file /etc/cmcluster/cmclconfi.
May 11 18:33:31 apollobk cmcld: Communication to node apollo has been interruptd
May 11 18:33:31 apollobk cmcld: Node apollo may have died
May 11 18:33:31 apollobk cmcld: Attempting to form a new cluster
May 11 18:33:31 apollobk cmcld: Beginning standard election
May 11 18:33:32 apollobk cmclconfd[5251]: Updated file /var/adm/cmcluster/frdum.
May 11 18:33:44 apollobk cmcld: Obtaining Cluster Lock
May 11 18:33:45 apollobk cmcld: Turning off safety time protection since the clr
May 11 18:33:45 apollobk cmcld: may now consist of a single node. If Servicegud
May 11 18:33:45 apollobk cmcld: fails, this node will not automatically halt
May 11 18:33:45 apollobk cmcld: This will not affect the behavior of Package Fat
May 11 18:33:45 apollobk cmcld: or Service Failfast. If such a package or servi,
May 11 18:33:45 apollobk cmcld: safety timer will be re-enabled and this node l
May 11 18:33:45 apollobk cmcld: automatically halt.
May 11 18:35:00 apollobk cmcld: Link level address on network interface lan900 .
May 11 18:35:10 apollobk cmcld: Link level address on network interface lan900 .
May 11 18:35:49 apollobk cmcld: 1 nodes have formed a new cluster, sequence #2
May 11 18:35:49 apollobk cmcld: The new active cluster membership is: apollobk()
May 11 18:35:49 apollobk cmcld: One of the nodes is down.
May 11 18:35:49 apollobk cmcld: One or more packages may not be currently runni.
May 11 18:35:50 apollobk cmclconfd[5257]: Updated file /etc/cmcluster/cmclconfi.
May 11 18:35:50 apollobk cmclconfd[5257]: Updated file /etc/cmcluster/cmclconfi.
May 11 18:35:50 apollobk cmclconfd[5251]: Updated file /etc/cmcluster/cmclconfi.
May 11 18:36:34 apollobk cmcld: Link level address on network interface lan900 .
May 11 18:36:37 apollobk cmcld: Link level address on network interface lan900 .
May 11 18:36:52 apollobk su: + 2 root-tuxedo
May 11 18:39:11 apollobk cmcld: Link level address on network interface lan900 .
May 11 18:41:34 apollobk cmcld: New node apollo is joining the cluster
May 11 18:41:34 apollobk cmcld: Attempting to adjust cluster membership
May 11 18:39:09 apollobk cmcld: Link level address on network interface lan900 .
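
For reference, the reformation time in the log above can be read straight off the timestamps. The following is only a minimal sketch: the three lines are copied from the log, the function names are made up for illustration, and the year is a placeholder because syslog stamps do not carry one.

```python
from datetime import datetime

# Three lines copied verbatim from the syslog excerpt above.
START_LINE = "May 11 18:33:31 apollobk cmcld: Node apollo may have died"
LOCK_LINE = "May 11 18:33:44 apollobk cmcld: Obtaining Cluster Lock"
END_LINE = "May 11 18:35:49 apollobk cmcld: 1 nodes have formed a new cluster, sequence #2"


def syslog_time(line, year=2000):
    """Parse the leading 'Mon DD HH:MM:SS' stamp of a syslog line.

    syslog omits the year, so a placeholder year is supplied (assumption).
    """
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def elapsed_seconds(start_line, end_line):
    """Seconds between the timestamps of two syslog lines."""
    return (syslog_time(end_line) - syslog_time(start_line)).total_seconds()


if __name__ == "__main__":
    print("failure detected -> cluster lock:", elapsed_seconds(START_LINE, LOCK_LINE))
    print("failure detected -> new cluster :", elapsed_seconds(START_LINE, END_LINE))
```

With these lines the total comes to 138 seconds, of which only 13 seconds pass before the cluster lock is requested.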
Darren Murray_1
Frequent Advisor

Re: cluster reforming time

Zungwon,

The cluster configuration file controls the timeout and reformation of the cluster.

Look for the following section. Your NODE_TIMEOUT may be set to 30000000 (30 seconds).

I have reduced it to 5 seconds on the cluster I manage.

HEARTBEAT_INTERVAL 1000000
NODE_TIMEOUT 30000000

Configuration/Reconfiguration Timing Parameters (microseconds).

AUTO_START_TIMEOUT 600000000
NETWORK_POLLING_INTERVAL 2000000
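
Because the values are in microseconds, it is easy to misread them by a factor of ten. As a rough sketch (the helper name `timing_in_seconds` is hypothetical, and the excerpt is just the four parameter lines quoted above), they can be converted to seconds like this:

```python
# The four timing parameters quoted above, values in microseconds.
CONFIG_EXCERPT = """\
HEARTBEAT_INTERVAL 1000000
NODE_TIMEOUT 30000000
AUTO_START_TIMEOUT 600000000
NETWORK_POLLING_INTERVAL 2000000
"""


def timing_in_seconds(text):
    """Map each 'NAME <microseconds>' line to its value in seconds.

    Lines that are not exactly a name followed by an integer are skipped.
    """
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            result[parts[0]] = int(parts[1]) / 1_000_000
    return result


if __name__ == "__main__":
    for name, seconds in timing_in_seconds(CONFIG_EXCERPT).items():
        print(f"{name:26s} {seconds:g} s")
```

This only reads the values; it does not validate or apply a cluster configuration.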


Thanks Darren

Warren_9
Honored Contributor

Re: cluster reforming time

Hi,

From the SG manual:

The cmquerycl command supplies default cluster timing parameters for HEARTBEAT_INTERVAL and NODE_TIMEOUT. Changing these parameters will directly impact the cluster's reformation and failover times. It is useful to modify these parameters if the cluster is reforming occasionally due to heavy system load or heavy network traffic.

The default value of 2 seconds for NODE_TIMEOUT leads to a best case failover time of 30 seconds. If NODE_TIMEOUT is changed to 10 seconds, which means that the cluster manager waits 5 times longer to timeout a node, the failover time is increased by 5, to approximately 150 seconds.
NODE_TIMEOUT must be at least 2*HEARTBEAT_INTERVAL. A good rule of thumb is to have at least two or three heartbeats within one NODE_TIMEOUT.
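
The constraint quoted from the manual can be checked before editing the configuration file. This is only an illustrative sketch with made-up function names; it encodes the stated rule (NODE_TIMEOUT at least 2 * HEARTBEAT_INTERVAL) and nothing else that Serviceguard itself may verify.

```python
def heartbeats_per_timeout(heartbeat_us, node_timeout_us):
    """How many heartbeat intervals fit into one NODE_TIMEOUT."""
    return node_timeout_us / heartbeat_us


def check_timing(heartbeat_us, node_timeout_us):
    """Return a list of problems; an empty list means the quoted rule holds.

    Both arguments are in microseconds, as in the cluster configuration file.
    """
    problems = []
    if node_timeout_us < 2 * heartbeat_us:
        problems.append("NODE_TIMEOUT must be at least 2 * HEARTBEAT_INTERVAL")
    return problems


if __name__ == "__main__":
    # 1 s heartbeat with a 2 s node timeout: exactly at the minimum.
    print(check_timing(1_000_000, 2_000_000), heartbeats_per_timeout(1_000_000, 2_000_000))
    # 1 s heartbeat with a 1.5 s node timeout: violates the rule.
    print(check_timing(1_000_000, 1_500_000))
```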

GOOD LUCK!!
John Bigg
Esteemed Contributor
Solution

Re: cluster reforming time

There are a number of factors which affect the failover time. The most important of these is NODE_TIMEOUT, but decreasing the heartbeat interval also helps.

The other parameters mentioned, such as the network polling interval, DO NOT affect the failover time.

The other things which affect failover time are the cluster lock type (a quorum server provides the fastest failover) and whether standby LAN cards are used. Having a standby LAN increases the failover time, since we have to factor in the time to allow a LAN failover should the only remaining heartbeat LAN fail. The type of LAN configured also matters if there is a standby: for example, LAN failover on FDDI is faster than on Ethernet.

To minimise failover time you should reduce the node timeout and heartbeat interval, use a quorum server and live without standby LANs.

However, this has a cost. Without standby LANs you risk a subnet outage after a single failure, so your environment needs to tolerate that, and with a low node timeout you risk false failovers should you experience short hangs or network outages.

Doing this, you should be able to get the failover time below 30 seconds with a node timeout of 2 seconds and a heartbeat interval of 0.5 seconds.
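
Since the configuration file takes microseconds, the 2 second and 0.5 second values have to be converted before they are written down. A small sketch, assuming only the two parameter names used earlier in this thread (the helper `timing_lines` is hypothetical):

```python
def timing_lines(heartbeat_s, node_timeout_s):
    """Render the two timing parameters as configuration-file lines.

    Inputs are in seconds; the file expects whole microseconds.
    """
    return [
        f"HEARTBEAT_INTERVAL {round(heartbeat_s * 1_000_000)}",
        f"NODE_TIMEOUT {round(node_timeout_s * 1_000_000)}",
    ]


if __name__ == "__main__":
    # Values suggested above: 0.5 s heartbeat, 2 s node timeout.
    print("\n".join(timing_lines(0.5, 2)))
```

This only formats the two lines; the rest of the cluster configuration file is left as it is.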

With regard to changing the heartbeat interval, and the rule of thumb given previously that there should be at least 2 or 3 heartbeats per node timeout: I would not suggest making the heartbeat interval larger than 1 second. Since heartbeats are used to communicate between cluster nodes, increasing the interval can have adverse effects during cluster reformations and cause delays in other operations.

Lastly, if failover time is paramount, you should consider SGeFF (Serviceguard Extension for Faster Failover), which could give you a failover time as low as 6 seconds for the same configuration.
John Bigg
Esteemed Contributor

Re: cluster reforming time

Oh yes, I forgot to add that increasing the node timeout by a factor of 5 does NOT increase the failover time by a factor of 5. It's more complicated than that, as there are several different stages to the failover timings.

So, for example, with a node timeout of 2 seconds, a heartbeat interval of 1 second, a GSC SCSI cluster lock and no standby LANs, you get a failover time of around 30 seconds. If you increase the node timeout from 2 to 10 seconds, the failover time increases not to 150 seconds but to around 120 seconds.
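
The arithmetic behind those two figures can be sketched as follows. The ratio comparison uses only the numbers above. The straight-line split into a fixed part and a per-second part is purely an illustrative assumption based on two data points, not how Serviceguard actually computes its stages.

```python
def scaling(timeout_a, failover_a, timeout_b, failover_b):
    """Return (node timeout factor, failover time factor) between two setups."""
    return timeout_b / timeout_a, failover_b / failover_a


def two_point_line(timeout_a, failover_a, timeout_b, failover_b):
    """Fit failover = fixed + per_second * node_timeout through two points.

    ASSUMPTION: a straight line through the two quoted figures; the real
    timings have several stages and need not be linear.
    """
    per_second = (failover_b - failover_a) / (timeout_b - timeout_a)
    fixed = failover_a - per_second * timeout_a
    return fixed, per_second


if __name__ == "__main__":
    # Figures from the post above: 2 s -> ~30 s, 10 s -> ~120 s.
    print(scaling(2, 30, 10, 120))         # timeout x5, failover x4
    print(two_point_line(2, 30, 10, 120))  # illustrative only
```

Under that assumption, part of the failover time does not depend on the node timeout at all, which is why a 5x longer timeout gives roughly a 4x longer failover here rather than 5x.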
???_185
Regular Advisor

Re: cluster reforming time

Thanks