Service guard lan switching

jpcast_real · ‎10-01-2004

Hello,

These days I have been testing my two-node cluster with ServiceGuard 11.15. On this node I have two fibre network cards, configured with APA, which I am not using for the cluster at all because I do not have fibre connectivity. Apart from these I have a 100 Mb/s LAN, which is the one I use for the heartbeat and the packages.

Athos:/opt/sgmgr/bin> lanscan
Hardware Station        Crd Hdw   Net-Interface  NM MAC   HP-DLPI DLPI
Path     Address        In# State NamePPA        ID Type  Support Mjr#
0/0/0/0  0x00306EC3B263 0   UP    lan0 snap0     1  ETHER Yes     119
0/10/0/0 0x00306EF2B72A 1   UP    lan1 snap1     2  ETHER Yes     119
0/12/0/0 0x00306EF2B719 2   UP    lan2 snap2     3  ETHER Yes     119
LinkAgg0 0x000000000000 900 DOWN  lan900 snap900 4  ETHER Yes     119
LinkAgg1 0x000000000000 901 DOWN  lan901 snap901 5  ETHER Yes     119
LinkAgg2 0x000000000000 902 DOWN  lan902 snap902 6  ETHER Yes     119
LinkAgg3 0x000000000000 903 DOWN  lan903 snap903 7  ETHER Yes     119
LinkAgg4 0x000000000000 904 DOWN  lan904 snap904 8  ETHER Yes     119
Athos:/opt/sgmgr/bin> ifconfig lan0
lan0: flags=843
inet 174.1.10.13 netmask ffffff00 broadcast 174.1.10.255
Athos:/opt/sgmgr/bin> ifconfig lan1
ifconfig: no such interface
Athos:/opt/sgmgr/bin> ifconfig lan2
ifconfig: no such interface
Athos:/opt/sgmgr/bin> ifconfig lan900
lan900: flags=1843
inet 174.1.51.33 netmask ffffff00 broadcast 174.1.51.255

This is the cluster configuration.

NODE_NAME Athos
NETWORK_INTERFACE lan0
HEARTBEAT_IP 174.1.10.13
FIRST_CLUSTER_LOCK_PV /dev/dsk/c12t0d1
SECOND_CLUSTER_LOCK_PV /dev/dsk/c9t0d2

NODE_NAME Porthos
NETWORK_INTERFACE lan0
HEARTBEAT_IP 174.1.10.14

FIRST_CLUSTER_LOCK_PV /dev/dsk/c8t0d1
SECOND_CLUSTER_LOCK_PV /dev/dsk/c14t0d2

I have also configured two packages with the IPs 174.1.10.15 and 174.1.10.16. Both packages monitor the SUBNET 174.1.10.0.

This is the test scenario:

- Two packages running on the same node, Athos, which is connected to the second node, Porthos, by just a single LAN.

- Then I pull the LAN cable from the Athos NIC, and Athos loses all network connectivity.

- Athos, which has no network, then forms a single-node cluster, and Porthos does a TOC.

Is this behaviour normal? In my opinion, since Athos has no network at all, it should stop both packages and then do a TOC. Both packages should then be started on Porthos.

I include more information...

NODE_NAME Athos
NETWORK_INTERFACE lan0
HEARTBEAT_IP 174.1.10.13
# NETWORK_INTERFACE lan900
# HEARTBEAT_IP 174.1.51.33
FIRST_CLUSTER_LOCK_PV /dev/dsk/c12t0d1
SECOND_CLUSTER_LOCK_PV /dev/dsk/c9t0d2
# List of serial device file names
# For example:
# SERIAL_DEVICE_FILE /dev/tty0p0

# Warning: There are no standby network interfaces for lan0.
# Link Aggregate lan900 contains the following port(s): lan2
# Warning: There are no standby network interfaces for lan900.

#NODE_NAME dartanan
# NETWORK_INTERFACE lan0
# HEARTBEAT_IP 174.1.10.11
# NETWORK_INTERFACE lan900
# HEARTBEAT_IP 174.1.51.11
# FIRST_CLUSTER_LOCK_PV /dev/dsk/c29t0d1
# SECOND_CLUSTER_LOCK_PV /dev/dsk/c33t0d2
# List of serial device file names
# For example:
# SERIAL_DEVICE_FILE /dev/tty0p0

# Warning: There are no standby network interfaces for lan0.
# Link Aggregate lan900 contains the following port(s): lan1,lan2
# Warning: There are no standby network interfaces for lan900.

NODE_NAME Porthos
NETWORK_INTERFACE lan0
HEARTBEAT_IP 174.1.10.14
# NETWORK_INTERFACE lan900
# HEARTBEAT_IP 174.1.51.44
FIRST_CLUSTER_LOCK_PV /dev/dsk/c8t0d1
SECOND_CLUSTER_LOCK_PV /dev/dsk/c14t0d2
# List of serial device file names
# For example:
# SERIAL_DEVICE_FILE /dev/tty0p0

ATHOS

Oct 1 12:48:43 Athos cmcld: Timed out node Porthos. It may have failed.
Oct 1 12:46:22 Athos nmbd[28631]: find_response_record: response packet id 26898 received with no matching record.
Oct 1 12:48:43 Athos cmcld: Attempting to adjust cluster membership
Oct 1 12:48:44 Athos cmclconfd[21091]: Updated file /var/adm/cmcluster/frdump.cmcld.2 for node Athos (length = 123862).
Oct 1 12:48:44 Athos cmcld: lan0 failed
Oct 1 12:46:22 Athos nmbd[28631]: [2004/10/01 12:46:22, 0] nmbd/nmbd_responserecordsdb.c:(234)
Oct 1 12:48:44 Athos above message repeats 4 times
Oct 1 12:48:44 Athos cmcld: Subnet 174.1.10.0 down
Oct 1 12:46:22 Athos nmbd[28631]: find_response_record: response packet id 26899 received with no matching record.
Oct 1 12:48:44 Athos above message repeats 2 times
Oct 1 12:48:44 Athos cmcld: Subnet 174.1.10.0 in package pkg-oracle is down.
Oct 1 12:48:44 Athos cmcld: Executing '/etc/cmcluster/pkg-oracle/pkg-oracle.cntl stop' for package pkg-oracle, as service PKG*10241.
Oct 1 12:48:44 Athos cmcld: Subnet 174.1.10.0 in package pkg-bhs is down.
Oct 1 12:48:44 Athos cmcld: Executing '/etc/cmcluster/pkg-bhs/pkg-bhs.cntl stop' for package pkg-bhs, as service PKG*14082.
Oct 1 12:48:44 Athos cmcld: All cluster monitoring LAN interfaces have failed
Oct 1 12:48:45 Athos CM-pkg-oracle[21656]: cmhaltserv oracle-monitor
Oct 1 12:48:45 Athos su: + tty?? root-mad
Oct 1 12:48:46 Athos cmcld: Obtaining First Dual Cluster Lock
Oct 1 12:48:47 Athos cmcld: Obtaining Second Dual Cluster Lock
Oct 1 12:48:48 Athos cmcld: Turning off safety time protection since the cluster
Oct 1 12:48:46 Athos su: + tty?? root-mad
Oct 1 12:48:48 Athos cmcld: may now consist of a single node. If ServiceGuard
Oct 1 12:48:48 Athos cmcld: fails, this node will not automatically halt
Oct 1 12:49:49 Athos cmcld: 1 nodes have formed a new cluster, sequence #2
Oct 1 12:49:49 Athos cmcld: The new active cluster membership is: Athos(id=1)
Oct 1 12:49:49 Athos cmcld: Package pkg-oracle cannot run on this node because subnet 174.1.10.0 is not up
Oct 1 12:49:49 Athos cmcld: Package pkg-bhs cannot run on this node because subnet 174.1.10.0 is not up
Oct 1 12:53:04 Athos su: + tty?? root-mad
Oct 1 12:53:05 Athos CM-pkg-oracle[22098]: cmmodnet -r -i 174.1.10.15 174.1.10.0
Oct 1 12:53:06 Athos cmcld: Service scsupv terminated due to an exit(1).
Oct 1 12:53:06 Athos LVM[22129]: vgchange -a n vg01
Oct 1 12:53:06 Athos LVM[22137]: vgchange -a n vg04
Oct 1 12:53:06 Athos cmcld: Service PKG*10241 terminated due to an exit(0).
Oct 1 12:53:06 Athos cmcld: Halted package pkg-oracle on node Athos.
Oct 1 12:53:06 Athos cmcld: Package pkg-oracle cannot run on this node because subnet 174.1.10.0 is not up
Oct 1 12:53:09 Athos CM-pkg-bhs[22145]: cmhaltserv scsupv

PORTHOS

Oct 1 12:48:44 Porthos cmcld: Timed out node Athos. It may have failed.
Oct 1 12:48:44 Porthos cmcld: Attempting to form a new cluster
Oct 1 12:48:45 Porthos cmclconfd[4668]: Updated file /var/adm/cmcluster/frdump.cmcld.7 for node Porthos (length = 80688).
Oct 1 12:48:48 Porthos cmcld: Obtaining First Dual Cluster Lock
Oct 1 12:48:49 Porthos cmcld: First Cluster lock was denied. Lock was obtained by another node.
Oct 1 12:48:52 Porthos inetd[4851]: registrar/tcp: Connection from Porthos (174.1.10.14) at Fri Oct 1 12:48:52 2004
Oct 1 12:48:52 Porthos cmcld: Cluster lock has been denied
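
Reading the two excerpts side by side shows which node reached the lock disk first. Here is a minimal Python sketch that compares the lock attempts, using only lines copied from the logs above. It assumes the clocks on both nodes are roughly in sync, which the thread does not confirm, and the helper names `parse` and `first_lock_attempts` are illustrative only:

```python
from datetime import datetime

# Lines copied verbatim from the Athos and Porthos syslog excerpts.
LOG_LINES = [
    "Oct 1 12:48:46 Athos cmcld: Obtaining First Dual Cluster Lock",
    "Oct 1 12:48:48 Porthos cmcld: Obtaining First Dual Cluster Lock",
    "Oct 1 12:48:49 Porthos cmcld: First Cluster lock was denied. Lock was obtained by another node.",
]

def parse(line, year=2004):
    """Split a syslog line into (timestamp, host, message).

    Syslog omits the year, so it is passed in (2004 is the thread's date).
    """
    month, day, clock, host, message = line.split(" ", 4)
    stamp = datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")
    return stamp, host, message

def first_lock_attempts(lines):
    """Return {host: time of its first attempt on the first cluster lock}."""
    attempts = {}
    for line in lines:
        stamp, host, message = parse(line)
        if "Obtaining First Dual Cluster Lock" in message:
            attempts.setdefault(host, stamp)
    return attempts

attempts = first_lock_attempts(LOG_LINES)
winner = min(attempts, key=attempts.get)
print(winner)  # Athos
```

If the timestamps are comparable, Athos went for the lock two seconds before Porthos, which matches the "lock was denied" message on Porthos.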

Here rests one who was not what he wanted and didn't want what he was

Sanjay_6 · ‎10-01-2004

Hi,

In ServiceGuard, in a two-node cluster, if the network between the two nodes is lost, whichever node gets hold of the cluster lock disk first will stay up and the other node will do a TOC. It is very difficult to predict which node will do a TOC; it all depends on which node grabs the cluster lock disk first.

Hope this helps.

Regds

jpcast_real · ‎10-01-2004

Hello Sanjay,

Thanks for the answer, but I do not agree with you. When a node in a ServiceGuard cluster has lost its LAN link, it shouldn't go on running the cluster service...

I read this a long time ago, for previous releases.

Here rests one who was not what he wanted and didn't want what he was

Sridhar Bhaskarla · ‎10-01-2004

Hi,

Yes. That's the normal behaviour. In your case ServiceGuard does not really treat it as a 'network failure' but as a 'heartbeat failure', because you configured the LANs as heartbeat LANs.

In such a situation, whichever node acquires the cluster lock stays up and the other gets TOC'ed.

Try configuring a second private heartbeat and try the same test.

-Sri

You may be disappointed if you fail, but you are doomed if you don't try

RAC_1 · ‎10-01-2004

If I understood your post correctly, the packages use a single NIC, lan0, and the cluster also uses it as the heartbeat.

So what you are seeing is exactly right. The packages are running on Athos; it notices a network problem and keeps the packages. The other node does a TOC.

I am sure that if you start the packages on the other node and pull out the cable, Athos will do a TOC and the other node will form a single-node cluster and keep the packages running on it (if it owns the disk lock).

Anil

There is no substitute to HARDWORK

melvyn burnard · ‎10-01-2004

You have a cluster with only one network card configured, and hence a SPOF.
By removing the cable from Athos, you have severed all communication (as far as SG is concerned) between the nodes, and hence they will both go for the cluster lock disc.
Whoever gets the cluster lock disc first will stay up, and the other node will TOC.
This is arbitrary, although it is usually the node that is the cluster coordinator that wins.

To prevent this, you need at least one further LAN, either a standby for the primary, or another heartbeat LAN, or preferably both.

The bottom line is that SG has behaved as expected.

My house is the bank's, my money the wife's, But my opinions belong to me, not HP!

jpcast_real · ‎10-01-2004

OK, you win.

All my life I had thought that a node with no network at all could not keep the cluster running, and that it would do a TOC after stopping the packages.

I rely on you ....

Thanks a lot for your help..

Here rests one who was not what he wanted and didn't want what he was

Sridhar Bhaskarla · ‎10-01-2004

Javier,

"HEARTBEAT" is the key here. If you have a heartbeat net and a production net, and the production net fails on the node where the package is running while the heartbeat is still present, then the package will fail over. There won't be a TOC. So I suggest you configure your lan0 as STATIONARY instead of heartbeat, and another interface as heartbeat; if you then pull out lan0, the behaviour will be different.

-Sri

You may be disappointed if you fail, but you are doomed if you don't try

melvyn burnard · ‎10-01-2004

I am afraid I have to disagree with the last post. If you have more than one LAN and it is supported, then have them all configured as heartbeat LANs. This allows the heartbeats to be exchanged across all the LANs, and if one dies, the others will still be able to communicate, which should prevent a TOC.
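
The point about multiple heartbeat LANs can be put as a toy check: heartbeats keep flowing as long as at least one configured heartbeat LAN is still up. This is only a sketch of the reasoning, not of how cmcld works internally. The function name is made up, and treating lan900 as a second heartbeat is hypothetical (it appears only commented out in the configuration posted above):

```python
def heartbeat_alive(heartbeat_lans, failed):
    """Return True if at least one configured heartbeat LAN has not failed."""
    return any(lan not in failed for lan in heartbeat_lans)

# Single heartbeat LAN, as in the original poster's cluster: pulling lan0
# cuts every heartbeat, so both nodes go for the cluster lock.
print(heartbeat_alive(["lan0"], failed={"lan0"}))            # False

# Hypothetical second heartbeat on lan900: losing lan0 alone is survivable.
print(heartbeat_alive(["lan0", "lan900"], failed={"lan0"}))  # True
```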

My house is the bank's, my money the wife's, But my opinions belong to me, not HP!

Sundar_7 · ‎10-01-2004

I don't want to repeat what others said :-)

But yes, since you have only one LAN card configured under MC-SG, if this LAN card goes down, both nodes think that the other node is down and go for the cluster lock disk.

Whichever node acquires the cluster lock will re-form the cluster. On the other node, the safety timer will expire and the node will be TOCed.

Here, your LAN is the single point of failure.

My suggestion would be to add one more LAN card in the same subnet and configure it as a standby to the primary LAN card.

In this case, if you unplug lan0, both HEARTBEAT and DATA will fail over to the standby LAN card.

If you unplug the standby LAN card too, then both nodes think the other node is down, and the node that manages to grab the cluster lock disk will re-form the cluster.

I don't know if a cross-over using fibre cards is possible :-).

But that is what I have: a dedicated cross-over Ethernet private network from node1 to node2 that serves as the HEARTBEAT LAN.

Learn What to do ,How to do and more importantly When to do ?

Sridhar Bhaskarla · ‎10-01-2004

My last post was *not* intended as a recommended configuration. I would never configure only one heartbeat for ServiceGuard. I always configure two private dedicated heartbeats whenever possible. I use the data network as a heartbeat only if I can't get at least two dedicated heartbeats.

That post was intended to differentiate between a heartbeat failure and a simple data network failure. In the first case, if *all* the heartbeats fail (his original issue), then the node that cannot acquire the lock disk will TOC itself. If the heartbeat is there and there is a network failure on the subnet monitored by the package, on the node running the package, then the package will simply fail over and there will be no TOC.
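
The two cases above can be summarised as a small decision function. This is a toy model of the behaviour described in this thread, for a node that is running a package. It is not ServiceGuard code, and the function and parameter names are invented for illustration:

```python
def node_outcome(heartbeats_up, package_subnet_up, wins_cluster_lock):
    """Toy model of the two failure cases, for a node running a package.

    heartbeats_up:      number of heartbeat LANs still reaching the other node
    package_subnet_up:  whether the subnet monitored by the package is up locally
    wins_cluster_lock:  whether this node grabs the lock disk first (only
                        relevant when every heartbeat is lost)
    """
    if heartbeats_up == 0:
        # Heartbeat failure: both nodes race for the cluster lock disk.
        return "stays up as a one-node cluster" if wins_cluster_lock else "TOC"
    if not package_subnet_up:
        # Plain data-network failure while heartbeats still flow.
        return "package fails over, no TOC"
    return "no change"

# The original test: lan0 carried both heartbeat and data, and Athos won the lock.
print(node_outcome(0, False, True))   # stays up as a one-node cluster
print(node_outcome(0, True, False))   # TOC
# Separate heartbeat still up, data subnet lost on the package's node.
print(node_outcome(1, False, False))  # package fails over, no TOC
```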

-Sri

You may be disappointed if you fail, but you are doomed if you don't try
