Service guard Network

 
jpcast_real
Regular Advisor

Service guard Network

Hello ,

I have an SG environment with three nodes and three LAN interfaces in each node. Two of the LAN interfaces are grouped in an HP APA aggregate:

0/0/0/0 0x00306EC3B259 0 UP lan0 snap0 1 ETHER Yes 119
LinkAgg0 0x00306EF2B715 900 UP lan900 snap900 4 ETHER Yes 119


NODE_NAME Athos
NETWORK_INTERFACE lan0
HEARTBEAT_IP 15.15.15.33
NETWORK_INTERFACE lan900
HEARTBEAT_IP 174.1.51.33
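
For reference, here is a small Python sketch that reads a fragment like the one above and lists which interfaces carry heartbeat. It is not part of Serviceguard: the function name `parse_cluster_ascii` is made up for illustration, and it only understands the three keywords shown in my excerpt.

```python
def parse_cluster_ascii(text):
    """Map each node to a list of (interface, heartbeat_ip) pairs.

    heartbeat_ip is None when no HEARTBEAT_IP line follows the interface.
    Only NODE_NAME, NETWORK_INTERFACE and HEARTBEAT_IP are recognised.
    """
    nodes = {}
    current = None
    for raw in text.splitlines():
        parts = raw.split()
        if len(parts) < 2:
            continue  # skip blank or incomplete lines
        key, value = parts[0], parts[1]
        if key == "NODE_NAME":
            current = value
            nodes[current] = []
        elif key == "NETWORK_INTERFACE" and current is not None:
            nodes[current].append([value, None])
        elif key == "HEARTBEAT_IP" and current is not None and nodes[current]:
            # Attach the address to the most recently seen interface.
            nodes[current][-1][1] = value
    return {name: [tuple(entry) for entry in entries]
            for name, entries in nodes.items()}


SAMPLE = """\
NODE_NAME Athos
NETWORK_INTERFACE lan0
HEARTBEAT_IP 15.15.15.33
NETWORK_INTERFACE lan900
HEARTBEAT_IP 174.1.51.33
"""

if __name__ == "__main__":
    for node, interfaces in parse_cluster_ascii(SAMPLE).items():
        for name, heartbeat_ip in interfaces:
            role = "heartbeat on " + heartbeat_ip if heartbeat_ip else "no heartbeat"
            print(node, name, role)
```

Run against my excerpt, it reports both lan0 and lan900 as heartbeat interfaces, which is the configuration I am asking about.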

Two questions:

- Today I ran some tests, pulling network cables from the LAN cards. I started with a 3-node cluster, and when I removed all network links from one node I got a 2-node cluster. I then did the same again: I removed all the network links from one of the two remaining nodes and, as I expected, communication failed and I got a single-node cluster. However, the node that took over the cluster was the one that had no network connectivity at all. How is this possible? I thought that a node that loses the network does a TOC. When I ran cmviewcl at that point, SG had not noticed that all the network cards were down.

- How can I make one of my network cards carry only heartbeat and no data? I want the cluster to move to another node if the APA goes down, independently of the state of the other network card.

Thanks
Here rests one who was not what he wanted and didn't want what he was
8 REPLIES
Geoff Wild
Honored Contributor

Re: Service guard Network

Interesting, I was led to believe that you can NOT run both APA and SG on the same server.

As far as your issue goes: do you have a cluster lock disk? Did the cluster reform as a cluster of one? If yes, then the server you removed all the network cables from got the cluster lock.


Rgds...Geoff
Proverbs 3:5,6 Trust in the Lord with all your heart and lean not on your own understanding; in all your ways acknowledge him, and he will make all your paths straight.
jpcast_real
Regular Advisor

Re: Service guard Network

APA is supported in an SG environment, at least that is what HP says. I have dual cluster lock disks, and according to the log files the remaining system acquires these disks without any problem:

Apr 26 09:56:55 Athos cmcld: lan0 failed
Apr 26 09:56:55 Athos cmcld: Subnet 15.15.15.0 down
Apr 26 09:58:34 Athos cmcld: Timed out node Porthos. It may have failed.
Apr 26 09:58:34 Athos cmcld: Attempting to adjust cluster membership
Apr 26 09:58:35 Athos cmclconfd[1866]: Updated file /var/adm/cmcluster/frdump.cmcld.9 for node Athos (length = 402537).
Apr 26 09:58:36 Athos cmcld: Link level address on network interface lan900 has been updated from 0x00306ef2b719 to 0x000000000000.
Apr 26 09:58:37 Athos cmcld: Obtaining First Dual Cluster Lock
Apr 26 09:58:38 Athos cmcld: Obtaining Second Dual Cluster Lock
Apr 26 09:58:39 Athos cmcld: Turning off safety time protection since the cluster
Apr 26 09:58:39 Athos cmcld: may now consist of a single node. If ServiceGuard
Apr 26 09:58:39 Athos cmcld: fails, this node will not automatically halt
Apr 26 09:58:57 Athos cmcld: GS connection to 15.15.15.44 not responding, closing

It seems that the system knows that lan0 has failed, but if you run cmviewcl at that moment the interface is still shown as up. Communication is lost, but because SG considers the interface up, the node without any network gets the cluster.
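
To make the sequence easier to follow, here is a rough Python sketch that pulls the relevant cmcld events out of syslog lines like the ones above. It is only an illustration: the function name `summarize` and the message patterns are assumptions based on the lines in this post, not a documented Serviceguard log format.

```python
import re

# "Apr 26 09:56:55 Athos cmcld: lan0 failed" -> timestamp, host, message
LINE = re.compile(r"^(\w{3}\s+\d+\s+[\d:]+)\s+(\S+)\s+cmcld:\s+(.*)$")


def summarize(log_text):
    """Collect failed interfaces, timed-out nodes and lock attempts."""
    events = {"failed_interfaces": [], "timed_out_nodes": [], "lock_attempts": 0}
    for line in log_text.splitlines():
        match = LINE.match(line)
        if not match:
            continue  # not a cmcld line, or a wrapped continuation
        message = match.group(3)
        failed = re.match(r"(lan\d+) failed", message)
        timed_out = re.match(r"Timed out node (\S+?)\.", message)
        if failed:
            events["failed_interfaces"].append(failed.group(1))
        elif timed_out:
            events["timed_out_nodes"].append(timed_out.group(1))
        elif message.startswith("Obtaining") and "Cluster Lock" in message:
            events["lock_attempts"] += 1
    return events


SAMPLE = """\
Apr 26 09:56:55 Athos cmcld: lan0 failed
Apr 26 09:58:34 Athos cmcld: Timed out node Porthos. It may have failed.
Apr 26 09:58:37 Athos cmcld: Obtaining First Dual Cluster Lock
Apr 26 09:58:38 Athos cmcld: Obtaining Second Dual Cluster Lock
"""

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

On these four lines it reports lan0 as failed, Porthos as timed out, and two lock attempts, which matches what I described: Athos logged the lan0 failure and still went for both lock disks.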
Steven E. Protter
Exalted Contributor

Re: Service guard Network

Based on a recent SG class, APA should work with SG.

How to make one of your NIC cards heartbeat only:

Get a hub, make sure it has reliable electrical power, plug into that. Use those IP addresses exclusively for SG heartbeat. Don't assign host names, don't run any data through there.

SG requires link level connectivity for a heartbeat. No routers allowed.

I would think based on the loss of heartbeat on a three node cluster, two of the nodes should have gone TOC if you removed all of the network cables. I guess you didn't do them all at the same time and heartbeat was maintained.

Perhaps a quorum server would be in order.

SEP
Geoff Wild
Honored Contributor

Re: Service guard Network

Yes - you are right about APA and SG:

http://docs.hp.com/cgi-bin/fsearch/framedisplay?top=/hpux/onlinedocs/J4240-90021/J4240-90021_top.html&con=/hpux/onlinedocs/J4240-90021/00/00/50-con.html&toc=/hpux/onlinedocs/J4240-90021/00/00/50-toc.html&searchterms=apa&queryid=20040426-103102

That is good news for me.

I still say that the fact that this node got the cluster lock

Apr 26 09:58:37 Athos cmcld: Obtaining First Dual Cluster Lock
Apr 26 09:58:38 Athos cmcld: Obtaining Second Dual Cluster Lock

is the reason the node stays up.

Now as to why a cmviewcl -v says lan0 is up when there are no cables - not too sure...

What version of MC/SG are you running?

Rgds...Geoff
jpcast_real
Regular Advisor

Re: Service guard Network

I have Serviceguard 11.15.

Re: Service guard Network

So you start with a 3-node cluster and pull all the connections from one node. Serviceguard correctly detects the failure and reforms into a 2-node cluster. The two remaining nodes form a quorum of 2/3, or 66%, so that's OK: they can both see each other, and neither of them can see the disconnected node, so they *know* that node will be toast, as it holds only 1/3, which is less than 50%, and will have TOC'd itself.

Now you have a two-node cluster, and that's always a special situation, as there's no one else to arbitrate between the two nodes. So you pull all the connections from one node. Now neither node can talk to the other, BUT neither node can form a quorum either (1/2 = 50%, which is not a quorum). Neither node has less than 50%, so there is no TOC straight away, and as neither node can talk to the other, how can they *know* that the other node doesn't still have network access? (The failure could have been in a network component in between the two nodes.) In this situation, simply forming a 1-node cluster just because you can still see the network could lead to two 1-node clusters and a split-brain situation, and that means corrupt data! So the only safe thing to do is use the cluster lock, which in this case means going for the cluster lock disk. According to your posted logs, it looks like Athos got there before Porthos, and Porthos was TOC'd. If Porthos was the node with 'good' network connections, that's just bad luck I'm afraid; you have simulated multiple points of failure, after all.
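
If it helps, the rule I've just described can be written down as a few lines of Python. This is a simplified model of the quorum arithmetic only, not Serviceguard's actual code; the function name `membership_decision` and the three result strings are invented for the illustration.

```python
def membership_decision(reachable, previous):
    """Decide what a group of nodes does after a membership change.

    reachable: nodes in this group that can still see each other
               (including the node making the decision)
    previous:  number of nodes in the cluster before the failure
    """
    if previous <= 0 or not 0 < reachable <= previous:
        raise ValueError("need 0 < reachable <= previous")
    if 2 * reachable > previous:
        return "reform"        # more than 50%: quorum, form the new cluster
    if 2 * reachable < previous:
        return "halt"          # less than 50%: no quorum, the node TOCs
    return "cluster_lock"      # exactly 50%: race for the tie-breaker


if __name__ == "__main__":
    print(membership_decision(2, 3))  # the two nodes left after the first pull
    print(membership_decision(1, 3))  # the isolated node in the first test
    print(membership_decision(1, 2))  # either node in the second test
```

Note that nothing in this decision looks at whether the node still has a working network. In the 50% case, whichever node reaches the lock first wins, which is how a node with no cables can end up as the surviving cluster.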

Now on to those cluster lock disks: are you using two disk arrays in some sort of stretch cluster? There are only very special situations where you should use dual cluster lock disks; in many standard scenarios they can actually reduce availability. Review what the manual has to say about dual cluster lock disks here:

http://docs.hp.com/hpux/onlinedocs/B3936-90073/B3936-90073.html

See Chapter 3, the section on how the Cluster Manager works.

As suggested above, a quorum server may work better for you. It would certainly prevent the situation you had when losing network connectivity in a 2-node cluster.

How can you be sure that a NIC is used for heartbeat only? Don't use it! By which I mean don't bind your application to that IP, and don't allow your clients to connect to it (by not advertising it in DNS or configuring it on the clients).


HTH

Duncan

I am an HPE Employee
jpcast_real
Regular Advisor

Re: Service guard Network

I have opened a case with the Hewlett Packard Response Center and they have told me that there is a bug in the Serviceguard 11.15 release. The cluster should check whether the network connections are up before starting the services.

I hope HP Labs will solve this problem as soon as possible.
Radhakrishnan Venkatara
Trusted Contributor

Re: Service guard Network

Happens...

I had the same problem. The server that gets the cluster lock disk first forms the single-node cluster, but it doesn't check whether the network is available or not.
The server is in production now, so we couldn't do much testing on it.
Negative thinking is a highest form of Intelligence