Operating System - HP-UX

Re: MC/SG & network trouble

 
SOLVED
Mihails Nikitins
Super Advisor

MC/SG & network trouble

Hi,

I'm not sure if I understand MC/SG logic correctly. Imagine the following failure situation: the primary node loses all network connections.

I believe that...
The primary node still runs its packages and keeps the cluster lock disk. The secondary node believes the primary node is down, tries to get the lock disk, fails, and then either stops its cluster software or performs a TOC. There is no standard way to force the primary node to stop its packages and release the cluster lock in case of a network failure.

Thanks and points in advance for your comments!

BR,
Mihails
KISS - Keep It Simple Stupid
6 REPLIES
G. Vrijhoeven
Honored Contributor

Re: MC/SG & network trouble

Hi Mihails,

I am afraid the situation is as you say. If the primary node loses all network traffic and is the first to get the lock, the second node stops its cluster software or performs a TOC. What you can do to prevent this from happening is:

Create a second, separate heartbeat LAN (crossover cable) or configure a serial heartbeat interface.
You can also take a look at the use of an arbitrator node.

HTH,

Gideon
melvyn burnard
Honored Contributor

Re: MC/SG & network trouble

Well, this is the way SG could be expected to react, as you have had a multiple point of failure.
You could look at using a serial heartbeat, although the advice generally is not to use this, or you could put in a dedicated heartbeat lan to a separate hub, on a secure power source.
The other option would be to use a quorum server rather than a lock disk, but be aware that the quorum server can only be addressed via one subnet, so if that subnet were to die as well you would lose connectivity to the QS. Again, though, you are looking at what is in all probability a multiple point of failure.
SG is designed to guard against SPOF or Single Point of Failure.
My house is the bank's, my money the wife's, But my opinions belong to me, not HP!
Stephen Doud
Honored Contributor

Re: MC/SG & network trouble

I would like to clarify my friends' statements further.

When HB traffic entirely ceases, active nodes attempt to re-form a cluster, all the while continuing to operate their packages. Those that "vote" into the new cluster continue to operate their packages without disturbance. Those that fail to get into the new cluster have to TOC (reboot).
In the situation you described (an entire loss of network connections on the primary), whichever node gets to the cluster lock disk first has essentially voted to be the new cluster, even if its NICs are dead.
As Melvyn said, Serviceguard provides a serial-heartbeat feature which enables Serviceguard to keep the current cluster running until SG can discriminate which node has all dead network paths. SG will then force -that- node out of the cluster via TOC.

G. Vrijhoeven mentioned the use of a dedicated crossover HB network, which is another method to ensure proper package ownership. This method allows nodes to continue to pass HB, giving nodes time to detect failed package subnets, which in turn can lead SG to perform a package failover to an adoptive node that still has access to the network.
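
As a rough illustration of the dedicated-HB idea (a toy model in Python, not Serviceguard code; the function name `react` and its arguments are made up for this example): as long as at least one heartbeat path is alive, the cluster has no reason to re-form, and a dead package subnet can be handled as an ordinary package failover.

```python
def react(hb_paths_up, package_subnet_up):
    """Toy decision for one node's situation (hypothetical model, not an SG API).

    hb_paths_up: list of booleans, one per heartbeat path of the node.
    package_subnet_up: True if the node can still reach the package subnet.
    """
    if not any(hb_paths_up):
        # No heartbeat path left: the nodes must re-form the cluster
        # through arbitration (lock disk or quorum server).
        return "cluster re-formation"
    if not package_subnet_up:
        # Heartbeat still flows, so the failed subnet can be detected and
        # the package moved to an adoptive node.
        return "package failover"
    return "no action"


# Package LAN dead, dedicated crossover HB link still alive.
print(react([False, True], package_subnet_up=False))
```

With a single shared LAN carrying both heartbeat and package traffic, the same failure would land in the first branch instead.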

Using a quorum server as an arbitration device would ensure the primary node would not succeed in keeping cluster ownership, since a network connection to the quorum server must be functional in order to remain in the cluster after loss of HB paths. A complete network failure on the primary would also prevent access to the quorum server, resulting in a TOC.
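
A minimal sketch of that quorum-server outcome, again as a toy Python model rather than anything Serviceguard actually exposes (the function name, the node names and the "first reachable requester wins" rule are assumptions of the sketch):

```python
def arbitrate_with_quorum_server(request_order, can_reach_qs):
    """Toy quorum-server arbitration (hypothetical model, not an SG API).

    request_order: node names in the order they ask for the tie-break.
    can_reach_qs: dict mapping node name -> True if the node can reach the QS.
    Returns (survivor, nodes_that_toc).
    """
    survivor = None
    for node in request_order:
        if can_reach_qs[node]:
            # Only a node with a working network path to the QS can win.
            survivor = node
            break
    # Every node that did not win the tie-break has to TOC.
    tocs = [node for node in request_order if node != survivor]
    return survivor, tocs


# The primary asks first but has lost all of its network connections.
print(arbitrate_with_quorum_server(
    ["primary", "secondary"],
    {"primary": False, "secondary": True},
))
```

In this model the isolated primary cannot win no matter how fast it is, which is the difference from the lock disk case.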


-sd
Kent Ostby
Honored Contributor
Solution

Re: MC/SG & network trouble

Mihails --

Just to clarify: when all is running well in a two node cluster, neither node owns the "lock disk".

The "lock disk" is used to break a tie in a two node cluster when the nodes cannot communicate.

So what would actually happen in your scenario is that it would start with a situation where both nodes were communicating and neither node owned the lock disk.

At the point where the primary node lost all of its network connectivity, the nodes at different times would sense that each other had died (from their point of view) and would race to get the lock disk.

If the primary node got there first, then it would indeed form a one node cluster and try to run the packages; however, if the secondary node got there first, it would take over and the primary node would kill itself with a TOC.
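
To make the race concrete, here is a small Python sketch (a toy model only; `lock_disk_race` and the node names are invented for the example and have nothing to do with real Serviceguard interfaces):

```python
def lock_disk_race(arrival_order):
    """Toy model of the lock disk tie-break in a two node cluster.

    arrival_order: node names in the order they reach the lock disk.
    The first node to arrive forms the new cluster; every other node TOCs.
    """
    winner, *losers = arrival_order
    return {"forms_cluster": winner, "toc": losers}


# The isolated primary wins the race: it keeps running as a one node cluster.
print(lock_disk_race(["primary", "secondary"]))
# The secondary wins the race: it takes over and the primary TOCs.
print(lock_disk_race(["secondary", "primary"]))
```

Note that nothing in the model looks at network health: the lock disk is reached over the storage path, so the node with the dead NICs can win just as easily.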

Best regards,

Kent M. Ostby
"Well, actually, she is a rocket scientist" -- Steve Martin in "Roxanne"
Geoff Wild
Honored Contributor

Re: MC/SG & network trouble

What happens is this: if neither node can communicate with the other, they both go for the cluster lock. The first one there stays up and the other TOCs. There is NO guarantee that the node that is running the packages will get the cluster lock.

This is by design.

Rgds...Geoff
Proverbs 3:5,6 Trust in the Lord with all your heart and lean not on your own understanding; in all your ways acknowledge him, and he will make all your paths straight.
Mihails Nikitins
Super Advisor

Re: MC/SG & network trouble

Hi,

Thanks for the replies. Kent's reply was the most useful for me. A dedicated HB LAN is a good solution, but it requires additional NICs. Using a serial heartbeat is prohibited by the docs if you have more than one NIC. It would probably work, but I wouldn't like to run an 'unsupported configuration'.

BR,
Mihails
KISS - Keep It Simple Stupid