Operating System - HP-UX

How to survive a brief but complete network outage in MC/SG ?

 
Q4you
Regular Advisor

We have multiple redundant LAN cards on both nodes of a 2-node MC/SG cluster (HP-UX 11i, rp8400s), but we are likely to face a brief but complete network outage (say 5 mins) on the cluster.

We don't want any server panic or pkg shutdown during the outage. If we just disable pkg switching and disable the alternate node, will the pkgs keep running on the primary node (though not accessible for a few minutes)? Any ideas/suggestions?

Thnx

-Q
13 REPLIES
David Child_1
Honored Contributor

Re: How to survive a brief but complete network outage in MC/SG ?

I believe your best bet is to shut down the package. You could disable the switching, but that won't stop the package from failing.

David
melvyn burnard
Honored Contributor

Re: How to survive a brief but complete network outage in MC/SG ?

Well, if you are going to have a complete network outage, is it sensible to keep the packages up?
If you still wish to do this, then move all packages on to one node and do a cmhaltnode on the other node. Once the network is back, do a cmrunnode on that node and then redistribute any packages you want to run back on to that node.
My house is the bank's, my money the wife's, But my opinions belong to me, not HP!
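
A minimal dry-run sketch of the sequence described above, written in Python so it only prints the commands instead of running them. The package and node names are hypothetical, and the exact options should be checked against the man pages for your Serviceguard release. It assumes `cmhaltpkg`, `cmrunpkg -n`, `cmhaltnode`, `cmrunnode`, `cmmodpkg -e` and `cmviewcl -v` behave as documented for HP Serviceguard.

```python
def consolidation_plan(package_nodes, keep_node, halt_node):
    """Build (before, after) lists of Serviceguard commands as strings.

    package_nodes maps package name -> node it currently runs on
    (hypothetical names; in practice read them from `cmviewcl -v`).
    Nothing is executed here; this is a dry run for review only.
    """
    before = []
    for pkg, node in package_nodes.items():
        if node == halt_node:
            # Move the package off the node that is about to be halted.
            before.append(f"cmhaltpkg {pkg}")
            before.append(f"cmrunpkg -n {keep_node} {pkg}")
    # Take the now-empty node out of the cluster for the outage window.
    before.append(f"cmhaltnode {halt_node}")

    # Once the network is back: rejoin the node, re-enable switching
    # for the packages that were moved, and check the cluster state.
    after = [f"cmrunnode {halt_node}"]
    after += [f"cmmodpkg -e {pkg}"
              for pkg, node in package_nodes.items() if node == halt_node]
    after.append("cmviewcl -v")
    return before, after


if __name__ == "__main__":
    pre, post = consolidation_plan(
        {"pkgA": "node1", "pkgB": "node2"}, keep_node="node1", halt_node="node2"
    )
    print("Before the outage:")
    for cmd in pre:
        print("  " + cmd)
    print("After the network is restored:")
    for cmd in post:
        print("  " + cmd)
```

Moving packages back to their original node afterwards is left to the operator, since the thread does not say which packages should be redistributed.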
Jeff Schussele
Honored Contributor

Re: How to survive a brief but complete network outage in MC/SG ?

Hi Q,

Well - two things:

1) DON'T run the heartbeats on the public net - create a private net.
2) You'd have to set the polling for the public subnets higher than the longest anticipated outage. But that's problematic because it will make the failover time that much higher.

Ideally you need to have the different public LAN NICs on different switches, such that the NIC will fail over quickly on a switch/NIC failure.
BUT if the entire network is liable to fail, then that makes the network ITSELF the SPOF, and what good does cluster SW do then?
Seems to me that management needs to address this *fundamental* issue, eh?

My 2 cents,
Jeff
PERSEVERANCE -- Remember, whatever does not kill you only makes you stronger!
Q4you
Regular Advisor

Re: How to survive a brief but complete network outage in MC/SG ?

Thanks for your inputs.

Forgot to mention that we have a completely isolated private heartbeat LAN which will stay up during the network outage. I know it does not make sense to keep the pkg running, but if it is a 5 min outage on the network and it all comes back as it was, we want to avoid the pkg/cluster stop/start.

If it was a single cluster, we could do it, but we are talking about 12 such clusters for 12 business units! The pkg stop/start and verification involves a "whole new management mess"...you know what I mean :)
melvyn burnard
Honored Contributor

Re: How to survive a brief but complete network outage in MC/SG ?

So I would go with switching all packages to one of the nodes, cmhaltnode the 2nd node etc...
This would pre-empt a possible TOC if something WERE to go wrong with that private heartbeat link.
My house is the bank's, my money the wife's, But my opinions belong to me, not HP!
A. Clay Stephenson
Acclaimed Contributor

Re: How to survive a brief but complete network outage in MC/SG ?

I think this really points to a weakness in your cluster design, because it is possible to have a complete network failure, even if it is intended and planned. You should be able to kill half of your switches/routers so that a network failover to the alternate LANs occurs, do your maintenance, and then kill the other half. If your network isn't this robust, then you have not passed the MC/SG 101 course.
If it ain't broke, I can fix that.
Steven E. Protter
Exalted Contributor

Re: How to survive a brief but complete network outage in MC/SG ?

You should have a private network, not your main LAN, connecting the built-in LAN cards of all your servers.

A prerequisite is that all servers have an add-in card that talks to the main "public" network at your organization.

Reasons for this private network:
1) SG heartbeat, primary should be here, set a secondary on the public network.
2) Being able to do Ignite boots to recover systems that have had major hardware issues.

Ignite won't boot off add in cards.

The setup above should let you run SG through planned or unplanned network failures. If you hang your private switch/hub off the same UPS as your 9000 servers, you will get a chance to gracefully shut down your systems and cluster.

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
Q4you
Regular Advisor

Re: How to survive a brief but complete network outage in MC/SG ?

Clay,

The problem is, the situation dictates that we *cannot* have a partial network failure. Only the private heartbeat LAN will remain up.

Public nw (with all VLANs/subnets) is going down completely for a few mins.

So the question is,

Will the cluster or pkgs survive it if pkg switching is disabled, and recover w/o any intervention?
A. Clay Stephenson
Acclaimed Contributor

Re: How to survive a brief but complete network outage in MC/SG ?

Okay, about the best answer anyone can give you is "maybe" because we haven't tested your package control scripts. If this were me, I would do a package halt and restart after the network is back up. At least this way, you are in a controlled and well-defined state. I would rather be in the position of saying "this will work" rather than "this may work" or "I think this will work", especially if business continuity is on the line. Besides, how do you know that all the network is going to be back up in 5 minutes? I sure wouldn't trust anyone if they told me that.

All of this should have really been tested on your Sandbox Cluster/Network and then you could answer the question. Don't have one because it's too expensive? This is why you have one.
If it ain't broke, I can fix that.
Jeff Schussele
Honored Contributor

Re: How to survive a brief but complete network outage in MC/SG ?

Hi Q,

I will repeat...

IF you set the polling interval on the public LANs *high* enough to exceed the "anticipated" outages - YES. Otherwise - NO.
But AGAIN consider these three things:
1) This will cause any "normal" (term evidently used loosely in your organization) network outages to cause *much* higher failover times
2) Ditto for any NIC failure
3) Do you seriously think that the times the network folks give you are realistic?

My 2 cents,
Jeff

P.S. To me this is a situation that *begs* to be solved at a much deeper level than you appear to realize. Clusters are absolutely useless in light of an attitude that appears to be in play here.

P.P.S Please don't take this personally, but there *are* things that need to be stood up to in our profession & IMHO this is definitely one of them.

Best Rgds,
Jeff
PERSEVERANCE -- Remember, whatever does not kill you only makes you stronger!
rick jones
Honored Contributor

Re: How to survive a brief but complete network outage in MC/SG ?

Ignoring SG-specific issues, a five minute shutdown of the network is likely to cause at least a fraction of the TCP connections into or out of the complex to fail. On HP-UX the tcp_ip_abort_interval is ten minutes, with a maximum retransmission timer of one minute (iirc) so at that end you are _probably_ OK. However, other stacks only do a small number of retransmissions and that may take less than five minutes to complete. IIRC Windows will only retransmit five times, with the first several retransmissions less than one minute each.

BTW, that ten minute tcp_ip_abort_interval on HP-UX is the _default_ - there are applications out there, desiring "fast connection failure notification", that suggest setting those values lower - sometimes even as low as 60 seconds. That those applications should be using an _application-level_ mechanism for their detection is often lost on those developers... so _all_ applications on the system end up with the short failure time.
there is no rest for the wicked yet the virtuous have no pillows
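
The retransmission arithmetic above can be sketched roughly as follows. This is only an illustration: the 3 second initial timeout and the doubling backoff are assumptions, not measured values for any particular stack, while the one minute cap, the five retransmissions and the ten minute abort interval are the figures quoted from memory in the post.

```python
def time_until_give_up(initial_rto, max_rto, retransmits):
    """Seconds from the first send until a count-limited sender gives up.

    Assumes the timeout doubles after every attempt and is capped at
    max_rto. The sender waits once after the original send and once
    after each retransmission, so there are retransmits + 1 waits.
    """
    total = 0.0
    rto = initial_rto
    for _ in range(retransmits + 1):
        total += min(rto, max_rto)
        rto *= 2
    return total


def survives(outage_seconds, give_up_seconds):
    """A connection with unacknowledged data survives only if the sender
    is still retrying when the network comes back."""
    return give_up_seconds > outage_seconds


if __name__ == "__main__":
    outage = 5 * 60  # the planned five minute outage

    # Hypothetical count-limited peer: 3 s initial timeout (assumption),
    # one minute cap, five retransmissions.
    peer = time_until_give_up(initial_rto=3, max_rto=60, retransmits=5)
    print("count-limited peer gives up after", peer, "s:",
          "survives" if survives(outage, peer) else "connection fails")

    # Time-limited sender with a ten minute abort interval.
    print("ten minute abort interval:",
          "survives" if survives(outage, 10 * 60) else "connection fails")
```

Under these assumptions the count-limited peer stops retrying well inside the outage window, which is the failure mode described above; idle connections with nothing in flight are not covered by this model.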
Florian Heigl (new acc)
Honored Contributor

Re: How to survive a brief but complete network outage in MC/SG ?

I'm a bit confused by these postings - if the private heartbeat network stays up, as he wrote, won't the cluster stay up w/o any problems?
I'd expect the public heartbeat in the described case to be only a secondary means.

(Also, I'd run a SLIP line over fibre converters as a second private line, but that's my personal madness)
yesterday I stood at the edge. Today I'm one step ahead.
Q4you
Regular Advisor

Re: How to survive a brief but complete network outage in MC/SG ?

Ok..Live and Learn !

*All* clusters stayed up and no node panicked (as the HB was alive). The packages which had SUBNET monitoring disabled (commented out in the pkg config) stayed up, with no extra steps required.

The packages which had subnet monitoring enabled took a graceful halt (even though pkg switching was disabled before the change) and required a restart.

We will be making the required configuration changes so that we do not see this situation again during a network outage.

The systems which did not have MC/SG survived better and recovered upon network restore w/o any action.

Thanks for your inputs !
-Q
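
As a follow-up to the result above, here is a small sketch for checking ahead of time which packages still have subnet monitoring enabled. It assumes a legacy-style ASCII package configuration file in which monitored subnets appear as `SUBNET <address>` lines and a leading `#` comments a line out; the sample file content and addresses are hypothetical.

```python
def monitored_subnets(config_text):
    """Return the subnets listed on active (uncommented) SUBNET lines."""
    subnets = []
    for line in config_text.splitlines():
        # Drop anything after '#', so commented-out entries are ignored.
        active = line.split("#", 1)[0].split()
        if len(active) >= 2 and active[0].upper() == "SUBNET":
            subnets.append(active[1])
    return subnets


if __name__ == "__main__":
    # Hypothetical package configuration excerpt.
    sample = """\
PACKAGE_NAME    pkgA
# SUBNET        10.1.1.0
SUBNET          10.1.2.0
"""
    found = monitored_subnets(sample)
    if found:
        print("subnet monitoring enabled for:", ", ".join(found))
    else:
        print("no monitored subnets; package does not depend on a public subnet")
```

Going by the outcome reported in this thread, a package for which this returns an empty list stayed up through the outage, while one with active SUBNET entries was halted and needed a restart.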