BladeSystem - General

strange NCU behavior

 
srussell
Advisor

strange NCU behavior

On two separate occasions, affecting two different servers, the following has occurred:
1) Server A is restarted.
2) Server B's NCU primary network link fails over to the standby link when Server A is restarted.

In more detail: we had a maintenance window the other night to configure a virtual server for a new Windows 2003 cluster. The server was restarted multiple times during the install, but on one of those restarts we noticed that, at the exact same moment, the NCU on our production cluster failed and all connections to its passive node were lost. The production cluster server did not restart; only its NCU appeared to break when the non-production cluster server was restarted. All IP information is different between the clusters. I have also had this happen to another production cluster server, where the same behavior was noted: Server A was restarted, and Server B's NCU then failed on its primary link.

Has anyone seen behavior like this? We are getting quite afraid to restart any HP servers, considering the damage it caused the last time.
4 REPLIES
HEM_2
Honored Contributor

Re: strange NCU behavior

Yes, sounds strange...

The first question to ask is what type of failover occurs on Server B in the example above. Is it link loss on the primary network link, or does it fail over because of an RX or TX path failure (heartbeats)? You can gather this information by looking at the CPQTeam log entries in the System Log.

If it was link loss, ensure that the NICs actually attach to the switch ports that you think they do. Make sure the switch isn't misconfigured for some kind of link state tracking or uplink failure detection.

If it was due to RX or TX Path Validation, make sure that both the primary and standby network links really do connect to switch ports in the same VLAN. It could be that they do not, and that the primary link has been receiving path validation frames (heartbeats) from Server A all along. When Server A was rebooted, Server B's primary NIC no longer saw path validation frames and failed over to its standby NIC, which was in the wrong VLAN, resulting in loss of connectivity.

Path Validation frames are L2 multicast and will be heard by all NIC teams in a broadcast domain/VLAN.
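
To make the wrong-VLAN scenario above concrete, here is a toy Python model. It is only a sketch under stated assumptions: the class names, the limit of three missed heartbeats, and the VLAN numbers 10 and 20 are all made up for illustration, and this is not how the teaming driver is actually implemented.

```python
from dataclasses import dataclass


@dataclass
class Nic:
    name: str
    vlan: int  # VLAN of the switch port this NIC is cabled to


class Team:
    """Toy active/standby team that fails over when heartbeats stop arriving."""

    def __init__(self, primary, standby, missed_limit=3):
        self.active = primary
        self.standby = standby
        self.missed = 0
        self.missed_limit = missed_limit  # hypothetical threshold

    def tick(self, other_team_vlans):
        """Advance one heartbeat interval.

        other_team_vlans: set of VLANs in which some other team is
        currently sending heartbeats.
        """
        # A teammate's heartbeats only arrive if both NICs share a VLAN.
        teammate_heard = self.active.vlan == self.standby.vlan
        other_heard = self.active.vlan in other_team_vlans
        if teammate_heard or other_heard:
            self.missed = 0
        else:
            self.missed += 1
            if self.missed >= self.missed_limit:
                # Fail over: the standby NIC becomes the active one.
                self.active, self.standby = self.standby, self.active
                self.missed = 0
        return self.active.name


# Server B with its standby NIC cabled into the wrong VLAN (assumed 20).
team_b = Team(Nic("primary", vlan=10), Nic("standby", vlan=20))

# Server A sends heartbeats in VLAN 10 until it is rebooted.
server_a_up = [True, True, False, False, False]
history = [team_b.tick({10} if up else set()) for up in server_a_up]

# The same reboot seen by a correctly cabled team (both NICs in VLAN 10).
team_ok = Team(Nic("primary", vlan=10), Nic("standby", vlan=10))
history_ok = [team_ok.tick({10} if up else set()) for up in server_a_up]
```

In this model the miscabled team stays on its primary NIC only as long as Server A's heartbeats arrive, then ends up active on a NIC in VLAN 20, while the correctly cabled team never moves because its two NICs can hear each other.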
srussell
Advisor

Re: strange NCU behavior

Server A and Server B are both in the same VLAN. Port connections are correct for both servers. I have done some reading and would appreciate your thoughts on the following:

1) On the night of the issue, Server A was undergoing maintenance and several restarts occurred.
2) The VLAN that Server A and Server B reside in is heavily populated.
3) Broadcast traffic within that VLAN could have been high, given the restarts and the large number of other production servers in the VLAN.

I have seen that enabling PortFast on a Cisco switch will resolve such issues. This is not the first time we have encountered this issue: it has occurred between two other servers that also reside in the same VLAN, with the same event ID 434.
HEM_2
Honored Contributor

Re: strange NCU behavior

Having PortFast enabled on switch ports where servers are connected is a really good idea. Without PortFast on a server switch port, every time that link goes up or down it causes a Spanning Tree topology change. When an STP topology change occurs, the MAC address tables on that switch and on others in the STP domain are flushed. Until the MAC addresses are learned again, all traffic floods. If a lot of Spanning Tree topology changes are going on, the flooding can get to the point where servers are inundated with traffic and may not be able to process things like path validation frames in time.

PortFast is highly recommended, especially on a heavily populated VLAN.
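
As a rough illustration of why a flushed MAC address table leads to flooding, here is a minimal Python sketch of a learning switch. The class, the 24-port count, and the short MAC strings are assumptions for the example, and emptying the table in one step is a simplification of what real switches do on a topology change.

```python
class LearningSwitch:
    """Toy L2 switch: learns source MACs, floods frames to unknown destinations."""

    def __init__(self, num_ports):
        self.num_ports = num_ports
        self.mac_table = {}  # MAC address -> port it was last seen on

    def forward(self, src_mac, dst_mac, in_port):
        """Return the list of ports the frame is sent out of."""
        self.mac_table[src_mac] = in_port  # learn where the sender lives
        out_port = self.mac_table.get(dst_mac)
        if out_port is None:
            # Unknown destination: flood to every port except the ingress port.
            return [p for p in range(self.num_ports) if p != in_port]
        return [out_port]

    def topology_change(self):
        """Simplified model of a topology change: the MAC table is emptied."""
        self.mac_table.clear()


switch = LearningSwitch(num_ports=24)
switch.forward("aa", "bb", in_port=1)  # bb unknown: flooded, aa learned
switch.forward("bb", "aa", in_port=2)  # bb learned, sent only to port 1

before = switch.forward("aa", "bb", in_port=1)  # goes to a single port
switch.topology_change()
after = switch.forward("aa", "bb", in_port=1)   # flooded again
```

Before the topology change the frame for "bb" leaves on one port; right after it, the same frame is copied to all 23 other ports, so every server on the switch has to receive and discard it.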
srussell
Advisor

Re: strange NCU behavior

It turns out that we do in fact have PortFast enabled. I am open to any other explanations for why restarting one server in a VLAN would cause another server's NIC teaming to fail.