Array Setup and Networking

A disruptive non-disruptive failover?

 
SOLVED
Advisor

Re: A disruptive non-disruptive failover?

Alan, please review the following:

  • Network Control policy in UCS
  • Flow Control policy in UCS
  • PortFast is enabled on the switches that the FIs connect to

I saw a similar issue in a slightly different configuration and it was down to spanning tree on the uplink switches.

HPE Blogger

Re: A disruptive non-disruptive failover?

Hello Alan,

Did you manage to get in touch with Nimble Support about this issue? Were they able to resolve the problem?

Would you mind posting any resolution you got? It could benefit the community.

Nick Dyer
Nimble Field CTO & Evangelist

twitter: @nick_dyer_
Valued Contributor

Re: A disruptive non-disruptive failover?

Hi Nick.

Unfortunately no, not yet.  I have a whole list of more pressing problems so I haven't been able to contact support.  I'll definitely be posting back here when we find a solution, though.

Alan

Valued Contributor

Re: A disruptive non-disruptive failover?

I double-checked those policies and our core switch and they're all set to the recommended configurations.  Spanning tree is a very good thought.  It would explain the problem I'm seeing, but I'm using Appliance Ports and have an STP edge directive on our uplinks, so the obvious areas aren't falling victim to an STP timer.  I'm keeping it in mind for continued research, though.

Thanks!

Alan

Occasional Advisor

Re: A disruptive non-disruptive failover?

We also had our first disruptive failover with 2.2.6.0, and Support are investigating.

Breaking stuff since forever
Valued Contributor
Solution

Re: A disruptive non-disruptive failover?

Hi all.

I've been working with Support on this issue and we checked a few things I wanted to share.  Our last upgrade this past weekend worked great, but we've also had things work great in the past only to break again.  So, I don't consider these a resolution yet but they did appear to help.  I'll confirm as the next few releases roll out and I install them.

  • Array logs showed that there was an iSCSI login timeout during the last upgrade, so hosts didn't reconnect to the datastores in a timely fashion.  We don't use anything beyond initiator IQN-based access control, so it's not a CHAP issue.
  • I double-checked Discovery IPs.
  • The support engineer double-checked our UCS network control and flow control policies per the integration guide, and as noted above by Amirul, and found them to be correct.
  • The engineer double-checked our NCM installation to make sure that it was indeed still current and running, and yes, it was.
  • The engineer mentioned that he's seen problems before when outdated (read: default) VMware NIC drivers are used and suggested I make sure the Cisco drivers are current.  I installed the latest Cisco enic bundle on all of our hosts, since we use SW iSCSI.  The drivers, if you're looking for them, are available from the vSphere download pages or as an all-in-one ISO from Cisco.  Only the enic drivers apply to us, but make sure to check your fnic or other vHBA drivers as required.
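
For anyone who wants to confirm that every host ended up on the same enic driver, here is a minimal Python sketch that compares saved output of `esxcli software vib list` across hosts. It is not something Support provided. The VIB names (`net-enic`, `nenic`), the host names, and the sample text with its placeholder version strings are assumptions for illustration, so substitute whatever your own hosts report.

```python
# Sketch: compare Cisco NIC driver VIB versions across ESXi hosts, using text
# saved from `esxcli software vib list`. The VIB names, host names, and sample
# text are illustrative assumptions, not values taken from this thread.

ENIC_VIB_NAMES = {"net-enic", "nenic"}  # assumed names; verify on your hosts

# Hypothetical sample in the column layout esxcli prints (placeholder versions).
SAMPLE = """\
Name       Version        Vendor  Acceptance Level  Install Date
---------  -------------  ------  ----------------  ------------
net-enic   0.0.0-example  Cisco   VMwareCertified   2000-01-01
net-e1000  0.0.0-example  VMware  VMwareCertified   2000-01-01
"""


def find_driver_vibs(vib_list_output, names=ENIC_VIB_NAMES):
    """Return {vib_name: version} for the requested VIBs found in the text."""
    found = {}
    for line in vib_list_output.splitlines():
        fields = line.split()
        # Data rows start with the VIB name, followed by its version.
        if len(fields) >= 2 and fields[0] in names:
            found[fields[0]] = fields[1]
    return found


def group_hosts_by_driver_version(outputs_by_host, names=ENIC_VIB_NAMES):
    """Group hosts by the driver versions they report, to spot stragglers."""
    groups = {}
    for host, output in outputs_by_host.items():
        key = tuple(sorted(find_driver_vibs(output, names).items()))
        groups.setdefault(key, []).append(host)
    return groups


if __name__ == "__main__":
    # Hypothetical host names; the second host has no matching VIB listed.
    outputs = {"esx01": SAMPLE, "esx02": "Name  Version\n----  -------\n"}
    for versions, hosts in group_hosts_by_driver_version(outputs).items():
        print(dict(versions) or "no enic VIB found", "->", sorted(hosts))
```

More than one group in the result means the hosts are not all on the same driver build. Add your fnic or other vHBA VIB names to the set if those apply to you.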

In summary, the one change I made at Support's request was to update the NIC drivers.  I did that, things worked fine during the update, and I'll post back after the next upgrade.

Alan

Valued Contributor

Re: A disruptive non-disruptive failover?

We haven't had a maintenance window in some time but just completed one over the holidays. Everything seemed to be fine this time; the only disruption was to one particular Linux VM (which belongs to a family we've had many different issues with before). I think Support's answer regarding the Cisco custom drivers was the best one, since everything else had been checked a few times before.

Hope someone else finds this useful in the future!

Alan