Re: A disruptive non-disruptive failover?

aprice119 · ‎03-24-2015

I thought about opening a support ticket but decided that I'd reach out to the friendly Nimble community (not to be confused with your Friendly Neighborhood SE) to start a discussion and find out what I must have done wrong.

Since sometime early in NOS 2.x, possibly about when I switched over to using NCM for path management, we have had a bit of an issue during our formerly non-disruptive updates. The first few upgrades after we bought our Nimble were done midday and never caused a stir, but lately our whole network takes a pause after the controller failover. Watching the Nimble during the upgrade, I see that the tge interfaces have no traffic and, consequently, neither do the volumes. After a couple of minutes the paths all seem to reconnect and the servers get their drives back. Luckily our VMware guests handle this pretty well and just kind of sit there while they wait for their disk requests to go through. But I suspect it's not a great idea to tear out a bunch of hard drives while they're in use.

Has anyone else seen this kind of behavior? Our topology hasn't changed except for NOS 2.x (currently 2.5.0), NCM, and some UCS firmware releases. I've reviewed all of the setup guides a couple of times to make sure I'm not doing something obvious, so I hope this turns out to not be something obvious.

The Details

We have a Nimble CS220G-X8 connected to a pair of Cisco UCS FIs as a direct-attached appliance. VLANs are set so that FI-A has one and FI-B has another; the Nimble's tge1 ports run to FI-A and the tge2 ports run to FI-B. We use VMFS datastores and NCM multipathing, which VMware reports is fully functional. I haven't been connected to our ESXi hosts during a failover, so I'm not sure what they report for their paths while it happens. We use the software iSCSI client and have one vmnic bound to one VMkernel port per iSCSI VLAN.

Valdereth · ‎03-25-2015

Never hurts to start a case with support.

What address zone are you using? (Single Zone, Bisect, Even/Odd)

rfenton4 · ‎03-25-2015

Hi Alan,

I've performed several controller upgrades, both physical and software. Typically you will see a pause in IO as the array transitions from the Active to the Standby controller. If you were pinging the management interface you might see a dropped packet, and when watching ESX or a guest you will typically see IO activity pause for a few seconds. More often than not this is around 5-10 seconds, but it is recommended to set the SCSI timeout to 120 seconds (which is best practice for pretty much all storage vendors).

This is also the value that NCM will set for Windows/ESX.
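
For Linux guests, where NCM is not in the picture, the per-disk SCSI timeout is exposed in sysfs as /sys/block/<disk>/device/timeout. Here is a minimal sketch, assuming that standard sysfs layout and using the 120-second value recommended above, that lists disks set lower. The function name and the `sys_block` parameter are made up for illustration.

```python
from pathlib import Path

RECOMMENDED_TIMEOUT = 120  # seconds, the value recommended in this thread


def disks_below_timeout(sys_block="/sys/block", minimum=RECOMMENDED_TIMEOUT):
    """Return {disk: timeout} for SCSI disks whose timeout is under `minimum`.

    Assumes the standard Linux sysfs layout: /sys/block/<disk>/device/timeout.
    Devices without that attribute (loop devices, for example) are skipped.
    """
    low = {}
    for dev in sorted(Path(sys_block).iterdir()):
        timeout_file = dev / "device" / "timeout"
        if not timeout_file.is_file():
            continue
        try:
            value = int(timeout_file.read_text().strip())
        except (OSError, ValueError):
            continue  # unreadable or non-numeric: leave it for a manual check
        if value < minimum:
            low[dev.name] = value
    return low


if __name__ == "__main__":
    if Path("/sys/block").is_dir():
        for disk, value in disks_below_timeout().items():
            print(f"{disk}: timeout {value}s is below {RECOMMENDED_TIMEOUT}s")
```

This only reports; it does not change anything. Whether a given guest needs the value raised by hand depends on what its VMware tools or udev rules already set.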

Valdereth's advice is good: if you're seeing unexpected behaviour then a call to support should be made so they can check over your configuration. I doubt address zoning will come into it if you're plugged directly into the Fabric Interconnects of your UCS, as this really only comes into play when there are inter-switch links in the mix.

Cheers

Rich

aprice119 · ‎03-25-2015

Support is my next stop but I was curious to see if anyone else would have noticed the same behavior. I do expect an IO pause during a failover while everything is re-learned, and we used to have those short pauses, but now it's a couple of minutes and noticeably causes servers to stop responding.

I'm using single zones since we split into two distinct subnets on two switches.

Thanks!

Valdereth · ‎03-25-2015

That'd be my lack of UCS experience showing through

The longest period where I've seen dropped pings to the discovery addresses was probably only 5-10 seconds during a software update. This has always been during low I/O periods, so I'm not sure if timeouts increase during intense I/O.
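
To put a number on the gap instead of eyeballing pings, something like the following could be left running during a failover. It is only a sketch: it probes TCP port 3260 (the standard iSCSI port) rather than sending ICMP pings, so it measures when the target accepts connections again, not when the address starts answering. The function names, the one-second interval and the ten-minute duration are all arbitrary choices for illustration.

```python
import socket
import time


def port_is_open(host, port=3260, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def longest_outage(samples):
    """Given [(timestamp, ok), ...] in time order, return the longest run of
    failed probes in seconds, measured from the first failure to the next
    success (or to the last sample if the run never recovers)."""
    longest = 0.0
    down_since = None
    last_ts = None
    for ts, ok in samples:
        if not ok and down_since is None:
            down_since = ts          # outage starts here
        elif ok and down_since is not None:
            longest = max(longest, ts - down_since)
            down_since = None        # outage over
        last_ts = ts
    if down_since is not None and last_ts is not None:
        longest = max(longest, last_ts - down_since)
    return longest


def watch(host, duration=600, interval=1.0):
    """Probe `host` every `interval` seconds for `duration` seconds and
    return the longest outage seen."""
    samples = []
    end = time.monotonic() + duration
    while time.monotonic() < end:
        samples.append((time.monotonic(), port_is_open(host)))
        time.sleep(interval)
    return longest_outage(samples)
```

Calling `watch("192.0.2.10")` (a placeholder address) against each discovery IP just before starting the update would show whether the pause is in the 5-10 second range or closer to minutes.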

marty_booth · ‎03-25-2015

Hi mate

Would be great if you could post any answers you receive from support here. I have much the same setup as you, but I have only done upgrades during our PoC phase, not when running in full production mode.

Regards,

Marty

aprice119 · ‎03-29-2015

Hi all.

I might be onto something, but it will take until the next software upgrade to confirm whether I've got it fixed. The NIOP course got me thinking about a setting that changed in NOS 2.x, and it turns out it hadn't been set right after the upgrade (my bad). I tried a test failover today and it was as smooth as expected; a software upgrade will be the real challenge. I'll post back as soon as that's done.

Also, Marty, to your request:

Know that Nimble's failovers are supposed to be transparent, and I've seen them work that way personally in the past. Nimble's InfoSight metrics still show well over half of their customers performing software upgrades during business hours. If you haven't seen the blog post from November, check it out: Nimble Storage Blog | Go Ahead – Update Your Storage Operating System in the Middle of the Day. Those numbers are still tracking from what I've heard, and I'll be back at midday upgrades once I get this bug worked out.

Alan

Valdereth · ‎03-30-2015

What setting did you change?

aprice119 · ‎03-30-2015

VMware's discovery IPs listed only the array management interface, as per the 1.x guides. They were never updated to use the two new discovery IPs that were configured with 2.x. So, I removed the management address and added the Nimble discovery addresses.
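
For anyone wanting to check a number of hosts for the same leftover, the comparison is simple enough to script. This is a rough sketch that assumes you have already collected each host's dynamic discovery list as plain address strings by whatever means you prefer; the function name and the addresses are placeholders, not values from our environment.

```python
def audit_discovery_targets(configured, discovery_ips, management_ip):
    """Compare a host's dynamic discovery (send targets) list with what the
    array expects. All arguments are plain address strings without ports."""
    configured = set(configured)
    expected = set(discovery_ips)
    return {
        # discovery addresses the host does not know about yet
        "missing": sorted(expected - configured),
        # True if the old 1.x-style management address is still listed
        "stale_management": management_ip in configured,
        # anything else that is neither a discovery nor the management address
        "unexpected": sorted(configured - expected - {management_ip}),
    }


if __name__ == "__main__":
    # Placeholder addresses from the documentation ranges, not from the thread.
    report = audit_discovery_targets(
        configured=["192.0.2.5"],
        discovery_ips=["198.51.100.10", "203.0.113.10"],
        management_ip="192.0.2.5",
    )
    print(report)
```

A host in the state I described would come back with both discovery addresses under "missing" and "stale_management" set to True.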

aprice119 · ‎05-08-2015

After updating to 2.2.6.0 today it appears I still have a problem. Our network took a brief pause during the update and vCenter logged an "all paths down" event for every Nimble datastore on each of our hosts. The outage lasted over two minutes, as indicated by other log entries noting that the event had gone on for more than 140 seconds and that the hosts were switching to I/O fast fail mode. Looks like I'll need to open a support ticket.

amirul93 · ‎05-13-2015

Alan, please review the following:

Network Control Policy in UCS

Flow Control policy in UCS

Portfast is enabled on the switches connected to the FIs

I saw a similar issue in a slightly different configuration and it was down to spanning tree on the uplink switches.

Nick_Dyer · ‎05-20-2015

Hello Alan,

Did you manage to get in touch with Nimble Support with regards to this issue? Were they able to resolve the problem?

Would you mind posting up any resolution you got, as this could benefit the community.

Nick Dyer
twitter: @nick_dyer_

aprice119 · ‎05-21-2015

Hi Nick.

Unfortunately no, not yet. I have a whole list of more pressing problems so I haven't been able to contact support. I'll definitely be posting back here when we find a solution, though.

Alan

aprice119 · ‎05-21-2015

I double-checked those policies and our core switch and they're all set to the recommended configurations. Spanning tree is a very good thought. It would explain the problem I'm seeing, but I'm using Appliance Ports and have an STP edge directive on our uplinks, so the obvious areas aren't falling victim to an STP timer. I'm keeping it in mind for continued research, though.

Thanks!

Alan

CBVista · ‎05-24-2015

We also had our first disruptive failover with 2.2.6.0 and support are investigating.

Breaking stuff since forever

aprice119 · ‎07-08-2015

Hi all.

I've been working with Support on this issue and we checked a few things I wanted to share. Our last upgrade this past weekend worked great, but we've also had things work great in the past only to break again. So, I don't consider these a resolution yet but they did appear to help. I'll confirm as the next few releases roll out and I install them.

Array logs showed that there was an iSCSI login timeout during the last upgrade so hosts didn't reconnect to the datastores in a timely fashion. We don't use anything beyond initiator WWN authentication so it's not a CHAP issue.
I double-checked Discovery IPs.
The support engineer double-checked our UCS network control and flow control policies per the integration guide, and as noted above by Amirul, and found them to be correct.
The engineer double-checked our NCM installation to make sure that it was indeed still current and running, and yes, it was.
The engineer mentioned that he's seen problems before when outdated (read: default) VMware NIC drivers are used and suggested I make sure the Cisco drivers are current. I installed the latest Cisco enic bundle on all of our hosts, since we use SW iSCSI. The drivers, if you're looking for them, are available from the vSphere download pages or as an all-in-one ISO from Cisco. Only the enic drivers apply to us, but make sure to check your fnic or other vHBA drivers as required.

In summary, the one change I made at Support's request was to update the NIC drivers. I did that, things worked fine during the update, and I'll post back after the next upgrade.

Alan

aprice119 · ‎01-03-2016

We haven't had a maintenance window in some time but just completed one over the holidays. Everything seemed to be fine this time with the only disruption to one particular Linux VM (which belongs to a family we've had many different issues with before). I think Support's answer regarding the Cisco custom drivers was the best one, since everything else had been checked a few times before.

Hope someone else finds this useful in the future!

Alan
