Array Setup and Networking
aprice119
Valued Contributor

A disruptive non-disruptive failover?

I thought about opening a support ticket but decided that I'd reach out to the friendly Nimble community (not to be confused with your Friendly Neighborhood SE) to start a discussion and find out what I must have done wrong.

Since sometime early in NOS 2.x, possibly about when I switched over to using NCM for path management, we have a bit of an issue during our formerly non-disruptive updates.  The first few upgrades after we bought our Nimble were done midday and never caused a stir, but lately our whole network takes a pause after the controller failover.  Watching the Nimble during the upgrade I see that the tge interfaces have no traffic and, consequently, neither do the volumes.  After a couple of minutes the paths all seem to reconnect and the servers get their drives back.  Luckily our VMware guests handle this pretty well and just kind of sit there while they wait for their disk requests to go through.  But I suspect it's not a great idea to tear out a bunch of hard drives while they're in use.

Has anyone else seen this kind of behavior?  Our topology hasn't changed except for NOS 2.x (currently 2.5.0), NCM, and some UCS firmware releases.  I've reviewed all of the setup guides a couple of times to make sure I'm not missing something obvious, so I hope this doesn't turn out to be something obvious.

The Details

We have a Nimble CS220G-X8 connected to a pair of Cisco UCS FIs as a direct-attached appliance.  VLANs are set up so that FI-A carries one and FI-B carries the other; the Nimble's tge1 ports run to FI-A and the tge2 ports run to FI-B.  We use VMFS datastores and NCM multipathing, which VMware reports is fully functional.  I haven't been connected to our ESXi hosts during a failover, so I'm not sure what they report for their paths while it happens.  We use the software iSCSI client and have one vmnic bound to one vmkernel port per iSCSI VLAN.
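In case it helps anyone compare notes, here's a rough way to capture what the hosts see: a minimal Python sketch (assuming it runs in the ESXi shell, or over SSH to the host, where esxcli is on the PATH) that dumps the iSCSI port bindings and path states to a file so you can diff a snapshot taken before the upgrade against one taken during the failover.  It's nothing Nimble-specific, just a wrapper around two esxcli listings.

```python
#!/usr/bin/env python
# Quick-and-dirty path snapshot, assuming it runs in the ESXi shell
# (or over SSH to the host) where the esxcli binary is on the PATH.
import subprocess
import datetime


def run(cmd):
    """Run an esxcli command and return its stdout as text."""
    return subprocess.check_output(cmd, shell=True).decode("utf-8", "replace")


def snapshot(logfile="/tmp/iscsi_path_snapshot.log"):
    stamp = datetime.datetime.now().isoformat()
    sections = {
        # vmkernel port bindings for the software iSCSI adapter
        "port bindings": "esxcli iscsi networkportal list",
        # every storage path and its current state (active/dead/standby)
        "paths": "esxcli storage core path list",
    }
    with open(logfile, "a") as fh:
        fh.write("===== %s =====\n" % stamp)
        for name, cmd in sections.items():
            fh.write("--- %s ---\n" % name)
            try:
                fh.write(run(cmd))
            except subprocess.CalledProcessError as err:
                fh.write("command failed: %s\n" % err)


if __name__ == "__main__":
    snapshot()
```

Run it once before kicking off the update and again while the controllers are failing over, then diff the two sections of the log to see which paths actually drop.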

Valdereth
Trusted Contributor

Re: A disruptive non-disruptive failover?

Never hurts to start a case with support.

What address zone are you using?  (Single Zone, Bisect, Even/Odd)

rfenton4
Honored Contributor

Re: A disruptive non-disruptive failover?

Hi Alan,

I've performed several controller upgrades, both physical and software.  Typically you will see a pause in IO as the controllers transition from the active to the standby controller.  If you were pinging the management interface you might see a dropped packet, and when watching ESX or a guest you will typically see a pause in IO activity for a few seconds.  More often than not this is around 5-10 seconds, but I recommend setting the SCSI timeout to 120 seconds (which is best practice for pretty much all storage vendors).

This is also the value that NCM will set for Windows/ESX.
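For a Linux guest that NCM or VMware Tools hasn't already taken care of, something along these lines could be used to check (and optionally raise) the per-disk timeout.  This is just a sketch: the sysfs path is the standard Linux location for SCSI disk timeouts, and 120 is simply the value mentioned above.

```python
#!/usr/bin/env python
# Minimal sketch for a Linux guest: report each SCSI disk's timeout and,
# when run with --apply, raise anything below the 120-second value.
# Assumes the standard sysfs layout (/sys/block/<disk>/device/timeout);
# run as root if you want to change the value.
import glob
import sys

TARGET_TIMEOUT = 120  # seconds


def main(apply_fix=False):
    for path in sorted(glob.glob("/sys/block/sd*/device/timeout")):
        disk = path.split("/")[3]
        with open(path) as fh:
            current = int(fh.read().strip())
        print("%s: timeout=%ss" % (disk, current))
        if apply_fix and current < TARGET_TIMEOUT:
            with open(path, "w") as fh:
                fh.write(str(TARGET_TIMEOUT))
            print("  raised to %ss" % TARGET_TIMEOUT)


if __name__ == "__main__":
    main(apply_fix="--apply" in sys.argv)
```

Note that a change made this way only lasts until the next reboot; a udev rule or the vendor tools are the usual way to make it stick.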

Valdereth's advice is good - if you're seeing unexpected behaviour then a call to support should be made so they can check over your configuration.   I doubt address zoning will come into it if you're plugged directly into the Fabric Interconnects of your UCS, as it really only comes into play when there are inter-switch links in the mix.

Cheers

Rich

aprice119
Valued Contributor

Re: A disruptive non-disruptive failover?

Support is my next stop, but I was curious to see whether anyone else had noticed the same behavior.  I do expect an IO pause during a failover while everything is re-learned, and we used to have those short pauses, but now it's a couple of minutes and it noticeably causes servers to stop responding.

I'm using single zones since we split into two distinct subnets on two switches.

Thanks!

Valdereth
Trusted Contributor

Re: A disruptive non-disruptive failover?

That'd be my lack of UCS experience showing through

The longest period where I've seen dropped pings to the discovery addresses was probably only 5-10 seconds, during a software update.  This has always been during low-I/O periods - I'm not sure if the timeouts increase during intense I/O.
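If you want to put a number on the pause, a rough Python sketch like the one below could be left pinging a discovery address through the update and will report the longest run of missed replies.  The IP is a placeholder for your own discovery address, and the ping flags (-c/-W) assume a Linux monitoring box, so adjust for your OS.

```python
#!/usr/bin/env python
# Rough gap monitor: ping a discovery IP once a second and report the
# longest stretch of missed replies. The address below is a placeholder,
# and the ping flags (-c/-W) assume a Linux host; adjust for your OS.
import subprocess
import time

DISCOVERY_IP = "192.168.10.10"  # example only -- use your own discovery IP


def ping_once(host):
    """Return True if one ping to host gets a reply within a second."""
    result = subprocess.call(
        ["ping", "-c", "1", "-W", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result == 0


def monitor(host):
    longest = 0.0
    gap_start = None
    try:
        while True:
            if ping_once(host):
                if gap_start is not None:
                    gap = time.time() - gap_start
                    longest = max(longest, gap)
                    print("gap of %.1f seconds ended" % gap)
                    gap_start = None
            elif gap_start is None:
                gap_start = time.time()
            time.sleep(1)
    except KeyboardInterrupt:
        print("longest observed gap: %.1f seconds" % longest)


if __name__ == "__main__":
    monitor(DISCOVERY_IP)
```

Dropped pings to the discovery address don't prove the data paths were down, but the longest gap is a handy number to hand to support alongside the host logs.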

marty_booth
Occasional Advisor

Re: A disruptive non-disruptive failover?

Hi mate

Would be great if you could post any answers you receive from support here. I have much the same setup as you, but I have only done upgrades during our PoC phase, not when running in full production mode.

Regards,

Marty

aprice119
Valued Contributor

Re: A disruptive non-disruptive failover?

Hi all.

I might be onto something, but it will take until the next software upgrade to confirm whether I've got it fixed.  Being at the NIOP course got me thinking to check a setting that changed in NOS 2.x, and it turns out it hadn't been set right after the upgrade (my bad).  I tried a test failover today and it was as smooth as expected; a software upgrade will be the real challenge.  I'll post back as soon as that's done.

Also, Marty, to your request:

Know that Nimble's failovers are supposed to be transparent, and I've personally seen them work that way in the past.  Nimble's InfoSight metrics still show well over half of their customers performing software upgrades during business hours.  If you haven't seen the blog post from November, check it out: Nimble Storage Blog | Go Ahead – Update Your Storage Operating System in the Middle of the Day.  Those numbers are still tracking from what I've heard, and I'll be back to midday upgrades once I get this bug worked out.

Alan

Valdereth
Trusted Contributor

Re: A disruptive non-disruptive failover?

What setting did you change?

aprice119
Valued Contributor

Re: A disruptive non-disruptive failover?

VMware's discovery IPs listed only the array management interface, as per the 1.x guides.  They were never updated to use the two new discovery IPs that were configured with 2.x.  So, I removed the management address and added the Nimble discovery addresses.
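For anyone wanting to check the same thing, here's a sketch of that sanity check, meant to be run in the ESXi shell.  The two addresses are placeholders: substitute your old management IP and the array's per-subnet discovery IPs.  The actual swap can then be made through the vSphere client or the matching esxcli sendtarget add/remove subcommands.

```python
#!/usr/bin/env python
# Sketch of a sanity check for the dynamic discovery (Send Targets) list,
# meant to run in the ESXi shell. The addresses are placeholders --
# substitute your old management IP and the array's iSCSI discovery IPs.
import subprocess

OLD_MGMT_IP = "10.0.0.50"                   # example management address
DISCOVERY_IPS = ["10.0.1.50", "10.0.2.50"]  # example per-subnet discovery IPs


def sendtarget_list():
    out = subprocess.check_output(
        ["esxcli", "iscsi", "adapter", "discovery", "sendtarget", "list"]
    )
    return out.decode("utf-8", "replace")


def main():
    listing = sendtarget_list()
    print(listing)
    if OLD_MGMT_IP in listing:
        print("WARNING: management IP %s is still a discovery target" % OLD_MGMT_IP)
    for ip in DISCOVERY_IPS:
        if ip not in listing:
            print("WARNING: discovery IP %s is missing" % ip)


if __name__ == "__main__":
    main()
```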

aprice119
Valued Contributor

Re: A disruptive non-disruptive failover?

After updating to 2.2.6.0 today it appears I still have a problem.  Our network took a brief pause during the update, and vCenter logged an "all paths down" event for every Nimble datastore on each of our hosts.  The outage lasted over two minutes, as indicated by other logs noting that the event had been going on for over 140 seconds and that the hosts were switching to I/O fast-fail mode.  Looks like I'll need to open a support ticket.
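For what it's worth, that 140-second figure matches ESXi's default APD timeout (the Misc.APDTimeout advanced option), which is probably why the hosts flipped to fast-fail right around then.  Here's a rough sketch of the post-mortem checks I plan to run in the ESXi shell before opening the ticket; the log path and the "APD" match string are assumptions based on a stock install.

```python
#!/usr/bin/env python
# Sketch for digging into an APD event after the fact, run in the ESXi
# shell. The log path and the "APD" match string below are assumptions
# based on a stock ESXi install.
import subprocess

VMKERNEL_LOG = "/var/log/vmkernel.log"


def show_apd_timeout():
    # Print the current APD timeout advanced setting (defaults to 140s).
    subprocess.call(
        ["esxcli", "system", "settings", "advanced", "list",
         "-o", "/Misc/APDTimeout"]
    )


def show_apd_log_lines():
    # Pull every vmkernel line that mentions APD so the start/exit
    # timestamps (and hence the real outage length) are easy to read off.
    with open(VMKERNEL_LOG) as fh:
        for line in fh:
            if "APD" in line:
                print(line.rstrip())


if __name__ == "__main__":
    show_apd_timeout()
    show_apd_log_lines()
```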