A disruptive non-disruptive failover?
03-24-2015 05:46 PM
I thought about opening a support ticket but decided that I'd reach out to the friendly Nimble community (not to be confused with your Friendly Neighborhood SE) to start a discussion and find out what I must have done wrong.
Since sometime early in NOS 2.x, possibly about when I switched over to using NCM for path management, we have had a bit of an issue during our formerly non-disruptive updates. The first few upgrades after we bought our Nimble were done midday and never caused a stir, but lately our whole network pauses after the controller failover. Watching the Nimble during the upgrade, I see that the tge interfaces carry no traffic and, consequently, neither do the volumes. After a couple of minutes the paths all seem to reconnect and the servers get their drives back. Luckily our VMware guests handle this pretty well and just sit there while they wait for their disk requests to go through. But I suspect it's not a great idea to tear out a bunch of hard drives while they're in use.
Has anyone else seen this kind of behavior? Our topology hasn't changed except for NOS 2.x (currently 2.5.0), NCM, and some UCS firmware releases. I've reviewed all of the setup guides a couple of times to make sure I'm not doing something obvious, so I hope this turns out to not be something obvious.
The Details
We have a Nimble CS220G-X8 connected to a pair of Cisco UCS FIs as a direct-attached appliance. VLANs are set so that FI-A has one and FI-B has another; the Nimble's tge1 ports run to FI-A and the tge2 ports run to FI-B. We use VMFS datastores and NCM multipathing, which VMware reports is fully functional. I haven't been connected to our ESXi hosts during a failover, so I'm not sure what they report for their paths while it happens. We use the software iSCSI client and have one vmnic bound to one VMkernel port per iSCSI VLAN.
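To make the layout above easier to reason about, here is a minimal Python sketch that groups initiator and target ports by subnet and flags any subnet that has ports on only one side. The port names follow the post (vmk ports on the hosts, tge ports on the array), but every IP address and the helper names `paths_by_subnet` and `unreachable_subnets` are hypothetical, chosen only for illustration.

```python
import ipaddress


def paths_by_subnet(initiators, targets):
    """Group initiator and target ports by the subnet they sit in.

    initiators / targets: dicts mapping a port name to an "ip/prefix" string.
    Returns {network: {"initiators": [names], "targets": [names]}}.
    """
    layout = {}
    for role, ports in (("initiators", initiators), ("targets", targets)):
        for name, cidr in ports.items():
            net = ipaddress.ip_interface(cidr).network
            entry = layout.setdefault(net, {"initiators": [], "targets": []})
            entry[role].append(name)
    return layout


def unreachable_subnets(layout):
    """Subnets with ports on only one side, where no iSCSI path can form."""
    return [net for net, ports in layout.items()
            if not ports["initiators"] or not ports["targets"]]


# Hypothetical addressing: one iSCSI subnet per fabric interconnect.
initiators = {"vmk1": "10.0.1.11/24", "vmk2": "10.0.2.11/24"}
targets = {"tge1": "10.0.1.50/24", "tge2": "10.0.2.50/24"}

layout = paths_by_subnet(initiators, targets)
for net, ports in layout.items():
    print(net, ports)
print("subnets without a full path:", unreachable_subnets(layout))
```

With two distinct subnets, each host should end up with at least one initiator and one target in each of them; an entry in the second list would point at a binding or VLAN mistake rather than at the array.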
03-25-2015 06:40 AM
Re: A disruptive non-disruptive failover?
Never hurts to start a case with support.
What address zone are you using? (Single Zone, Bisect, Even/Odd)
03-25-2015 06:56 AM
Re: A disruptive non-disruptive failover?
Hi Alan,
I've performed several controller upgrades (physical) and software updates. Typically you will see a pause in IO as the controllers transition from Active to Standby. If you were pinging the management interface you might see a dropped packet, and when viewing ESX or a guest you will typically see IO activity pause for a few seconds. More often than not this is around 5-10 seconds, but it is recommended to set the SCSI timeout to 120 seconds (which is best practice for pretty much all storage vendors).
This is also the value that NCM will set for Windows/ESX.
valdereth's advice is good: if you're seeing unexpected behaviour then a call to support should be made so they can check over your configuration. I doubt address zoning will come into it if you're plugged directly into the Fabric Interconnects of your UCS, as this really only comes into play when there are inter-switch links in the mix.
Cheers
Rich
03-25-2015 07:21 AM
Re: A disruptive non-disruptive failover?
Support is my next stop but I was curious to see if anyone else would have noticed the same behavior. I do expect an IO pause during a failover while everything is re-learned, and we used to have those short pauses, but now it's a couple of minutes and noticeably causes servers to stop responding.
I'm using single zones since we split into two distinct subnets on two switches.
Thanks!
03-25-2015 07:27 AM
Re: A disruptive non-disruptive failover?
That'd be my lack of UCS experience showing through.
The longest period where I've seen dropped pings to the discovery addresses was probably only 5-10 seconds during a software update. This has always been during low I/O periods; I'm not sure if timeouts increase during intense I/O.
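For anyone who wants to put a number on the pause rather than eyeball a ping window, the following Python sketch computes the longest outage from a list of timestamped probe results. It assumes you already have such samples (for example from a ping or TCP probe against a discovery address); the function name `longest_outage` and the sample data are made up for illustration.

```python
def longest_outage(samples):
    """Return the longest run of failed probes, in seconds.

    samples: list of (timestamp_seconds, reachable) tuples in time order.
    An outage runs from the first failed probe to the next successful one.
    An outage still open at the end of the data is measured to the last sample.
    """
    longest = 0
    outage_start = None
    for ts, ok in samples:
        if not ok and outage_start is None:
            outage_start = ts            # outage begins
        elif ok and outage_start is not None:
            longest = max(longest, ts - outage_start)
            outage_start = None          # outage ends
    if outage_start is not None and samples:
        longest = max(longest, samples[-1][0] - outage_start)
    return longest


# Hypothetical one-second probes: reachable, then down from t=5 until t=12.
samples = [(t, not (5 <= t < 12)) for t in range(20)]
print("longest outage (s):", longest_outage(samples))
```

Recording the same measurement during a quiet period and during a busy one would be one way to check whether the pause really grows under load.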
03-25-2015 08:36 PM
Re: A disruptive non-disruptive failover?
Hi mate
Would be great if you could post any answers you receive from support here. I have much the same setup as you, but I have only done upgrades during our PoC phase, not when running in full production mode.
Regards,
Marty
03-29-2015 11:08 AM
Re: A disruptive non-disruptive failover?
Hi all.
I might be onto something, but it will take until the next software upgrade to confirm whether I've got it fixed. Being at the NIOP course got me thinking to check a setting that changed in NOS 2.x. It turns out it hadn't been set right after the upgrade (my bad). I tried a test failover today and it was as smooth as expected; a software upgrade will be the real challenge. I'll post back as soon as that's done.
Also, Marty, to your request:
Know that Nimble's failovers are supposed to be transparent, and I've seen them work that way personally in the past. Nimble's InfoSight metrics still show well over half of their customers performing software upgrades during business hours. If you haven't seen the blog post from November, check it out: Nimble Storage Blog | Go Ahead – Update Your Storage Operating System in the Middle of the Day. Those numbers are still tracking from what I've heard, and I'll be back at midday upgrades once I get this bug worked out.
Alan
03-30-2015 06:02 AM
Re: A disruptive non-disruptive failover?
What setting did you change?
03-30-2015 09:30 AM
Re: A disruptive non-disruptive failover?
VMware's discovery IPs listed only the array management interface, as per the 1.x guides. They were never updated to use the two new discovery IPs that were configured with 2.x. So, I removed the management address and added the Nimble discovery addresses.
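The check described above can be sketched in a few lines of Python: compare what a host has configured for dynamic discovery with the array's discovery addresses, and flag a leftover management address. All addresses and the function name `review_discovery_targets` are hypothetical; how you obtain the configured list from a host is left out on purpose.

```python
def review_discovery_targets(configured, discovery_ips, management_ip):
    """Compare a host's dynamic-discovery list with the array's discovery IPs.

    Returns a dict with:
      missing         - discovery IPs the host does not list yet
      unexpected      - configured addresses that are not discovery IPs
      uses_management - True if the management IP is still configured
    """
    configured = set(configured)
    wanted = set(discovery_ips)
    return {
        "missing": sorted(wanted - configured),
        "unexpected": sorted(configured - wanted),
        "uses_management": management_ip in configured,
    }


# Hypothetical addresses: a 1.x-style host that only knows the management IP.
management_ip = "192.168.10.20"
discovery_ips = ["10.0.1.5", "10.0.2.5"]

before = review_discovery_targets([management_ip], discovery_ips, management_ip)
after = review_discovery_targets(discovery_ips, discovery_ips, management_ip)
print("before:", before)
print("after: ", after)
```

Running it per host makes it easy to spot a machine that was missed when the discovery list was updated.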
05-08-2015 05:27 PM
Re: A disruptive non-disruptive failover?
After updating to 2.2.6.0 today it appears I still have a problem. Our network took a brief pause during the update and vCenter logged an "all paths down" event for every Nimble datastore on each of our hosts. The outage lasted over two minutes, as indicated by other logs noting that the event had lasted over 140 seconds and that the hosts were switching to I/O fast fail mode. Looks like I'll need to open a support ticket.
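To put the numbers in this thread side by side, here is a small Python sketch that classifies an outage duration against the two thresholds mentioned so far: the 120-second SCSI timeout recommended earlier and the 140 seconds after which the hosts logged the switch to I/O fast fail. The constant names, the function `classify_outage`, and the wording of the labels are illustrative assumptions, not anything taken from VMware or Nimble tooling.

```python
SCSI_TIMEOUT_S = 120  # disk timeout recommended earlier in the thread
APD_TIMEOUT_S = 140   # duration at which the hosts logged the move to fast fail


def classify_outage(seconds):
    """Place an outage duration relative to the two timeouts above."""
    if seconds < 0:
        raise ValueError("outage duration cannot be negative")
    if seconds < SCSI_TIMEOUT_S:
        return "within the disk timeout: I/O pauses, then resumes"
    if seconds < APD_TIMEOUT_S:
        return "past the disk timeout: guests may start to see I/O errors"
    return "past the all-paths-down timeout: hosts switch to fast fail"


# The 5-10 second pause others reported versus the outage described above.
for duration in (10, 125, 145):
    print(duration, "->", classify_outage(duration))
```

Seen this way, a normal failover pause sits far below both thresholds, while the outage in this post crossed both of them.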
05-13-2015 02:33 PM
Re: A disruptive non-disruptive failover?
Alan, please review the following:
- Network Control Policy in UCS
- Flow Control policy in UCS
- Portfast is enabled on the switches connecting from the FIs
I saw a similar issue in a slightly different configuration and it was down to spanning tree on the uplink switches.
05-20-2015 12:57 AM
Re: A disruptive non-disruptive failover?
Hello Alan,
Did you manage to get in touch with Nimble Support with regards to this issue? Were they able to resolve the problem?
Would you mind posting any resolution you get? It could benefit the community.
twitter: @nick_dyer_
05-21-2015 01:38 PM
Re: A disruptive non-disruptive failover?
Hi Nick.
Unfortunately no, not yet. I have a whole list of more pressing problems so I haven't been able to contact support. I'll definitely be posting back here when we find a solution, though.
Alan
05-21-2015 01:41 PM
Re: A disruptive non-disruptive failover?
I double-checked those policies and our core switch, and they're all set to the recommended configurations. Spanning tree is a very good thought. It would explain the problem I'm seeing, but I'm using Appliance Ports and have an STP edge directive on our uplinks, so the obvious areas aren't falling victim to an STP timer. I'm keeping it in mind for continued research, though.
Thanks!
Alan
05-24-2015 04:46 PM
Re: A disruptive non-disruptive failover?
We also had our first disruptive failover with 2.2.6.0, and support are investigating.
07-08-2015 08:10 AM
Solution
Hi all.
I've been working with Support on this issue and we checked a few things I wanted to share. Our last upgrade this past weekend worked great, but we've also had things work great in the past only to break again. So, I don't consider these a resolution yet but they did appear to help. I'll confirm as the next few releases roll out and I install them.
- Array logs showed that there was an iSCSI login timeout during the last upgrade so hosts didn't reconnect to the datastores in a timely fashion. We don't use anything beyond initiator WWN authentication so it's not a CHAP issue.
- I double-checked Discovery IPs.
- The support engineer double-checked our UCS network control and flow control policies per the integration guide, and as noted above by Amirul, and found them to be correct.
- The engineer double-checked our NCM installation to make sure that it was indeed still current and running, and yes, it was.
- The engineer mentioned that he's seen problems before when outdated (read: default) VMware NIC drivers are used and suggested I make sure the Cisco drivers are current. I installed the latest Cisco enic bundle on all of our hosts, since we use SW iSCSI. The drivers, if you're looking for them, are available from the vSphere download pages or as an all-in-one ISO from Cisco. Only the enic drivers apply to us, but make sure to check your fnic or other vHBA drivers as required.
In summary, the one change I made at Support's request was to update the NIC drivers. I did that, things worked fine during the update, and I'll post back after the next upgrade.
Alan
01-03-2016 08:41 AM
Re: A disruptive non-disruptive failover?
We haven't had a maintenance window in some time but just completed one over the holidays. Everything seemed to be fine this time; the only disruption was to one particular Linux VM (which belongs to a family we've had many different issues with before). I think Support's answer regarding the Cisco custom drivers was the best one, since everything else had been checked a few times before.
Hope someone else finds this useful in the future!
Alan