- Re: A disruptive non-disruptive failover?
03-24-2015 05:46 PM
I thought about opening a support ticket but decided that I'd reach out to the friendly Nimble community (not to be confused with your Friendly Neighborhood SE) to start a discussion and find out what I must have done wrong.
Since sometime early in NOS 2.x, possibly around when I switched over to using NCM for path management, we have had a bit of an issue during our formerly non-disruptive updates. The first few upgrades after we bought our Nimble were done midday and never caused a stir, but lately our whole network pauses after the controller failover. Watching the Nimble during the upgrade, I see that the tge interfaces have no traffic and, consequently, neither do the volumes. After a couple of minutes the paths all seem to reconnect and the servers get their drives back. Luckily our VMware guests handle this pretty well and just sit there while they wait for their disk requests to go through. But I suspect it's not a great idea to tear out a bunch of hard drives while they're in use.
Has anyone else seen this kind of behavior? Our topology hasn't changed except for NOS 2.x (currently 2.5.0), NCM, and some UCS firmware releases. I've reviewed all of the setup guides a couple of times to make sure I'm not doing something obvious, so I hope this turns out to not be something obvious.
The Details
We have a Nimble CS220G-X8 connected to a pair of Cisco UCS FIs as a direct-attached appliance. VLANs are set so that FI-A has one and FI-B has another; the Nimble's tge1 ports run to FI-A and the tge2 ports run to FI-B. We use VMFS datastores and NCM multipathing, which VMware reports is fully functional. I haven't been connected to our ESXi hosts during a failover, so I'm not sure what they report for their paths at that point. We use the software iSCSI client and have one vmnic bound to one vmkernel port per iSCSI VLAN.
03-25-2015 06:40 AM
Never hurts to start a case with support.
What address zone are you using? (Single Zone, Bisect, Even/Odd)
03-25-2015 06:56 AM
Hi Alan,
I've performed several controller upgrades, both physical and software. Typically you will see a pause in IO as the array transitions from the Active to the Standby controller. If you were pinging the management interface you might see a dropped packet; when viewing ESX or a guest you will typically see a pause in IO activity for a few seconds. More often than not this is around 5-10 seconds, but it is recommended to set the SCSI timeout to 120 seconds (which is best practice for pretty much all storage vendors).
This is also the value that NCM will set for Windows/ESX.
valdereth's advice is good: if you're seeing unexpected behaviour then a call to support should be made so they can check over your configuration. I doubt address zoning will come into it if you're plugged directly into the Fabric Interconnects of your UCS, as this really only comes into play when there are inter-switch links in the mix.
Cheers
Rich
03-25-2015 07:21 AM
Support is my next stop but I was curious to see if anyone else would have noticed the same behavior. I do expect an IO pause during a failover while everything is re-learned, and we used to have those short pauses, but now it's a couple of minutes and noticeably causes servers to stop responding.
I'm using single zones since we split into two distinct subnets on two switches.
Thanks!
03-25-2015 07:27 AM
That'd be my lack of UCS experience showing through.
The longest period where I've seen dropped pings to the discovery addresses was probably only 5-10 seconds during a software update. This has always been during low I/O periods; I'm not sure whether timeouts increase during intense I/O.
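For anyone who wants to put a number on the pause rather than eyeball a ping window, here is a minimal sketch of how the longest outage could be computed from a list of probe results. The function name `longest_outage` and the sample format (timestamp in seconds, reachable or not) are assumptions made for illustration; how you collect the samples (ping, a TCP connect to the iSCSI port, and so on) is up to you.

```python
from typing import Iterable, Optional, Tuple


def longest_outage(samples: Iterable[Tuple[float, bool]]) -> float:
    """Return the longest outage in seconds.

    samples: (timestamp_seconds, reachable) pairs in time order.
    An outage runs from the first failed probe to the next successful one.
    An outage still in progress at the end of the samples is not counted.
    """
    longest = 0.0
    outage_start: Optional[float] = None
    for timestamp, reachable in samples:
        if not reachable and outage_start is None:
            outage_start = timestamp  # first failed probe of this outage
        elif reachable and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None  # target answered again
    return longest


if __name__ == "__main__":
    # Hypothetical probe log: down from t=1 until the probe at t=8 succeeds.
    probes = [(0, True), (1, False), (2, False), (8, True), (9, True)]
    print(longest_outage(probes))
```

Probing once per second during the failover gives a resolution of about one second, which is enough to tell a 5-10 second pause from a multi-minute one.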
03-25-2015 08:36 PM
Hi mate
Would be great if you could post any answers you receive from support here. I have much the same setup as you, but I have only done upgrades during our PoC phase, not when running in full production mode.
Regards,
Marty
03-29-2015 11:08 AM
Hi all.
I might be onto something, but it will take until the next software upgrade to confirm whether I've got it fixed. The NIOP course got me thinking about a setting that changed in NOS 2.x, and it turns out it hadn't been set right after the upgrade (my bad). I tried a test failover today and it was as smooth as expected; a software upgrade will be the real challenge. I'll post back as soon as that's done.
Also, Marty, to your request:
Know that Nimble's failovers are supposed to be transparent, and I've seen them work that way personally in the past. Nimble's InfoSight metrics still show well over half of their customers performing software upgrades during business hours. If you haven't seen the blog post from November, check it out: Nimble Storage Blog | Go Ahead – Update Your Storage Operating System in the Middle of the Day. Those numbers are still tracking from what I've heard, and I'll be back at midday upgrades once I get this bug worked out.
Alan
03-30-2015 06:02 AM
What setting did you change?
03-30-2015 09:30 AM
VMware's discovery IPs listed only the array management interface, as per the 1.x guides. They were never updated to use the two new discovery IPs that were configured with 2.x. So, I removed the management address and added the Nimble discovery addresses.
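The change described above amounts to comparing the discovery targets each host currently has against the ones it should have. A rough sketch of that comparison follows; the function name `plan_discovery_changes` and all IP addresses are made up for illustration (documentation-range addresses, not real array addresses), and the script only reports the difference, it does not touch any host.

```python
from typing import Dict, Iterable, List


def plan_discovery_changes(
    configured: Iterable[str], desired: Iterable[str]
) -> Dict[str, List[str]]:
    """Compare a host's current discovery targets with the desired set.

    Returns the addresses to remove (configured but not desired) and the
    addresses to add (desired but not yet configured), each sorted.
    """
    configured_set = set(configured)
    desired_set = set(desired)
    return {
        "remove": sorted(configured_set - desired_set),
        "add": sorted(desired_set - configured_set),
    }


if __name__ == "__main__":
    # Hypothetical addresses: one old management IP, two discovery IPs.
    current_targets = ["192.0.2.10"]
    discovery_ips = ["198.51.100.10", "203.0.113.10"]
    print(plan_discovery_changes(current_targets, discovery_ips))
```

Running the same comparison per host makes it easy to spot one that still points at the management address after an upgrade.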
05-08-2015 05:27 PM
After updating to 2.2.6.0 today it appears I still have a problem. Our network took a brief pause during the update and vCenter logged an "all paths down" event for every Nimble datastore on each of our hosts. The outage lasted over two minutes, as indicated by other logs that note the event has been over 140 seconds long and that the hosts are switching to I/O fast fail mode. Looks like I'll need to open a support ticket.
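To put the numbers in this thread side by side, here is a small sketch that classifies a measured pause against the two thresholds mentioned: the 120 second SCSI timeout recommended earlier in the thread and the 140 second mark quoted from the host logs above. The function name and the category labels are my own, and the defaults are simply the figures from this discussion, so adjust them to whatever your hosts and guests are actually set to.

```python
def classify_pause(
    seconds: float,
    disk_timeout: float = 120.0,  # SCSI timeout recommended earlier in the thread
    apd_timeout: float = 140.0,   # duration quoted in the host logs above
) -> str:
    """Label a storage pause by which threshold it crossed, if any."""
    if seconds <= disk_timeout:
        return "within disk timeout"
    if seconds <= apd_timeout:
        return "disk timeout exceeded"
    return "all-paths-down fast fail"


if __name__ == "__main__":
    for pause in (10, 130, 150):
        print(pause, classify_pause(pause))
```

By this yardstick, the 5-10 second pauses others describe are harmless, while an outage of more than 140 seconds is past both thresholds.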