05-06-2010 10:40 AM
P4300 G2 complete meltdown when switch comes online
While stress testing the system we found that individual link failures were handled okay. If we powered down a switch, that was handled okay, too. But when we power that switch back up, the SAN goes crazy: communication is lost and the CMC reports no quorum. Precisely 12 minutes after the switch comes back online, quorum is restored and the SAN functions again.
Support has not been terribly helpful, although I have to admit this seems like a particularly unusual problem.
We have adjusted the STP timers so that convergence occurs in 6 seconds, but that doesn't seem to help during a switch reload. It did help to eliminate problems when pulling the trunk cables.
We think the problem might have something to do with the physical links being up but not passing traffic while the switch is booting. We are really stumped here, and this is our last hurdle before putting this into production. It's a rather major stability issue, so it needs to be addressed. Has anyone seen this before? Does anyone know exactly what the SAN/iQ algorithm is for dealing with communication problems? The documentation mentioned a 15 second timeout of some sort but didn't explain it. We appreciate any help or ideas.
05-06-2010 11:53 AM
Re: P4300 G2 complete meltdown when switch comes online
Could the issue you are seeing be related to MAC flapping? We had that issue when we tried to set ESX up using trunks on the SAN side (big no-no, but my Network Engineer would not listen).
05-06-2010 12:03 PM
Re: P4300 G2 complete meltdown when switch comes online
We've since hammered this out with pen and paper (that's how you know we are desperate!). What we think is happening is that when the switch goes down, ARP times out and the NSMs send out new ARP replies with only the MAC of the interfaces that are up. Everything works fine. When the switch is turned back on, but before it has finished booting (about 2 minutes), all links on the NSMs are up but the switch isn't passing traffic yet. We are assuming that the NSMs are now sending ARP replies with both MAC addresses (we are using ALB bonds). That would confuse all devices involved, including the NSMs, FOM, and ESXi hosts: the interfaces connected to the booting switch can't pass traffic yet, so their MAC addresses are unreachable, but because the link is up they still get advertised.
Even still... that really only explains a 2 minute plus 15 second downtime. We get 12 minutes of downtime each and every time we do this test. I don't understand what SAN/iQ is doing during those 12 minutes.
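To put numbers on that gap, here is a small sketch of the arithmetic using only the figures quoted in this thread (about 2 minutes of switch boot, the 15 second timeout from the documentation, and the 12 minute outage). The constant and function names are made up for illustration, and treating the boot time and timeout as simply additive is an assumption.

```python
# Rough downtime budget using the figures quoted in this thread.
SWITCH_BOOT_S = 2 * 60        # switch takes about 2 minutes to boot
SANIQ_TIMEOUT_S = 15          # the "15 second timeout" from the documentation
OBSERVED_OUTAGE_S = 12 * 60   # outage seen on every test run


def unexplained_seconds(boot_s, timeout_s, observed_s):
    """Return how much of the observed outage the ARP theory does not cover.

    Assumes boot time and timeout simply add up (an assumption, not a
    documented SAN/iQ behavior).
    """
    return observed_s - (boot_s + timeout_s)


gap = unexplained_seconds(SWITCH_BOOT_S, SANIQ_TIMEOUT_S, OBSERVED_OUTAGE_S)
print(f"explained: {SWITCH_BOOT_S + SANIQ_TIMEOUT_S} s, unexplained: {gap} s")
```

That leaves 585 seconds, or 9 minutes 45 seconds, that the ARP theory alone does not account for.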
Is there a way to manually force a quorum? What would be really great is if SAN/iQ had a beacon probe feature like ESXi for detecting dead links. That would probably solve all of our problems.
Anyone out there willing to cycle their storage switches to see what happens?? hahaha. Thanks for the feedback!
05-06-2010 12:22 PM
Re: P4300 G2 complete meltdown when switch comes online
You could see if it is the ALB bonds by going to Fault Tolerance only. I believe ALB uses both to send, but only one to receive with the masked MAC. That would still point to MAC flapping.
How are your ESXi machines configured for their multiple iSCSI NICs? If you don't see anything in the switch logs, it could be a client side issue. I'd get a test machine, throw Wireshark on it, and monitor the traffic; you should easily be able to see what's going on there.
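Once a capture exists, one way to look for MAC flapping is to list the ARP packets as (timestamp, IP, MAC) records and count how often each IP changes the MAC it advertises. The sketch below assumes the records have already been exported from the capture by hand or by some other tool; the function names, the threshold, and the sample addresses are all hypothetical. With ALB bonds, more than one MAC per IP may be normal, so a high count is a hint to look closer, not proof of a fault.

```python
from collections import defaultdict


def count_mac_changes(observations):
    """Count how many times each IP switches to a different MAC.

    observations: iterable of (timestamp, ip, mac) tuples, for example
    taken from the ARP packets in a capture (hypothetical input format).
    """
    last_mac = {}
    changes = defaultdict(int)
    for _ts, ip, mac in sorted(observations):  # process in time order
        mac = mac.lower()  # compare MACs case-insensitively
        if ip in last_mac and last_mac[ip] != mac:
            changes[ip] += 1
        last_mac[ip] = mac
    return dict(changes)


def flapping_ips(observations, threshold=2):
    """Return the IPs whose advertised MAC changed at least `threshold` times."""
    counts = count_mac_changes(observations)
    return sorted(ip for ip, n in counts.items() if n >= threshold)


# Made-up sample data: 10.0.0.11 bounces between two MACs, 10.0.0.12 is stable.
sample = [
    (0.0, "10.0.0.11", "00:11:22:33:44:01"),
    (1.0, "10.0.0.11", "00:11:22:33:44:02"),
    (2.0, "10.0.0.11", "00:11:22:33:44:01"),
    (0.5, "10.0.0.12", "00:11:22:33:44:0A"),
    (2.5, "10.0.0.12", "00:11:22:33:44:0a"),
]
print(flapping_ips(sample))
```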
05-06-2010 12:47 PM
Re: P4300 G2 complete meltdown when switch comes online
I guess we will try removing ALB to see if that helps, but I'm not thrilled at the prospect of being forced back down to 1Gbps.
Question for you - while your stacks are booting up, are the links up or down? On my 3550-12Ts the links come up immediately - 2 minutes before the switch passes traffic. On my 3560 series switches I'm pretty sure the links don't come up until the booting process has completed. I wish we had some other gigabit switches to test with, but we don't.
I should also mention that portfast is turned on for all of these iSCSI endpoint ports.
Thanks SO MUCH for your insight and quick responses.
05-06-2010 12:56 PM
Re: P4300 G2 complete meltdown when switch comes online
05-07-2010 06:05 AM
Re: P4300 G2 complete meltdown when switch comes online
What if you turned portfast OFF for the SAN connections? Does that change the behavior at all?
I know you should be able to run with it on, but I'm just wondering if that has an effect.
05-07-2010 06:57 AM
Re: P4300 G2 complete meltdown when switch comes online
I can now confirm that the SAN is trying to send data out the links before the switch has booted. As soon as the switch powers up we see the link lights and the flicker of activity from the SAN - even though the switch is still booting and going through the usual startup diagnostics. Communication between all nodes and the FOM is lost at exactly the moment the switch starts powering up. When the switch finishes booting we regain communication and find the SAN in a lost quorum state. This time we had one NSM online and one NSM with a down manager. Sometimes it's both. After 12 minutes, the problem fixes itself and our VMs come back online.
During the time that the switch is powered down, everything is fine. Boy am I confused! I'm about ready to replace these 3550-12Ts with HP 2824s because I've got nothing left to go on except that these switches just don't play nice.
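For anyone repeating this test, it may help to timestamp the outage instead of watching link lights. The sketch below is one possible approach, not anything SAN/iQ or the CMC provides: it probes a node with a plain TCP connect (port 3260 is the standard iSCSI port) and turns a log of reachability samples into outage windows. The function names and the sample log are hypothetical.

```python
import socket


def is_reachable(host, port=3260, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def outage_windows(samples):
    """Turn (timestamp_seconds, reachable) samples into (start, end) outages.

    samples must be in time order. An outage still in progress at the end
    of the log is reported with end set to None.
    """
    windows = []
    down_since = None
    for ts, ok in samples:
        if not ok and down_since is None:
            down_since = ts                    # outage begins
        elif ok and down_since is not None:
            windows.append((down_since, ts))   # outage ends
            down_since = None
    if down_since is not None:
        windows.append((down_since, None))
    return windows


# Made-up log: node drops at t=10 s and answers again at t=730 s.
log = [(0, True), (10, False), (20, False), (730, True), (740, True)]
for start, end in outage_windows(log):
    print(f"down from {start} s to {end} s ({end - start} s)")
```

In a real run the log would be built by calling `is_reachable` against each NSM and the FOM once per second with a timestamp, starting before the switch is powered back on.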
05-07-2010 07:08 AM
Re: P4300 G2 complete meltdown when switch comes online
05-07-2010 10:27 AM
Re: P4300 G2 complete meltdown when switch comes online
We did a test where we powered down one switch, unplugged the two SAN connections before powering the switch back up, and then plugged them in again once the switch had booted. This had the same result.
We did a second test where we powered down one switch, unplugged both the SAN and ESX iSCSI connections from the switch, and reconnected them after the switch had booted. This resulted in ZERO DOWNTIME and no quorum errors.
Could this problem be related to my ESX hosts, then? I'm wondering how my ESX hosts could be wrecking communication between the storage nodes and FOM, but it seems like that's what's happening - otherwise I wouldn't have to unplug the ESX hosts before powering on the switch, right?