StoreVirtual Storage

P4300 G2 complete meltdown when switch comes online

 
Frequent Advisor

P4300 G2 complete meltdown when switch comes online

We have a single-site, 2-node P4300 G2 SAN that we are using with three ESXi 4 U1 hosts. The IP SAN infrastructure consists of two Cisco 3550-12T switches, physically isolated from our production network and connected to one another with a two-port EtherChannel trunk. We have an FOM installed as well.

While stress testing the system we found that individual link failures were handled fine. Powering down a switch was handled fine, too. But when we power that downed switch back up, the SAN goes crazy: communication is lost and the CMC reports no quorum. Precisely 12 minutes after the switch comes back online, quorum is restored and the SAN functions again.

Support has not been terribly helpful, although I have to admit this seems like a particularly unusual problem.

We have adjusted the STP timers so that convergence occurs in 6 seconds, but that doesn't seem to help during a switch reload. It did eliminate the problems we saw when pulling the trunk cables.
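
For reference, the tuning looks roughly like this (illustrative Catalyst 3550 config only; interface numbers, VLAN, and exact timer values are placeholders, so treat it as a sketch rather than our exact running config):

```
! Illustrative only -- interface range, VLAN, and timer values are guesses
! at a config that yields the fast convergence described above.
spanning-tree vlan 1 hello-time 1
spanning-tree vlan 1 forward-time 4
spanning-tree vlan 1 max-age 6
!
! Edge ports (iSCSI endpoints) go straight to forwarding:
interface range GigabitEthernet0/1 - 8
 spanning-tree portfast
```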

We think the problem might have something to do with the physical links being up but not passing traffic while the switch is booting. We are really stumped, and this is our last hurdle before putting the system into production. It's a major stability issue, so it needs to be addressed. Has anyone seen this before? Does anyone know exactly what algorithm SAN/iQ uses to deal with communication problems? The documentation mentioned a 15-second timeout of some sort but didn't explain it. We appreciate any help or ideas.
20 REPLIES
Trusted Contributor

Re: P4300 G2 complete meltdown when switch comes online

I think the 15 second timeout you are seeing is related to the VIP failover process. When the VIP goes down, SAN/iQ will take up to 15 seconds to elect a new VIP. Unfortunately, that sometimes causes significant issues.
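
Quorum itself is just majority counting among the managers: two node managers plus the FOM gives three votes. A toy sketch of that counting (purely illustrative, not SAN/iQ internals):

```python
# Toy illustration of majority-based quorum (not SAN/iQ internals):
# a management group stays online only while a strict majority of
# its managers can communicate with each other.

def has_quorum(reachable: int, total: int) -> bool:
    """True if strictly more than half of the managers are reachable."""
    return reachable > total // 2

# 2 storage-node managers + 1 FOM = 3 managers total
assert has_quorum(3, 3)        # everything healthy
assert has_quorum(2, 3)        # one manager unreachable: still quorum
assert not has_quorum(1, 3)    # isolated manager: no quorum, volumes offline
```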

Could the issue you are seeing be related to MAC flapping? We had that issue when we tried to set up ESX using trunks on the SAN side (big no-no, but my network engineer would not listen).
Frequent Advisor

Re: P4300 G2 complete meltdown when switch comes online

The first thing we suspected was MAC flapping swamping the switch and bringing down the works. Being able to ping each device while the SAN was down for those 12 minutes initially ruled that out, and to make sure, we watched the console on both switches. No flapping.

We've since hammered this out with pen and paper (that's how you know we are desperate!). Here is what we think is happening. When the switch goes down, ARP entries time out and the NSM's send out new ARP replies containing only the MACs of the interfaces that are up, so everything works fine. When the switch is turned back on, all of its links come up before it has finished booting (about 2 minutes), but it isn't passing traffic yet. We are assuming the NSM's then start sending ARP replies with both MAC addresses (we are using ALB bonds), which confuses every device involved: NSM's, FOM, and ESXi hosts. The interface connected to the booting switch has link, so its MAC still gets advertised, but no traffic passes through it, so it appears unreachable.
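
If that theory is right, it should show up on the wire as a single IP answering ARP with two different MACs. A quick way to check from captured ARP replies; this is a hypothetical sketch, and the addresses are made up:

```python
from collections import defaultdict

def arp_conflicts(replies):
    """Given (ip, mac) pairs from observed ARP replies, return the IPs
    that were advertised with more than one MAC address."""
    macs_by_ip = defaultdict(set)
    for ip, mac in replies:
        macs_by_ip[ip].add(mac)
    return {ip: macs for ip, macs in macs_by_ip.items() if len(macs) > 1}

# Made-up observations: one NSM's IP answers from both bond slaves while
# one switch is booting (link up, traffic black-holed).
observed = [
    ("10.0.0.11", "00:11:22:33:44:01"),
    ("10.0.0.11", "00:11:22:33:44:02"),
    ("10.0.0.12", "00:11:22:33:44:03"),
]
print(arp_conflicts(observed))
```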

Even still... that really only explains 2 minutes plus 15 seconds of downtime. We get 12 minutes of downtime each and every time we run this test. I don't understand what SAN/iQ is doing during those 12 minutes.

Is there a way to manually force a quorum? What would be really great is if SAN/iQ had a beacon probe feature like ESXi for detecting dead links. That would probably solve all of our problems.

Anyone out there willing to cycle their storage switches to see what happens?? hahaha. Thanks for the feedback!
Trusted Contributor

Re: P4300 G2 complete meltdown when switch comes online

Last month I did an IOS update on one of my two 3750 stacks (redundant stacks using ALB on the nodes). I've got a total of 40 nodes running on them. Brought one complete stack down: no problem. Brought it back up: no problem. Other stack down: no problem. Back up again: no problem.

You could see if it is the ALB bonds by going to Fault Tolerance only. I believe ALB uses both to send, but only one to receive with the masked MAC. That would still point to MAC flapping.
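
If it helps to reason about the two modes, the Linux bonding driver has rough equivalents: balance-alb (both slaves transmit, receive balancing via ARP rewriting, like ALB) versus active-backup (failover only). This is an analogy only, not the NSM's actual configuration syntax:

```
# Analogy only -- Linux bonding, not LeftHand's own configuration.
# ALB-like: both slaves transmit; receive balancing via ARP rewriting.
modprobe bonding mode=balance-alb miimon=100
# Failover-only equivalent of a "Fault Tolerance" bond:
# modprobe bonding mode=active-backup miimon=100
```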

How are your ESXi machines configured for their multiple iSCSI NICs? If you don't see anything in the switch logs, it could be a client-side issue. I'd get a test machine, throw Wireshark on it, and monitor the traffic; you should easily be able to see what's going on.

Frequent Advisor

Re: P4300 G2 complete meltdown when switch comes online

Our ESXi hosts each have two vmkernel ports bound to the iSCSI software adapter, each assigned to a single physical NIC, as described in a best-practice whitepaper from HP. As far as the SAN is concerned, each ESXi host maintains two iSCSI connections (each host has two iSCSI IP addresses). We believe the SAN's ALB uses ARP to advertise different NICs to different IPs and can receive traffic on both NICs as long as the endpoints are different (each TCP connection peaks close to 1Gbps under extreme load). We've seen close to 2Gbps throughput with this setup, and apart from this particular situation, every other failure we've simulated has been handled in under a second. ESXi sees two distinct paths for each iSCSI datastore.
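
For completeness, the binding was done the usual way for ESXi 4.x software iSCSI; the vmk/vmhba names below are placeholders, so check them against your own hosts:

```
# ESXi 4.x software iSCSI port binding -- names are placeholders.
esxcli swiscsi nic add -n vmk1 -d vmhba33
esxcli swiscsi nic add -n vmk2 -d vmhba33
# Verify both vmkernel ports are bound:
esxcli swiscsi nic list -d vmhba33
```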

I guess we will try removing ALB to see if that helps but I'm not thrilled at the prospect of being forced back down to 1Gbps.

Question for you: while your stacks are booting up, are the links up or down? On my 3550-12T's the links come up immediately, 2 minutes before the switch passes traffic. On my 3560-series switches I'm pretty sure the links don't come up until booting has completed. I wish we had some other gigabit switches to test with, but we don't.

I should also mention that portfast is turned on for all of these iSCSI endpoint ports.

Thanks SO MUCH for your insight and quick responses.
Frequent Advisor

Re: P4300 G2 complete meltdown when switch comes online

I forgot to mention that we are taking ESXi out of the equation while troubleshooting this issue and concerning ourselves only with the CMC. The FOM and both storage boxes are losing communication and losing quorum, and until that's fixed we figure it's a lost cause worrying about ESX.
Respected Contributor

Re: P4300 G2 complete meltdown when switch comes online

Matthew,

What if you turned portfast OFF for the SAN connections? Does that change the behavior at all?

I know you should be able to run with it on, but I'm wondering whether it has an effect.
http://www.tdonline.com
Frequent Advisor

Re: P4300 G2 complete meltdown when switch comes online

Thanks for the suggestion, Teledata. We disabled portfast on all the iSCSI ports and tested again. Unfortunately the results were the same. We did collect a bit more information this time.

I can now confirm that the SAN is trying to send data out those links before the switch has booted. As soon as the switch powers up we see the link lights and a flicker of activity from the SAN, even though the switch is still running its usual startup diagnostics. The moment power-up starts is also the moment communication is lost between all nodes and the FOM. When the switch finishes booting we regain communication and find the SAN in a lost-quorum state. This time we had one NSM online and one NSM with a down manager; sometimes it's both. After 12 minutes the problem fixes itself and our VM's come back online.

During the time that the switch is powered down, everything is fine. Boy am I confused! I'm about ready to replace these 3550-12T's with HP 2824's because I've got nothing left to go on except that these switches just don't play nice.
Frequent Advisor

Re: P4300 G2 complete meltdown when switch comes online

My next test will be to down a switch, physically disconnect the links between the storage nodes and that offline switch, and only plug them back in after the switch has fully booted. If there is no downtime during that test, then I think it has to be the switch's behavior causing the problem. Is anyone using HP 2824 switches who can verify whether the links come online after bootup rather than before? I assume the 3750's bring their links online after bootup, but it'd be nice to know that, too. Thanks for all the feedback. You guys have been much more helpful than support (although I have to admit LeftHand support has been extremely good to us overall).
Frequent Advisor

Re: P4300 G2 complete meltdown when switch comes online

GETTING CLOSER...

We did a test where we powered down one switch and then unplugged the two SAN connections before powering up the switch and then plugged them in again once the switch booted. This had the same result.

We did a second test where we powered down one switch and unplugged both the SAN and ESX iSCSI connections from the switch and then reconnected after the switch had booted. This resulted in ZERO DOWNTIME and no quorum errors.

Could this problem be related to my ESX hosts, then? I'm wondering how my ESX hosts could be wrecking communication between the storage nodes and the FOM, but it seems like that's what's happening; otherwise I wouldn't have to unplug the ESX hosts before powering on the switch, right?