HPE StoreVirtual Storage / LeftHand

P4300 G2 complete meltdown when switch comes online

Matthew D
Frequent Advisor

P4300 G2 complete meltdown when switch comes online

We have a single site 2-node P4300 G2 SAN that we are using with three ESXi 4u1 hosts. The IP SAN infrastructure is two 3550-12T switches physically isolated from our production network and connected to one another with a two-port etherchannel trunk. We have an FOM installed as well.

While stress testing the system we found that individual link failures were handled okay. If we powered down a switch, that was handled okay, too. But when we power that switch back up, the SAN goes crazy: communication is lost and the CMC reports no quorum. Precisely 12 minutes after the switch comes back online, quorum is restored and the SAN functions again.

Support has not been terribly helpful, although I have to admit this seems like a particularly unusual problem.

We have adjusted the STP timers so that convergence occurs in 6 seconds but that doesn't seem to help during a switch reload. It did help to eliminate problems when pulling the trunk cables.

We think the problem might have something to do with the physical links being up but not passing traffic while the switch is booting. We are really stumped here and this is our last hurdle before putting this into production. It's a rather major stability issue so it needs to be addressed. Has anyone seen this before? Does anyone know exactly what the SAN/iQ algorithm is for dealing with communication problems? The documentation mentioned a 15 second timeout of some sort but didn't explain. We appreciate any help or ideas.
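For anyone who wants to put a number on the outage instead of eyeballing the CMC, here is a minimal Python sketch. It assumes the SAN answers TCP connections on the standard iSCSI port 3260; the function names `probe` and `outage_windows` are hypothetical helpers for illustration, not part of SAN/iQ or any HP tool.

```python
import socket
from typing import List, Tuple


def probe(host: str, port: int = 3260, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Port 3260 is the standard iSCSI port; that the SAN answers on it
    is an assumption about the environment.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def outage_windows(samples: List[Tuple[float, bool]]) -> List[Tuple[float, float]]:
    """Collapse (timestamp, reachable) samples into (start, end) outage windows.

    An outage starts at the first failed sample and ends at the next
    successful one. An outage still open when the data ends is closed at
    the timestamp of the last sample.
    """
    windows = []
    start = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts            # first failure opens a window
        elif ok and start is not None:
            windows.append((start, ts))  # first success closes it
            start = None
    if start is not None:
        windows.append((start, samples[-1][0]))
    return windows
```

Calling `probe` once a second from a test machine on the iSCSI VLAN and feeding the `(time.time(), result)` pairs to `outage_windows` would show whether the outage really lasts 12 minutes from that machine's point of view, and whether it starts at switch power-on or only once the switch begins forwarding.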
20 REPLIES
Bryan McMullan
Trusted Contributor

Re: P4300 G2 complete meltdown when switch comes online

I think the 15 second timeout you are seeing is related to the VIP failover process. When the VIP goes down, SAN/iQ will take up to 15 seconds to elect a new VIP. Unfortunately, that sometimes causes significant issues.

Could the issue you are seeing be related to MAC flapping? We had that issue when we tried to set ESX up using trunks on the SAN side (big no-no, but my Network Engineer would not listen).
Matthew D
Frequent Advisor

Re: P4300 G2 complete meltdown when switch comes online

The first thing we thought was that there was MAC flapping that swamped the switch and brought down the works. Being able to ping each device while the SAN was down for those 12 minutes initially ruled that out and then to make sure we watched the console on both switches. No flapping.

We've since hammered this out with pen and paper (that's how you know we are desperate!). What we think is happening is this: when the switch goes down, ARP times out and the NSMs send out new ARP replies with only the MACs of the interfaces that are up. Everything works fine. When the switch is turned back on, but before it has finished booting (about 2 minutes), all links on the NSMs are up but the switch isn't passing traffic yet. We are assuming that the NSMs are now sending ARP replies with both MAC addresses (we are using ALB bonds). That would confuse every device involved, including the NSMs, FOM, and ESXi hosts: the interfaces connected to the booting switch are unreachable because the switch isn't passing traffic yet, but their links are up, so their MAC addresses still get advertised.

Even still... that really only explains a 2 minute plus 15 second downtime. We get 12 minutes of downtime each and every time we do this test. I don't understand what SAN/iQ is doing during those 12 minutes.

Is there a way to manually force a quorum? What would be really great is if SAN/iQ had a beacon probe feature like ESXi for detecting dead links. That would probably solve all of our problems.
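To make the beacon probe idea concrete, here is a minimal Python sketch of the difference between trusting link state alone and also requiring a probe to succeed. It is only an illustration of the concept under stated assumptions: the function name `usable_links` and the NIC names are hypothetical, and this is not how SAN/iQ or ESXi actually implement failover detection.

```python
from typing import Dict, List


def usable_links(link_up: Dict[str, bool], probe_ok: Dict[str, bool]) -> List[str]:
    """Return the NICs that are link-up AND whose last probe was answered.

    link_up  maps NIC name -> physical link state (what the bond sees today).
    probe_ok maps NIC name -> whether a probe sent out that NIC reached a peer.
    A NIC missing from probe_ok is treated as not reachable.
    """
    return sorted(
        nic for nic, up in link_up.items()
        if up and probe_ok.get(nic, False)
    )


# While the switch is booting: both links show up, but only eth0's
# switch is actually forwarding frames.
booting = usable_links(
    {"eth0": True, "eth1": True},
    {"eth0": True, "eth1": False},
)
```

With link state alone both NICs would be used during the boot window; with the probe requirement only `eth0` would be, which is the behavior we were hoping for.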

Anyone out there willing to cycle their storage switches to see what happens?? hahaha. Thanks for the feedback!
Bryan McMullan
Trusted Contributor

Re: P4300 G2 complete meltdown when switch comes online

Last month I did an IOS update on one of my two 3750 stacks (redundant stacks using ALB on the nodes). I've got a total of 40 nodes running on it. Brought one complete stack down, no problem. Brought it back up...no problem. Other stack down, no problem. Then back up, again no problem.

You could see if it is the ALB bonds by going to Fault Tolerance only. I believe ALB uses both to send, but only one to receive with the masked MAC. That would still point to MAC flapping.

How are your ESXi machines configured for their multiple iSCSI NICs? If you don't see anything in the switch logs, it could be a client side issue. I'd get a test machine and throw wireshark on it and monitor the traffic, you should easily be able to see what's going on there.
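Once you have a capture, one quick way to look for flapping is to list every time the MAC seen for a given IP changes. A minimal Python sketch follows; it assumes you have already exported ARP observations from the capture as `(timestamp, ip, mac)` tuples (the export step is not shown), and `mac_changes` is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Observation = Tuple[float, str, str]   # (timestamp, ip, mac)
Change = Tuple[float, str, str]        # (timestamp, old_mac, new_mac)


def mac_changes(observations: Iterable[Observation]) -> Dict[str, List[Change]]:
    """For each IP, list every point in time where its advertised MAC changed.

    Observations are processed in timestamp order. MACs are compared
    case-insensitively. IPs whose MAC never changes are omitted.
    """
    last_mac: Dict[str, str] = {}
    changes: Dict[str, List[Change]] = defaultdict(list)
    for ts, ip, mac in sorted(observations):
        mac = mac.lower()
        previous = last_mac.get(ip)
        if previous is not None and previous != mac:
            changes[ip].append((ts, previous, mac))
        last_mac[ip] = mac
    return dict(changes)
```

Keep in mind that all observations should come from a single vantage point. With ALB it may be normal for different clients to be given different MACs for the same node IP, so only changes seen by one client over time are suspicious.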

Matthew D
Frequent Advisor

Re: P4300 G2 complete meltdown when switch comes online

Our ESXi hosts each have two VMkernel ports bound to the iSCSI software adapter, each assigned to just one physical NIC. This was described in a best practice whitepaper from HP. As far as the SAN is concerned, each ESXi host maintains two iSCSI connections (each ESXi host has two iSCSI IP addresses). We think the SAN's ALB uses ARP to advertise different NICs to different IPs and receives traffic on both NICs as long as the endpoints are different (each TCP connection peaks close to 1Gbps under extreme load). We've seen close to 2Gbps performance with this setup, and except for this particular situation all other failures we've simulated have been handled in under a second. ESXi sees two distinct paths for each iSCSI datastore.

I guess we will try removing ALB to see if that helps but I'm not thrilled at the prospect of being forced back down to 1Gbps.

Question for you - while your stacks are booting up are the links up or down? On my 3550-12T's the links come up immediately - 2 minutes before the switch passes traffic. On my 3560 series switches I'm pretty sure the links don't come up until the booting process has completed. I wish we had some other gigabit switches to test with but we don't.

I should also mention that portfast is turned on for all of these iSCSI endpoint ports.

Thanks SO MUCH for your insight and quick responses.
Matthew D
Frequent Advisor

Re: P4300 G2 complete meltdown when switch comes online

I forgot to mention also that we are taking ESXi out of the equation while we're troubleshooting this issue. We are only concerning ourselves with the CMC. The FOM and both storage boxes are all losing communication and losing quorum, and until that's fixed we figure it's a lost cause concerning ourselves with ESX.
teledata
Respected Contributor

Re: P4300 G2 complete meltdown when switch comes online

Matthew,

What if you turned portfast OFF for the SAN connections, does that change the behavior at all?

I know you should be able to run with it on, but just wondering if that has an effect.
http://www.tdonline.com
Matthew D
Frequent Advisor

Re: P4300 G2 complete meltdown when switch comes online

Thanks for the suggestion, Teledata. We disabled portfast on all the iSCSI ports and tested again. Unfortunately the results were the same. We did collect a bit more information this time.

I can now confirm that the SAN is trying to send data out the links before the switch has booted. As soon as the switch powers up we see the link lights and the flicker of activity from the SAN - even though the switch is still booting and going through the usual startup diagnostics. Communication between all nodes and the FOM is lost at exactly the moment the power-up starts. When the switch finishes booting we regain communication and find the SAN in a lost quorum state. This time we had one NSM online and one NSM with a down manager. Sometimes it's both. After 12 minutes, the problem fixes itself and our VMs come back online.

During the time that the switch is powered down, everything is fine. Boy am I confused! I'm about ready to replace these 3550-12T's with HP 2824's because I've got nothing left to go on except that these switches just don't play nice.
Matthew D
Frequent Advisor

Re: P4300 G2 complete meltdown when switch comes online

My next test is going to be to power down a switch and then physically disconnect the links between the storage nodes and that offline switch prior to bringing the switch online. After the switch has fully booted I will plug them back in. If there is no downtime during this test then I think it's got to be the switch's behavior that's causing the problem. Is anyone using HP 2824 switches and can verify that links come online after bootup and not before? I assume that the 3750's bring their links online after bootup but that'd be nice to know, too. Thanks for all the feedback. You guys have been much more helpful than support (although I have to admit Lefthand support has been overall extremely good to us).
Matthew D
Frequent Advisor

Re: P4300 G2 complete meltdown when switch comes online

GETTING CLOSER...

We did a test where we powered down one switch and then unplugged the two SAN connections before powering up the switch and then plugged them in again once the switch booted. This had the same result.

We did a second test where we powered down one switch and unplugged both the SAN and ESX iSCSI connections from the switch and then reconnected after the switch had booted. This resulted in ZERO DOWNTIME and no quorum errors.

Could this problem be related to my ESX hosts then? I'm wondering how my ESX hosts could be wrecking communication between the storage nodes and FOM but it seems like that's what's happening - otherwise I wouldn't have to unplug the ESX hosts before powering on the switch, right?
Bryan McMullan
Trusted Contributor

Re: P4300 G2 complete meltdown when switch comes online

My stacks are Cisco 3750's and the links come up after the switch fully boots (which I believe is how things are supposed to work). I guess you could be link up, admin down...but that doesn't sound right.

Have you verified that there are no firmware updates for your switches? To me, it sounds as though the switches are acting funky.

It does sound like unplugging before powering up the switch would work. But it's still not the correct function of the switch. You shouldn't be link up/admin up until the switch is ready to pass traffic.

Matthew D
Frequent Advisor

Re: P4300 G2 complete meltdown when switch comes online

Bryan - I'm looking at buying 3750's to just be done with this mess. This way I can also use 802.3ad on the SAN storage nodes (right?). I'm curious as to which model you are using and how single switch faults are handled (I've never used stacking switches in Ciscoworld... only the 4000 and 6000 chassis style). Thanks for all your help man.
Matthew D
Frequent Advisor

Re: P4300 G2 complete meltdown when switch comes online

One of the iSCSI ports shows a huge number of output buffer failures. I'm confused as to why there were still output buffer failures, and no PAUSE frames sent, even though flow control is enabled (and activated).

GigabitEthernet0/4 is up, line protocol is up
  Hardware is Gigabit Ethernet, address is 0009.4494.4784 (bia 0009.4494.4784)
  Description: iSCSI
  MTU 1500 bytes, BW 1000000 Kbit, DLY 10 usec,
     reliability 255/255, txload 8/255, rxload 1/255
  Encapsulation ARPA, loopback not set
  Keepalive set (10 sec)
  Full-duplex, 1000Mb/s
  input flow-control is on, output flow-control is on
  ARP type: ARPA, ARP Timeout 04:00:00
  Last clearing of "show interface" counters never
  Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 0
  Queueing strategy: fifo
  Output queue :0/40 (size/max)
  5 minute input rate 26000 bits/sec, 20 packets/sec
  5 minute output rate 33076000 bits/sec, 2883 packets/sec
     13221748 packets input, 2709740682 bytes, 0 no buffer
     Received 0 broadcasts, 0 runts, 0 giants, 0 throttles
     0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
     0 watchdog, 0 multicast, 681158 pause input
     0 input packets with dribble condition detected
     323344807 packets output, 834052607 bytes, 94316 underruns
     0 output errors, 0 collisions, 1 interface resets
     0 babbles, 0 late collision, 0 deferred
     0 lost carrier, 0 no carrier, 0 PAUSE output
     94682 output buffer failures, 0 output buffers swapped out
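To compare these counters across all the iSCSI ports without reading each dump by hand, a small Python sketch like the one below can pull out the interesting numbers. The counter names are copied from the output above; the function name `parse_counters` and the choice of which counters to track are my own assumptions. Note that "pause input" counts PAUSE frames the switch received from the attached device, while "PAUSE output" counts frames the switch itself sent.

```python
import re
from typing import Dict

# Counter labels exactly as they appear in the "show interface" text.
COUNTERS = (
    "pause input",
    "PAUSE output",
    "underruns",
    "output errors",
    "output buffer failures",
)

# A few lines from the dump above, used as sample input.
SAMPLE = """
0 watchdog, 0 multicast, 681158 pause input
323344807 packets output, 834052607 bytes, 94316 underruns
0 output errors, 0 collisions, 1 interface resets
0 lost carrier, 0 no carrier, 0 PAUSE output
94682 output buffer failures, 0 output buffers swapped out
"""


def parse_counters(text: str) -> Dict[str, int]:
    """Return {counter label: value} for each known label found in text.

    Matching is case-sensitive, so "pause input" and "PAUSE output"
    are kept apart. Labels that do not appear are left out of the result.
    """
    found: Dict[str, int] = {}
    for label in COUNTERS:
        match = re.search(r"(\d+)\s+" + re.escape(label), text)
        if match:
            found[label] = int(match.group(1))
    return found


counters = parse_counters(SAMPLE)
```

On this port the result shows many PAUSE frames received and none sent, alongside the underruns and output buffer failures.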
Matthew D
Frequent Advisor

Re: P4300 G2 complete meltdown when switch comes online

I wanted to update everyone on how I'm doing with this issue.

I replaced the 3550-12T switches with a pair of HP 2900-24G's, stacked with a pair of 10-gig CX4 trunks. Similar configuration as before except RSTP is completely disabled. Jumbo frames still disabled, flow control still enabled (and active) on all iSCSI ports. The HP switch specs far exceed the older 3550's and we noticed a measurable (but not substantial) improvement in I/O performance.

The first problem is now completely fixed. The new switches do not bring ports online before booting is complete and I can restart a switch without affecting communication.

The second problem we thought was fixed but it appears to just be greatly reduced. After about 2.5 weeks on the new switches it appeared that we had solved the problem with spontaneous quorum loss, but this morning it happened again. Four hours after our backup window, early in the morning, at a time with very little I/O activity - the FOM and NSM's reported no communication, quorum was lost, then everything came up in a degraded state a few seconds later. I had to cycle power on one of the NSM's to regain quorum.

I guess my next step is getting back on the phone with HP support. Last time they looked at my logs in depth and found that it must be a network issue.
ACHCHGUY
Frequent Advisor

Re: P4300 G2 complete meltdown when switch comes online

Hi Matthew,

I had a quick word with our network gurus. (I am not a network guru, so I hope I explained the issue well to them and can relay their answer accurately.)

You cannot do the failover solution you are attempting on that model of switch unless you use the stacking cable out the back.

The only way you can use UTP and create trunks is if each P4000's NICs are in the same switch.

What you describe is exactly what the network gurus here say would happen with the way you are doing it.

So, to summarise:

You can use trunking over UTP cables only if each disk shelf's NICs are in the same switch, not spread across multiple switches.

If you use the stack cables out the back, it will be OK with the P4000 shelf's NICs spread across multiple switches.

Hope that helps.

Another suggestion was to turn off trunk negotiation.
Matthew D
Frequent Advisor

Re: P4300 G2 complete meltdown when switch comes online

Hey Steve,

So awesome of you to check with your network guys. I really appreciate that.

We are not using trunks to the P4000's. The only trunks are between the switches (because we have vMotion ports on a different VLAN). The P4000's are using ALB mode, which according to HP's configuration guide should work the way we have it connected. I understand why it wouldn't work with LACP.

We are using a pair of 10-gig stacking cables with the new switches. I wonder if we can enable LACP now. Hrmm... something to try after hours I suppose.
Matthew D
Frequent Advisor

Re: P4300 G2 complete meltdown when switch comes online

For anyone who is still following this massively long thread... thanks!

Our problems have returned after upgrading to 2900-24G switches and 10-gig trunks. Flow control enabled end-to-end and STP disabled (no need for it). Quorum still gets lost on one of the two storage nodes. We've ruled out VMware being at fault and Lefthand support continues to tell me that it must be a network problem. If I can't get something done next week we are going to move to a different storage platform. Unfortunately we just can't get Lefthand stable in our environment for whatever reason.
Matthew D
Frequent Advisor

Re: P4300 G2 complete meltdown when switch comes online

Since the topic of this thread has actually been solved and the only problems that remain aren't described in said topic, I will open a new thread for the issue that remains.
JacobS_4
Occasional Advisor

Re: P4300 G2 complete meltdown when switch comes online

Matthew,
Did you finally get the problem resolved with the LeftHand nodes losing quorum? We're having a slightly different problem, but I think there could be some similarities.

Thanks
-Jake-
Matthew D
Frequent Advisor

Re: P4300 G2 complete meltdown when switch comes online

Hi Jake,

The question was answered on the second thread that was started. To make a long story short the problem went unsolved for about two months. HP then released a patch to fix a bug with the Intel NIC drivers on the G2 series hardware. The patch description sounded like it was our problem exactly. We applied the patch and the SAN has been rock solid ever since. We've since enabled round robin load balancing and the SAN has been happy to comply. Guess we didn't need new switches after all but they certainly don't hurt.

Hope this helps.

Matt
Jim Silvia
Advisor

Re: P4300 G2 complete meltdown when switch comes online

I had a similar problem. The switch wasn't showing any errors. But performance was horrible and the storage manager kept going down. This was for a new install.

Look at the logs for each node. Hardware->Log Files. There's a hist.ifconfig.log - download that and check for any interface errors.
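If the log is long, a short Python sketch can summarize the error counters per interface. This assumes hist.ifconfig.log contains repeated snapshots in the classic Linux `ifconfig` layout ("Link encap:" header, then "RX packets:... errors:... dropped:..." lines); I have not confirmed the exact format on every SAN/iQ release, and the sample text and the function name `interface_errors` are made up for illustration.

```python
import re
from typing import Dict

# Assumed classic net-tools ifconfig layout; adjust if the log differs.
IFACE_RE = re.compile(r"^(\S+)\s+Link encap:")
STATS_RE = re.compile(r"^\s+(RX|TX) packets:\d+ errors:(\d+) dropped:(\d+)")

# Hypothetical sample, not taken from a real node.
SAMPLE_LOG = (
    "eth0      Link encap:Ethernet  HWaddr 00:11:22:33:44:55\n"
    "          RX packets:1000 errors:12 dropped:3 overruns:0 frame:12\n"
    "          TX packets:900 errors:0 dropped:0 overruns:0 carrier:0\n"
    "eth1      Link encap:Ethernet  HWaddr 00:11:22:33:44:66\n"
    "          RX packets:2000 errors:0 dropped:0 overruns:0 frame:0\n"
    "          TX packets:1800 errors:0 dropped:0 overruns:0 carrier:0\n"
)


def interface_errors(text: str) -> Dict[str, Dict[str, int]]:
    """Return per-interface RX/TX error and drop counts.

    If an interface appears in several snapshots, the values from the
    last snapshot in the text win.
    """
    result: Dict[str, Dict[str, int]] = {}
    current = None
    for line in text.splitlines():
        header = IFACE_RE.match(line)
        if header:
            current = header.group(1)
            result[current] = {}          # start fresh for this snapshot
            continue
        stats = STATS_RE.match(line)
        if stats and current is not None:
            direction = stats.group(1)
            result[current][direction + " errors"] = int(stats.group(2))
            result[current][direction + " dropped"] = int(stats.group(3))
    return result


summary = interface_errors(SAMPLE_LOG)
```

Any interface with non-zero errors in the summary is worth comparing against the counters on the matching switch port.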

Once we found this we saw a large number of errors reported by the NSM NIC that weren't seen on the switch. We moved the CAT5 to new gigabit ports on the same switch (module 4208) and the newer module resolved the problem. No more errors and great performance.