
P4300 G2 SAS Starter frequently loses quorum

Matthew D
Frequent Advisor

This is a continuation of the "complete meltdown when switch restarts" thread. The problems originally cited in that thread were solved so I wanted to branch out to a dedicated thread for this issue.

I have three ESX4U2 servers and two P4300 G2 storage nodes. The FOM is running in VMware. Each host has two connections for iSCSI, one to each of two HP 2900-24G switches. The two HP switches are connected via a pair of 10-gigabit trunks. Flow control is enabled on every port. Jumbo frames are disabled everywhere.

The P4300 G2s are bonded in ALB mode. The ESX servers are configured as per the now-famous blog post: two vmkernel NICs, both bound to the iSCSI software adapter, with round robin path selection.

The configuration has shown great performance. However, about once per week, at random times of day (usually during low-activity periods), ESX briefly loses storage connectivity, and opening the Lefthand CMC reveals that one of the two storage nodes is offline. After power cycling the downed node and looking through all of the log files, it appears that storage node two lost communication with the FOM and the first node. Lefthand support tells me that my network is at fault, but as far as I can tell I have everything configured properly. I also completely replaced my network equipment; I previously used Cisco 3550-12Ts and experienced the same problems with them.

The alert files show that the second node lost contact with both the FOM and the first storage node. The FOM and first storage node both record no connectivity with the second storage node.
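For readers following along, the quorum arithmetic behind this symptom can be sketched roughly as below. This is only an illustration of a simple majority rule, assuming three managers in total (the two storage nodes plus the FOM); the function name `has_quorum` is hypothetical and is not part of any Lefthand tool.

```python
def has_quorum(reachable_managers: int, total_managers: int) -> bool:
    """Simple majority rule: more than half of all managers must be reachable."""
    return reachable_managers > total_managers // 2


# Assumed layout from the post: two storage nodes plus the FOM = 3 managers.
TOTAL_MANAGERS = 3

# Storage node two is isolated and can only count itself.
isolated_side = has_quorum(1, TOTAL_MANAGERS)

# The first storage node and the FOM still see each other.
surviving_side = has_quorum(2, TOTAL_MANAGERS)

print("isolated node has quorum:", isolated_side)
print("node one + FOM have quorum:", surviving_side)
```

Under these assumptions the isolated node cannot form a majority on its own, while the other two managers can, which matches the alert files described above.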

More recently I discovered that immediately following one of these events, the HP switches report 6 packets/sec received and 3 packets/sec DEFERRED (flow control) only on the second storage node's ports. This continues until I power cycle the box.
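One way to catch this state without waiting for ESX to complain is to watch the deferred counter per port and flag any port where it stays above zero. The sketch below is a minimal illustration only: the port labels and readings are hypothetical values typed in by hand, and the function `stuck_ports` is an invented helper, not a switch or Lefthand feature. How the counters are actually collected (CLI, SNMP, or otherwise) is left out.

```python
from typing import Dict, List


def stuck_ports(samples: Dict[str, List[float]], min_consecutive: int = 3) -> List[str]:
    """Return ports whose deferred packets/sec stayed above zero
    for the last `min_consecutive` samples."""
    flagged = []
    for port, rates in samples.items():
        recent = rates[-min_consecutive:]
        if len(recent) == min_consecutive and all(rate > 0 for rate in recent):
            flagged.append(port)
    return flagged


# Hypothetical readings (deferred packets/sec per polling interval).
readings = {
    "node1-a": [0, 0, 0, 0],
    "node2-a": [0, 3, 3, 3],
    "node2-b": [0, 3, 3, 3],
}

print(stuck_ports(readings))
```

Requiring several consecutive non-zero samples is a design choice to avoid flagging a port for a single brief burst of flow control.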

I've been troubleshooting this for literally months and I've gotten approval to sell the Lefthand gear if a solution isn't found. I really don't want to do that but I must have a stable environment. I'm sure it's something unique to our environment but I just don't see what. Any ideas?
8 REPLIES
chris huys_4
Honored Contributor

Re: P4300 G2 SAS Starter frequently loses quorum

Hi Matthew,

Log a call with HP Support and ask them to elevate the case to level 3 (WTEC).

Greetz,
Chris
Matthew D
Frequent Advisor

Re: P4300 G2 SAS Starter frequently loses quorum

Chris - created a ticket online. After 48 hours the best I got was "we will look at the log files". We have 24x7x4 support. Our office has been closed but I'm going to go in tomorrow and call. We lost quorum twice over the weekend. I don't like driving to work all the time just to reset a storage node. The Lefthand support group seems absolutely top notch when I can actually get a hold of someone but I often have trouble getting an e-mail contact and I'm regularly on hold for 45-60 minutes when I call. Thanks for the suggestion - I will ask for this to be elevated tomorrow.
Matthew D
Frequent Advisor

Re: P4300 G2 SAS Starter frequently loses quorum

My ProCurve 2900-24G switches have all of the fault-finder options enabled at medium sensitivity. I don't see any fault log entries, but could this feature be interfering with the high load of iSCSI?
Joshua Small_2
Valued Contributor

Re: P4300 G2 SAS Starter frequently loses quorum

Hi,

I'm grasping at straws here but did I read in another post you had disabled spanning tree on the switches?

By all means, portfast the ports the servers and SAN connect to, and couple that with bpduguard for safety. But disabling spanning tree could well be preventing you from seeing an issue somewhere. If, for example, those 10G trunks were misbehaving, who knows what the resulting loop would do?
Matthew D
Frequent Advisor

Re: P4300 G2 SAS Starter frequently loses quorum

Joshua - I appreciate your feedback. I'm confident that there are no loops because the links are in a trunk group, and if the trunk group weren't working, the resulting bridge loop would spike the utilization on those ports. I'm seeing <40% utilization at peak and about 3% on average. Also, the HP fault finder is enabled, which supposedly catches this condition.

All - support clued me in to a patch for the G2 series, patch 10078, that appears to fix this very issue (SAN NIC port suddenly dropping all packets). Last time I reached out to support the patch didn't exist. I'm installing now and hopefully this patch does the trick.
Vegard Hals
Occasional Visitor

Re: P4300 G2 SAS Starter frequently loses quorum

Hi
I'm experiencing similar problems with LH G2 SAN.

Did the patch solve your problem?

Re: P4300 G2 SAS Starter frequently loses quorum

I'd be questioning the use of Flow Control on those switches - it really shouldn't be used these days.

I'd be using some packet monitoring software with port mirror to see what the TCP traffic is doing.

Welcome to the problems of IP in the storage world. :)
Matthew D
Frequent Advisor

Re: P4300 G2 SAS Starter frequently loses quorum

Yes, the patch fixed our issue 100%. Not even the slightest hiccup since. That was definitely our problem.

With regard to flow control - we had no choice but to turn it on. Lefthand required it before bonding the gigabit Ethernet ports.