StoreVirtual Storage

4 node p4500 cluster iops and queue depth

 
danletkeman
Frequent Advisor

4 node p4500 cluster iops and queue depth

Hello,

 

We have a 4-node P4500 cluster with 600GB 15K SAS drives. Each node has both 1Gb NICs connected to different switches using ALB. Our switches are mediocre at best and we are noticing output drops. I know they need to be replaced, but that will have to wait for now.

 

The problem we are having is slowness with some of the servers on the SAN. I did some quick tests with CrystalDiskMark and Iometer, and the most I can get out of the SAN is around 100MB/sec during quiet periods, but during peak times some servers can't even manage 10MB/sec on a disk test. As a comparison, I ran some tests against the local storage on our ESX servers, and those RAID 5 arrays can do 1000MB/sec. The speed difference is very noticeable when doing large file copies, migrations, and things of that sort.

 

What I was wondering is what everyone else is seeing for IOPS and queue depth on their P4500s. We currently have around 25 VMs running on the P4500, spread across 3 LUNs. These VMs are everything from file servers to management servers to email servers. We are seeing around 1000-2500 IOPS, with queue depth averaging 10-20 and bursting up to 150 (this seems quite high, but it only stays there for a short period of time).

 

My next question: would it be worth upgrading everything from 1Gb NICs to 10Gb NICs and replacing the switches, since we need to anyway?

 

If so, will we run into queue depth problems once everything is on 10Gb?

 

We do currently have MPIO configured on the ESX hosts, so ideally we could push up to 2Gbps if needed, but only if the switches could handle it.

 

Thoughts?

 

Dan.

9 REPLIES
oikjn
Honored Contributor

Re: 4 node p4500 cluster iops and queue depth

Have you enabled jumbo frames and flow control? If you haven't, that might help with throughput at busy times.

 

100MB/s is about saturation for a 1Gb connection, so that seems about right unless you are using better switches and adding LACP. You mention 1000MB/s on your other system, so it sounds like you already have 10Gb set up; it would make sense to get 10Gb for the P4500 as well.

 

 

Just curious, what are your latencies when your queue is around 10, and when it is around 150?

Gediminas Vilutis
Frequent Advisor

Re: 4 node p4500 cluster iops and queue depth

I think only monitoring software can tell you whether an upgrade to 10G will solve anything. If the 1G links are not saturated at the moment, there is no point in upgrading to 10G. Do you have any switch or P4500 node port traffic/error graphing solution in place (freeware like Cacti/MRTG, or anything commercial)?

 

Regarding queue depths/IOPS/latencies - everything depends on your usage profile (assuming all the best practices, like flow control, jumbo frames, etc., are in place). I also assume you have SAN/iQ 9+, which rebalances iSCSI sessions between nodes (i.e. the three LUNs are terminated on different nodes in your cluster).

 

From my experience: I have a 4-node P4500 cluster with 450GB disks (RAID 5 inside each node) on a 1G network, running about 160-170 VMs on 4 LUNs (no large file servers). Queues start to build and latencies start to increase at 3-4K IOPS and above. The R/W ratio is about 50/50.

 

Gediminas

danletkeman
Frequent Advisor

Re: 4 node p4500 cluster iops and queue depth

I have flow control enabled all the way through from SAN to ESX host. From all the reading I have done, everyone says not to enable jumbo frames, as there is no benefit. Yes, I get 1000MB/s on my other system because it is a local RAID card with 8 drives in a RAID 5 config, so there is no switch bottleneck.

 

Latencies on some of the nodes jump to 30 or 40ms, but are mostly under 0.3ms.

danletkeman
Frequent Advisor

Re: 4 node p4500 cluster iops and queue depth

I will monitor the interfaces facing the nodes and see what the usage is like. My guess would be around 10MB/sec steady, as that is what I have seen in the CMC.

 

Flow control is all set up. Jumbo frames are disabled, as I stated above; no need for that. Every node is on the latest release (9.5.1215, I believe), and yes, everything is balanced between the hosts.

 

Thanks for the stats; that gives me a better idea of what to expect as a maximum.

 

I still think my biggest problem is the output drops on the switches, caused by their small buffers. More bandwidth would be nice, but I just don't see usage going that high on a regular basis with the 25 VMs we are running. I did rebuild the SAN about 5 months ago and went from LACP to ALB, as some people said the speed difference was negligible, but I think LACP just worked better. The main reason I changed to ALB was that the switches I am using are not stacked. But now I see there is a ton of traffic traversing the trunk between the switches, which also has a lot of output drops.

 

What happens in my scenario is that each node and each host has one NIC in each switch, and there is a 4Gig trunk between the switches. Each ESX host has MPIO configured with two NICs, and each NIC accesses only one node/LUN, but it will sometimes reach the node via the trunk between the switches, which creates more output drops.

 

So I almost want to just drop the second switch, put everything on one switch, and change the ALB back to LACP. The downside is I lose the switch redundancy. Either that, or I replace the switches.

oikjn
Honored Contributor

Re: 4 node p4500 cluster iops and queue depth

I thought ALB had to be on the same switch, just like LACP? I haven't tested it, but the Intel PROSet description talks about switch port failure redundancy, not switch redundancy.

 

Jumbo frames should help a bit if you are short on buffer memory, since your packet I/O gets more efficient. You might want to try enabling them, but then again, if you are getting port packet errors it's probably not going to help enough, and an upgraded switch is in order. What switches are you using? Can you tell me what your average I/O read/write latency is (just curious)?

danletkeman
Frequent Advisor

Re: 4 node p4500 cluster iops and queue depth

LACP has to be on the same switch or switch stack. ALB can span separate switches; that is one of its major advantages.

 

I have two 3560G switches.  

 

Lightly loaded right now at 1000 IOPS, read latency is 1.5ms and write latency is 6.5ms.

Gediminas Vilutis
Frequent Advisor

Re: 4 node p4500 cluster iops and queue depth

I had a very bad experience with 3560 switches: output drops started to build at about 600 Mbps of traffic (i.e. at ~60% of total bandwidth), and that was with more or less steady WAN traffic. At what traffic level do output drops start in your case? And what fraction of packets (dropped / total) are dropped?

 

Gediminas

 

danletkeman
Frequent Advisor

Drops happen at 200Mbps already. Etherchannel: Port-c...

Drops happen at 200Mbps already.

 

Etherchannel:

 

Port-channel10 is up, line protocol is up (connected)
Hardware is EtherChannel, address is 5835.d94c.0f96 (bia 5835.d94c.0f96)
Description: TR-RS-AH-3560G
MTU 1500 bytes, BW 4000000 Kbit/sec, DLY 10 usec,
reliability 255/255, txload 11/255, rxload 2/255
Encapsulation ARPA, loopback not set
Keepalive set (10 sec)
Full-duplex, 1000Mb/s, link type is auto, media type is unknown
input flow-control is off, output flow-control is unsupported
Members in this channel: Gi0/19 Gi0/20 Gi0/21 Gi0/22
ARP type: ARPA, ARP Timeout 04:00:00
Last input 00:00:00, output 4w1d, output hang never
Last clearing of "show interface" counters never
Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 13184927
Queueing strategy: fifo
Output queue: 0/40 (size/max)
30 second input rate 37104000 bits/sec, 4910 packets/sec
30 second output rate 184208000 bits/sec, 16069 packets/sec
24694352714 packets input, 31389835595303 bytes, 0 no buffer
Received 10130753 broadcasts (9331773 multicasts)
0 runts, 0 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
0 watchdog, 9331773 multicast, 0 pause input
0 input packets with dribble condition detected
33876720032 packets output, 45771907579818 bytes, 0 underruns
0 output errors, 0 collisions, 1 interface resets
0 unknown protocol drops
0 babbles, 0 late collision, 0 deferred
0 lost carrier, 0 no carrier, 0 pause output
0 output buffer failures, 0 output buffers swapped out

 

San node port:

 

GigabitEthernet0/4 is up, line protocol is up (connected)
Hardware is Gigabit Ethernet, address is 5835.d94c.0f84 (bia 5835.d94c.0f84)
Description: p4500 node4 nic2
MTU 1500 bytes, BW 1000000 Kbit/sec, DLY 10 usec,
reliability 255/255, txload 1/255, rxload 5/255
Encapsulation ARPA, loopback not set
Keepalive set (10 sec)
Full-duplex, 1000Mb/s, media type is 10/100/1000BaseTX
input flow-control is off, output flow-control is unsupported
ARP type: ARPA, ARP Timeout 04:00:00
Last input never, output 00:00:01, output hang never
Last clearing of "show interface" counters never
Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 8408109
Queueing strategy: fifo
Output queue: 0/40 (size/max)
30 second input rate 22535000 bits/sec, 2205 packets/sec
30 second output rate 5351000 bits/sec, 790 packets/sec
15667686494 packets input, 22136147119780 bytes, 0 no buffer
Received 7 broadcasts (7 multicasts)
0 runts, 0 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
0 watchdog, 7 multicast, 0 pause input
0 input packets with dribble condition detected
9852653998 packets output, 11658164207537 bytes, 0 underruns
0 output errors, 0 collisions, 1 interface resets
0 unknown protocol drops
0 babbles, 0 late collision, 0 deferred
0 lost carrier, 0 no carrier, 0 pause output
0 output buffer failures, 0 output buffers swapped out

 

 

Gediminas Vilutis
Frequent Advisor

Re: Drops happen at 200Mbps already. Etherchannel: Port-c...

 

Well, with bursty iSCSI traffic and almost non-existent buffers, it might.

 

As far as the interface stats are concerned, I can't tell much from them: it seems the counters haven't been cleared since reboot, so it is hard to judge whether the packet drops have a noticeable impact on SAN performance. Could you clear the counters during peak time and look at them again after 5-10 minutes? Right now the drop ratio is below 0.1% on both interfaces - not good, but not lethal yet. Keep in mind that every dropped packet forces TCP to retransmit it, which adds latency for the application. If dropped packets exceed 1% during a heavy usage period, I would change the switches. Or, if that is not an option, I would play with the Cisco QoS settings.

 

Gediminas