Array Performance and Data Protection

SOLVED
davidbaril127
Advisor

Maximum cached read IOPS? ... to validate host network path config.

I have a new Nimble CS-1000 connected via 10GbE and Jumbo frames

I am in the process of validating host-side configuration settings.

I ran a simple read test that repeatedly re-reads the same 4 KB block using unbuffered direct IO.

The goal of the test is to read directly from the Nimble controller cache to test network connections and network latency.

Ideally, on a tight network with low latencies, the performance would approach wire speed as the IO size increased, because the round-trip network latencies would be amortized across larger requests. This also tests that the large TCP window sizes are working properly.
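To make the shape of the test concrete, it boils down to a loop like the sketch below (this is an illustration, not my actual test harness; the device path and iteration count are placeholders):

```
# Sketch of a cached-read latency test: re-read the same 4 KB block with
# O_DIRECT so neither the page cache nor read-ahead gets involved.
# The device path and iteration count below are placeholders.
import mmap
import os
import time

DEVICE = "/dev/sdX"        # placeholder for the iSCSI block device under test
BLOCK_SIZE = 4096          # 4 KB read size
ITERATIONS = 1000

# O_DIRECT requires an aligned buffer; an anonymous mmap is page-aligned.
buf = mmap.mmap(-1, BLOCK_SIZE)
fd = os.open(DEVICE, os.O_RDONLY | os.O_DIRECT)

start = time.perf_counter()
for _ in range(ITERATIONS):
    os.preadv(fd, [buf], 0)          # synchronous re-read of the same block
elapsed = time.perf_counter() - start
os.close(fd)

print(f"{ITERATIONS / elapsed:.0f} IOPS, "
      f"{elapsed / ITERATIONS * 1e3:.3f} ms average latency per IO")
```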

I am getting unusually low values ... only about 2700-3000 4 KB reads per second from cache of the SAME data.

I've seen the read IOPs increase to ~ 3500 when I increase the read size to 8kb ... this could be explainable given Nimble's preference for 8kb page size. If I increase the read size (from same locations) to 1MB or more, the IO rate approaches 350 MB/sec single-threaded with no overlap or read-ahead ... which is a reasonable rate.

The iSCSI immediate-data max size is 32 KB.

CentOS 7.2 is the host OS, running the bundled open-iscsi stack.

I've gathered host-side network statistics at the Ethernet and TCP layers, and for a 1000 IO test, 1002 to 1005 packets are sent and received.  No retransmits, no delayed acks.  I even have a latency plot, which shows a few outliers, but otherwise it looks fairly consistent, except that the round trip time is 0.3 to 0.4 milliseconds.

Linux "ping" reports min/avg/max/mdev of 0.224/0.274/0.431/0.058 ms.

The Linux host is running as a VM under VMware ESXi 6.x on a relatively idle system, using the vmxnet3 paravirtual NIC.

I'm running dual-fabric 10GbE for iSCSI, on separate subnets. The Nimble iSCSI volume is mounted as an external iSCSI LUN.

For this initial testing, I am only using a single 10GbE path.

The network path is vmxnet3 virtual NIC => VMware virtual switch => ESXi physical 10GbE NIC => External 10GbE Switch => Nimble.

I am not that familiar with iSCSI network latencies, especially in a virtualized environment. I have extensive experience with 10GbE and faster networking in non-virtualized environments (non-iSCSI), and with low-latency Fibre Channel (non-virtualized, with only a single hop).

If I increase the number of threads and introduce multipath, some unexpected curves show up (see the graphs).

Let me emphasize ... these synthetic tests are designed to read from Nimble memory cache ... to stress the network connection.

I am running "noop" IO scheduler, on both block devices and the multipath pseudo-device.

My gut feel is that I am experiencing some additional latency caused by some form of interrupt moderation, large receive offload, or host TCP stack coalescing ... but this test results in a single packet transmit (the read request) and a single packet receive (the data from the Nimble). Since this artificial test is unbuffered and synchronous, there is no opportunity for coalescing.
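If coalescing or LRO were the culprit, it should show up in the guest NIC settings; this is the kind of check I have in mind (the interface name is a placeholder, and how much of this vmxnet3 actually reports is driver-dependent):

```
# Dump the interrupt-coalescing and offload settings of the guest NIC via
# ethtool, keeping only the lines of interest (rx/tx usecs and LRO).
# The interface name is a placeholder.
import subprocess

IFACE = "ens192"   # placeholder vmxnet3 interface name

for args in (["-c", IFACE], ["-k", IFACE]):
    print("$ ethtool " + " ".join(args))
    out = subprocess.run(["ethtool"] + args, capture_output=True, text=True)
    for line in out.stdout.splitlines():
        if "usecs" in line or "large-receive-offload" in line:
            print(" ", line.strip())
```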

The latency chart shows that there are no massive spikes of the kind that delayed acks or packet retransmits would cause. The network statistics confirm no retransmits or delayed acks.

This is likely not a Nimble issue per se, but I was hoping that others in the community may have experienced similar behavior and have identified which host or ESXi configuration setting was adding the additional latency, easily seen with "ping".

I will admit I have not gone through and implemented every procedure identified in the VMware best practice guide for low-latency operation, but on a relatively idle large ESXi host I was not expecting that it would make that much of a difference ... at these performance levels.

Are these iSCSI latencies representative or am I overlooking some configuration setting?

Thanks for your help.

Dave B

5 REPLIES
sdaniel47
HPE Pro

Re: Maximum cached read IOPS? ... to validate host network path config.

Dave --

Interesting graphs!  Thanks for sharing.

Your top-line question, why only 2.7K IOPS when reading cached data, really comes down to why it is only that fast when using a single thread, correct?

If you are only using 1 thread, then the speed in IOPS times the response time in seconds will always be 1. In your case, 2700 * 0.00037 ≈ 1, which is to say you are averaging about 0.37 milliseconds of response time. Another way to look at this is that your single thread waits for 0.37 milliseconds 2,700 times per second.

In order for it to do more reads, i.e. wait more times per second, the wait time will have to come down.

Your ping time looks to be about 0.27 milliseconds. If we take that as the network overhead, it looks like your CS-1000 is only adding another 0.1 milliseconds. This seems pretty good to me! I think you are getting the expected result.
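Just to spell out that arithmetic (nothing below is measured; it only plugs in your numbers):

```
# Queue depth 1: IOPS multiplied by average latency in seconds is ~1.
iops = 2700
avg_latency_s = 1.0 / iops            # ~0.00037 s = 0.37 ms per IO
ping_rtt_s = 0.00027                  # ~0.27 ms network round trip from your ping
array_overhead_s = avg_latency_s - ping_rtt_s

print(f"average latency : {avg_latency_s * 1e3:.2f} ms")
print(f"network RTT     : {ping_rtt_s * 1e3:.2f} ms")
print(f"array + stack   : {array_overhead_s * 1e3:.2f} ms")
```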

As your graphs show, if you increase the number of threads you start getting much higher throughput.

I don't understand why 8K is better.  What is your volume block size?

I also don't understand why your red graph jumps up sharply at 8 threads.  Perhaps that is a function of how your multipathing is set up.

davidbaril127
Advisor

Re: Maximum cached read IOPS? ... to validate host network path config.

Hi Stephen,

Thank you for your feedback. It is useful to receive some confirmation ... and I agree that with ping times in the 0.27 millisecond range (which is not great), that latency will ultimately pace this single-threaded synthetic test.

As to the better performance at 8kb IO size (about 3500-4000 cached read IOPs @ 8KB vs about 2400-2700 cached read IOPs at 4kb) ... I have a few ideas.

First, I did confirm that the Nimble page size for the volume is set to 4KB, which is appropriate for general Linux file sharing.  My eventual XFS file system will have a 4kb block size.

I did confirm that there is a significant drop in performance for IO sizes less than 4kb ... but that is somewhat expected given the Nimble architecture and best practice guidance. I did not want to get overly involved with providing too much detail ... and get bogged down, but here is a quick illustration of memory-cached IOPS vs IO size.

After further thought, I now believe one of the contributing factors to the 8kb IO efficiency is the fact that we are using Jumbo frames (so the larger read avoids additional per-packet latency penalties), along with some virtualized CPU scheduling biases.

Since I am not on a completely idle system, my results vary a bit, and I need to take care to run all the different options within the same small time frame. Running a 4kb test now and an 8kb test 20 minutes later leads to sample skew. Running the test variations back-to-back improves the consistency.

When I re-run my tests ... the 8kb test performance relative to the 4kb test varies. Sometimes it is better, sometimes it is slightly slower. I can't control the environment enough on this semi-idle system.

The 4kb read case results in a single packet send and a single packet receive. I have confirmed this by gathering statistics at the iSCSI, TCP (netstat), and Ethernet layers ... capturing before and after snapshots and computing the deltas.
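The delta gathering itself is nothing fancy; roughly this kind of snapshot-and-subtract (the interface name and the particular counters shown are illustrative, not my exact script):

```
# Snapshot TCP and per-interface packet counters before and after a test run
# and print the deltas. Interface name and counters shown are illustrative.

def read_tcp_counters():
    """Parse the Tcp header/value line pair from /proc/net/snmp into a dict."""
    with open("/proc/net/snmp") as f:
        tcp_lines = [line.split() for line in f if line.startswith("Tcp:")]
    return dict(zip(tcp_lines[0][1:], map(int, tcp_lines[1][1:])))

def read_iface_packets(iface):
    """Return (rx_packets, tx_packets) for one interface from /proc/net/dev."""
    with open("/proc/net/dev") as f:
        for line in f:
            if line.strip().startswith(iface + ":"):
                fields = line.split(":")[1].split()
                return int(fields[1]), int(fields[9])
    raise ValueError(f"interface {iface} not found")

IFACE = "ens192"   # placeholder interface name
tcp_before, pkts_before = read_tcp_counters(), read_iface_packets(IFACE)
# ... run the 1000-IO read test here ...
tcp_after, pkts_after = read_tcp_counters(), read_iface_packets(IFACE)

print("RX packets     :", pkts_after[0] - pkts_before[0])
print("TX packets     :", pkts_after[1] - pkts_before[1])
print("TCP retransmits:", tcp_after["RetransSegs"] - tcp_before["RetransSegs"])
```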

The 8kb read case also results in a single packet send and a single packet receive. However, the "send effort" (sending the iSCSI read request) is exactly the same for all IO sizes up to the iSCSI max receive size: one packet of fixed size.

The "receive effort" for the 8kb case has the same number of packets (since the larger packet still fits in the Jumbo frame) ... but the data payload is larger by the additional 4kb.

So the effort for the 8kb case is NOT what you might first expect.  In terms of packets and round trip latency, they are the same ... 1 short packet (read request) and 1 long packet (received data) per iteration.  Only the received data portion gets larger for the 8kb test.

However, I believe that Virtual CPU scheduling is a major contributing factor.  If you are not doing much work (since you are waiting due to network latency), you are more likely to be temporarily scheduled off a CPU, and re-scheduling you back on to the CPU adds latency.  As you perform more work, VMware understands that you need more CPU, and is more likely to keep you on the CPU and switch you off less frequently. So even if you are doing more work (and should take longer), you may have less virtualization-induced latency than the 4kb case which presents a lighter load.

Another variable: depending on how the SpeedStep-like power settings are configured, the CPU can enter a reduced power state in the gap between packets (the inter-frame gap), even at 10GbE line rate. If this reduced power (or speed) state is entered, it takes some amount of time to get back up to full speed.

So I have more homework to do on validating how the system is configured from a power-saving standpoint.  This is known to add latency for high speed IO handling.  In my environment, there are 4 power/speed layers ... the "hardware" (BIOS) level settings, the hypervisor settings, the settings of the Guest VM, and the Linux settings within the Guest VM.
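At least the Linux layer of that homework can be inspected from inside the guest; something along these lines (standard sysfs paths, cpu0 only as an example; the BIOS, ESXi, and VM-level layers need their own tools):

```
# Show the CPU frequency governor and cpuidle C-state residency inside the
# guest; heavy residency in deep C-states hints that the CPU naps between
# packets. Paths are standard Linux sysfs; only cpu0 is shown.
import glob
import os

gov_path = "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor"
if os.path.exists(gov_path):
    with open(gov_path) as f:
        print("governor:", f.read().strip())

for state in sorted(glob.glob("/sys/devices/system/cpu/cpu0/cpuidle/state*")):
    with open(os.path.join(state, "name")) as f:
        name = f.read().strip()
    with open(os.path.join(state, "usage")) as f:
        usage = f.read().strip()
    with open(os.path.join(state, "time")) as f:
        time_us = f.read().strip()
    print(f"{name}: entered {usage} times, {time_us} us total")
```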

I was able to capture some "scheduling" behavior on one of my 8kb runs. Notice that the latency is substantially higher for the first ~200 IOs, and then settles down to the 0.35 millisecond range seen in the 4kb test. If I run the 8kb test again, I might not see this 2-tier behavior.

Thank you again for your comments. They confirm that if I am getting 0.24-ish millisecond ping latencies ... I am running reasonably well, which is what I believed.

I'm still hoping that someone will comment that they are seeing much better or similar ping latencies on their configuration.  If they are virtualized and experiencing much better latencies, I would want to learn more.

Dave B.

sdaniel47
HPE Pro

Re: Maximum cached read IOPS? ... to validate host network path config.

Thanks for the updated data.  I'll keep watching this thread to see what others contribute.

chrisfoster42
New Member
Solution

Re: Maximum cached read IOPS? ... to validate host network path config.

Hi Dave,

Nothing special, just a quick-and-dirty ping test from a small production ESXi host (with multipath); it came back with round-trip min/avg/max = 0.252/0.272/0.286 ms. Single hop, 10GbE/Jumbo to a CS-1000.

Hope that helps.

davidbaril127
Advisor

Re: Maximum cached read IOPS? ... to validate host network path config.

Hi Chris,

It helps very much.

Thank you very much for taking the effort to run a quick ping test. It helps confirm that my ping times are in the same ballpark. I do like the consistency of your results ... the min/avg/max are tightly grouped.

The small latency impact is not a bad tradeoff for the benefits of virtualization.

Thank you again for the feedback.

Dave B