Community Home > Servers and Operating Systems > Operating Systems > Operating System - Linux > NC510F + Fedora Core 8 - LSA = very bad perfs
05-19-2008 05:35 AM
NC510F + Fedora Core 8 - LSA = very bad perfs
I'm using two NC510F boards on Fedora Core 8 (kernel 2.6.23.1-42.fc8) and I'm seeing very poor performance (460 Mbit/s) without HP Linux Socket Acceleration.
The boards are connected directly by a fiber optic cable (1 m, 62.5/125) and each has a static IP address.
I've installed nx_nic-3.4.336-1 and nx_tools-3.4.336-1 but not nxlsa_3.4.336-1 (it does not compile).
Network performance is measured with "ping -q -i 0 -s 65507 IP_ADDRx" on each interface.
I only get 460 Mbit/s of network bandwidth, whereas on another system I get 722 Mbit/s on a 1-gig network (with nVidia chips).
What's wrong?
Is LSA absolutely required to obtain acceptable performance? Or is LSA "just" needed to decrease CPU usage?
Regards,
Steve
05-20-2008 09:10 AM
Re: NC510F + Fedora Core 8 - LSA = very bad perfs
In what sort of system(s) are the NC510Fs installed?
Into which slot(s)?
Are they _electrically_ x8 slots?
To which CPU(s) are interrupts from the NIC(s) being sent? (grep /proc/interrupts)
Next - when running this ping test is there any one CPU on either end at 100% CPU utilization?
After that, what do you get with a netperf TCP_STREAM test between the two endpoints?
What are the settings for tcp_rmem and tcp_wmem on each side?
sysctl -a | grep rmem
sysctl -a | grep wmem
Does any one CPU saturate when running the netperf TCP_STREAM test?
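A quick diagnostic sketch for gathering the CPU-side answers to the questions above; the grep pattern and peer address are assumptions, so substitute your own NIC/driver name and IP:

```shell
# Which CPU(s) take the NIC's interrupts? ("nx" assumes the nx_nic driver)
grep -i nx /proc/interrupts

# Kick off the large-packet flood ping in the background (needs root)...
ping -q -i 0 -s 65507 192.168.1.2 &
PING_PID=$!

# ...and watch per-CPU utilization for ten seconds
# (mpstat is part of the sysstat package).
mpstat -P ALL 1 10

kill "$PING_PID"
```

Watching per-CPU rather than aggregate utilization matters here: a single saturated core on a four-core box shows up as only 25% overall.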
05-22-2008 12:47 AM
Re: NC510F + Fedora Core 8 - LSA = very bad perfs
My answers are below:
> In what sort of system(s) are the NC510Fs
> installed?
CPU: two dual-core Intel Xeon 5138 @ 2.13 GHz (32-bit)
FSB: 1033 MHz
RAM: 4 GB DDR2-667
OS : Fedora Core 8 (kernel 2.6.23.1-42.fc8)
CPU0/CPU1 + CPU2/CPU3
> Into which slot(s)?
> Are they _electrically_ x8 slots?
PCIe x8 (please see lspci logfile in attachment)
> To which CPU(s) are interrupts from the
> NIC(s) being sent? grep
> /proc/interrupts
CPU 3
> Next - when running this ping test is there
> any one CPU on either end at 100% CPU
> utilization?
[Sender]
No, one CPU at 10%, the others are idle.
[Receiver]
No, one CPU at 20%, the others are idle.
> After that, what do you get with a netperf
> TCP_STREAM test between the two endpoints?
netperf is not installed, so I've grabbed netperf-2.4.4 from ftp://ftp.netperf.org/netperf.
./configure succeeds, but make fails in netlib.c (undefined references to __CPU_ZERO and __CPU_SET).
> What are the settings for tcp_rmem and
> tcp_wmem on each side?
The same values on both sides:
net.core.rmem_max = 131071
net.core.rmem_default = 111616
net.ipv4.tcp_rmem = 4096 87380 4194304
net.core.wmem_max = 131071
net.core.wmem_default = 111616
net.ipv4.tcp_wmem = 4096 16384 4194304
vm.lowmem_reserve_ratio = 256 32 32
> Does any one CPU saturate when running the
> netperf TCP_STREAM test?
Not performed.
Regards,
Steve
05-22-2008 09:16 AM
Re: NC510F + Fedora Core 8 - LSA = very bad perfs
None of the CPUs are saturated during the test - also good.
Is CPU 3 the only CPU assigned interrupts from the NIC? IIRC the 336 bits (re)enable MSI-X support and should give you four or five IRQs associated with the interface. What sort of host system model is this, and are there any messages about MSI (dmesg | grep -i msi) in dmesg? There are some platforms on which MSI-X won't be enabled - I don't know the list myself though.
Being "Mr. Netperf" I'm always leery of using ping for bandwidth measurements :) so getting netperf going would be goodness. WRT those compile errors, 2.4.4 got hit by another change in the API for sched_setaffinity(). That is fixed in the top-of-trunk version. If you have a subversion client on your system(s), you can point it at:
http://www.netperf.org/svn/netperf2/trunk/
or you can just look at src/netlib.c there and back-port the change to your 2.4.4 bits. The other option is to simply comment out HAVE_SCHED_SET_AFFINITY (IIRC) in config.h and recompile. Netperf/netserver will lose the ability to bind to a specific CPU, but if need be we can work around that with taskset or numactl.
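With the internal affinity code compiled out, the binding can be done externally along these lines (a sketch; the CPU numbers and peer address are illustrative - pick a core other than the one taking the NIC's interrupts):

```shell
# On the receiving host: pin netserver to CPU 1
taskset -c 1 netserver

# On the sending host: pin netperf to CPU 1 and run against the receiver
taskset -c 1 netperf -t TCP_STREAM -l 30 -H 192.168.1.2
```

numactl --physcpubind would serve the same purpose on NUMA-aware setups.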
When we have netperf/netserver compiled and running, if using explicit setsockopt() calls (e.g. the -s and -S test-specific options) it will be necessary to tweak net.core.rmem_max and net.core.wmem_max to something like 2 MB, otherwise the setsockopt() calls will be clipped.
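A minimal sketch of that tweak, run as root on both hosts (2 MB per the suggestion above; these ceilings cap what -s/-S can actually request):

```shell
# Raise the per-socket buffer ceilings so explicit setsockopt() requests
# up to 2 MB are honored rather than silently clipped.
sysctl -w net.core.rmem_max=2097152
sysctl -w net.core.wmem_max=2097152

# Verify the new values took effect.
sysctl net.core.rmem_max net.core.wmem_max
```

These settings don't survive a reboot; /etc/sysctl.conf is the place to make them persistent.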
My "canonical" netperf tests for a first pass would be a unidirectional TCP_STREAM test, a single-connection bidirectional bulk TCP_RR test, and a single-connection, single-byte TCP_RR test:
netperf -c -C -t TCP_STREAM -l 30 -i 30,3 -H
netperf -c -C -t TCP_RR -r m -l 30 -i 30,3 -H
netperf -c -C -t TCP_RR -l 20 -i 30,3 -H
after having ./configure'd netperf with --enable-burst to enable the test-specific -b option above. The -i 30,3 tells it to run at least three iterations and no more than 30 in an attempt to be (by default, unless a -I option is present) 99% confident the result reported is within +/- 2.5% of the actual mean. So, each of those commands will run anywhere between 90 and 900 seconds. You can omit the -i option if you are pressed for time.
Single-stream performance is of course not _everything_, and once we get past the single-stream stuff we can discuss using netperf to measure aggregate performance. We can do that here, or it may be better suited to the netperf-talk mailing list hosted on netperf.org. Up to you.
05-26-2008 12:05 AM
Re: NC510F + Fedora Core 8 - LSA = very bad perfs
I'll definitely stop using ping for testing network performance. netperf is simply awesome - thanks for this great utility. I'm now above a gig
(but not at 10 gig yet ;)). Please find the results of the 3 tests in attachment.
Other things I've noted about my platform:
1. ACPI is not turned on, because DMI is not present.
2. When I launch nxudiag -a, only the interrupt test fails. Do I have MSI/MSI-X issues with my platform?
3. I'm using the cfq I/O scheduler.
Regards,
Steve
05-27-2008 08:29 AM
Re: NC510F + Fedora Core 8 - LSA = very bad perfs
Being a four-core system, the 25% CPU util reported by netperf for the receiver suggests that one of the CPUs, probably the one taking interrupts, was saturated during the TCP_STREAM test.
The "single-connection, bidirectional" TCP_RR test was missing the -f m - I'm not sure what it would have done with -r m as a global option :) It may be part of why the result appears to be so low, assuming I did my sums correctly.
When I last installed "336" onto a system, one thing I forgot was to flash the NIC with nx_tools - as such there was a mismatch between firmware and driver which precluded using MSI-X. I'd remembered to install nx_tools, but didn't remember that I still had to run the flash utility. In my defense :) "ethtool" was told (IIRC by the driver), and so was telling me, that the NIC firmware was the same rev as the driver :( The way I found out there was a mismatch was to troll through dmesg output for strings with "nx" in them:
dmesg | grep -i nx | less
I have been told that as the firmware gets rev'ed, the NIC's performance does improve - probably not "and then a miracle occurred" but probably still worthwhile.
WRT no DMI etc - just what model system is this again? I promise I won't stop discussion if it isn't HP, I simply want to know more about the system we are dealing with.
05-27-2008 02:33 PM
Re: NC510F + Fedora Core 8 - LSA = very bad perfs
If _all_ the traffic is TCP, and the systems with the NC510s aren't to be used as routers or bridges, then you can get away with enabling the 9000-byte MTU on just the other kit - the TCP MSS exchange at connection-establishment time will paper over the MTU difference automagically.
However, if the systems with the NC510s are to be routers/bridges, or the comms will include UDP or anything else that doesn't exchange MSSes the way TCP does, everything needs to use the same MTU end to end.
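A sketch of the MTU change itself (the interface name is a placeholder; run as root on each end, and every device in the path - switches included - must accept the larger frames):

```shell
# Enable a 9000-byte MTU on the 10G interface (eth2 is illustrative).
ip link set dev eth2 mtu 9000

# Verify the setting took effect.
ip link show dev eth2 | grep -o 'mtu [0-9]*'

# Sanity-check end to end with a ping that forbids fragmentation:
# an 8972-byte payload + 28 bytes of IP/ICMP headers = 9000 bytes on the wire.
ping -M do -s 8972 -c 3 192.168.1.2
```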
That will likely significantly increase the results you get on the TCP_STREAM test and the "single-connection bidirectional bulk transfer" TCP_RR test. It is unlikely to have much effect on the single-byte TCP_RR test.
At the root of much of this is a simple observation - as far as the de jure Ethernet standards are concerned, it takes just as many CPU cycles to exchange an Ethernet frame on 10G as it did on 1G as it did on 100BT as it did on 10BT. So, if you had a system with a 1G NIC that could get 1G using 1/2 of a CPU, in broad handwaving terms, you should not expect much more than 2 gig from that system with a 10G NIC.
Now, as NICs have progressed from 10BT to 100BT to 1G to 10G, the _implementations_ have added things to make it easier on the host - features like Checksum Offload (CKO), Transport Segmentation Offload (TSO), Large Receive Offload (LRO), and multiple-queue support. Jumbo frames are one of those as well, since they are an implementation detail, not something provided by the IEEE specs.
Those first three - CKO, TSO and LRO (and JF too) - are things that will improve the performance of a single connection. The last - multiple-queue support - is something that really only kicks in with multiple concurrent connections. NICs have also included interrupt avoidance and/or coalescing mechanisms. Those can be two-edged - improving CPU utilization for bulk transfer, but sometimes at the cost of increased latency (lower single-byte TCP_RR performance).
And as if I have not digressed enough... :) If you got netperf working in a way that includes a working sched_setaffinity() call, you can try affinitizing netperf/netserver to a CPU other than the one taking interrupts from the NC510. In the TCP_STREAM case that may increase the performance you see with standard 1500-byte frames, as it will have two CPUs working the problem. Normally the Linux stack will tend to cause the receiving process to run on the same CPU that took the interrupt from the NIC. With multi-core processors it is probably best to bind netperf/netserver to a core in the same processor rather than a core in another processor. Which cores are in which processor is a task left to the reader :) and can perhaps be deduced by looking at the output of /proc/cpuinfo and the various IDs therein.
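One possible sketch for that reader exercise - mapping logical CPUs to physical packages via /proc/cpuinfo, then binding to a core in the same package as the interrupt CPU (the CPU numbers in the taskset line are illustrative, and the "physical id"/"core id" fields may be absent on some kernels):

```shell
# Print which package and core each logical CPU belongs to.
awk -F': *' '
  /^processor/   { cpu = $2 }
  /^physical id/ { pkg = $2 }
  /^core id/     { print "CPU " cpu " -> package " pkg ", core " $2 }
' /proc/cpuinfo

# If the NIC interrupts land on CPU 3 and CPU 2 shares its package,
# something like this keeps both cores of that package on the problem:
taskset -c 2 netserver
```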
12-04-2008 05:45 PM