
Re: Tigon3: checksum handoff produces corrupted packets with correct checksum

 
SOLVED
Diego Liziero
Advisor

Tigon3: checksum handoff produces corrupted packets with correct checksum

I hope this is the right forum.

In our HP-based server farm we have had some trouble with network-based software.
After months of debugging, we found the cause:
a correct TCP/IP packet (still correct when captured on the sending host, before the hardware checksum is computed) sometimes reaches the destination host with a correct checksum but a corrupted payload.

Forcing software checksumming seems to solve the problem, but it slows down the network transfer rate.

It seems to be a hardware/firmware/driver issue.

Any suggestions?

Regards,
Diego.

Sending system: HP Proliant ML 350 G3
ethernet: eth0: Tigon3 [partno(N/A) rev 1002 PHY(5703)] (PCI:33MHz:32-bit) 10/100/1000BaseT Ethernet
OS: Linux RHEL 3.x and 4.x (same trouble with both 2.4 and 2.6 kernels)

24 REPLIES
rick jones
Honored Contributor

Re: Tigon3: checksum handoff produces corrupted packets with correct checksum

I was going to ask if this was with TSO on or off, but if it happens under 2.4 kernels, that means TSO isn't in the picture.

I will ask which version of the tg3 driver you are running in each case.

Oh, one other question - is this with the "standard" size or "jumbo" frames?
there is no rest for the wicked yet the virtuous have no pillows
Diego Liziero
Advisor

Re: Tigon3: checksum handoff produces corrupted packets with correct checksum

TCP segmentation offload has always been off.

Driver version: tg3.c:v3.27-rh (May 5, 2005) for kernel 2.6
tg3.c:v1.5 (March 21, 2003) for 2.4 (both Red Hat kernels)

MTU is always 1500 (no jumbo frames)

Here is an example (od -c and a diff -u) of a corrupted payload received with correct checksum (payload size is 556 bytes).

--- payload-original-od-c 2006-09-14 08:30:14.000000000 +0200
+++ payload-received-od-c 2006-09-14 08:30:27.000000000 +0200
@@ -1,9 +1,9 @@
0000000 200 \0 002 ( \0 \0 \0 1 \0 \0 \0 \0 \0 \0 \0 020
0000020 \0 \0 \0 \v \0 \0 \0 026 \0 \0 \0 \t \0 \0 \0 001
-0000040 177 377 377 377 \0 \0 \0 \0 \0 \0 \0 001 \0 \0 \0 \0
-0000060 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 004 \0 \0 \0 004
-0000100 \0 \0 \0 001 \0 \0 002 023 \0 \0 \0 \0 \0 \0 \0 001
-0000120 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0
+0000040 \0 \0 \0 \0 \0 \0 \0 001 \0 \0 \0 001 \0 \0 \0 024
+0000060 \0 \0 \0 \a \0 \0 \0 \0 \0 \0 \0 001 \0 \0 \0 \0
+0000100 \0 \0 \0 002 001 \b 260 001 \b 020 \0 \0 \0 \0
+0000120 \0 \0 \0 001 \0 \0 \0 \v 001 0 \b 020 001 0 017 240
0000140 \0 \0 \0 001 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 006
0000160 \0 \0 001 370 \0 \0 \0 004 \0 \0 \0 001 \0 \0 \0 001
0000200 \0 \0 \0 032 g a l a t e a . c o m u

Regards,
Diego.
rick jones
Honored Contributor

Re: Tigon3: checksum handoff produces corrupted packets with correct checksum

Interesting that it is more than just a single-bit error or the like - that a number of bytes become all-ones is interesting. If it were some bad memory on the NIC I'd expect (perhaps naively) to see single-bit errors, not such large multi-bit errors.

tg3 has revved a bit since those versions. If the corruption still happens after getting on to the latest versions I'd suggest calling it in as a hardware problem.
there is no rest for the wicked yet the virtuous have no pillows
michael chan_4
Advisor

Re: Tigon3: checksum handoff produces corrupted packets with correct checksum

Did you solve the problem by disabling tx checksum on the sender or rx checksum on the receiver?
Diego Liziero
Advisor

Re: Tigon3: checksum handoff produces corrupted packets with correct checksum

> Did you solve the problem by disabling tx checksum on the sender or rx checksum on the receiver?

As stated before, disabling rx hardware checksum on the receivers can't help, as the checksum is really correct despite the corrupted payload. With hardware tx checksum disabled on the sending host we are not able to reproduce the issue, but the bit transfer rate is decreased.

> Interesting that it is more than just a single-bit error or the like [..]

Yes, I thought of something like a buffer overflow. I'll have a look at the nearby captured packets to see whether I can find the corrupted pattern.

> tg3 has revved a bit since those versions

I'll try the latest on psp-7.60.
Mark Partain
Occasional Advisor

Re: Tigon3: checksum handoff produces corrupted packets with correct checksum

I am attaching a script file that can tell me about your networking setup.

Please download the script nic.sh to the linux machine and do the following at the command prompt logged in as root:
# dos2unix nic.sh
# sh nic.sh > /tmp/nic.log

Attach the nic.log so I can investigate the network settings and information.

Mark
If it ain't broke, I'm not done yet!
Diego Liziero
Advisor

Re: Tigon3: checksum handoff produces corrupted packets with correct checksum

Mark Partain> Attach the nic.log so I can investigate the network settings and information.

Here it is.

rick jones> tg3 has revved a bit since those versions. If the corruption still happens after getting on to the latest versions I'd suggest calling it in as a hardware problem.

Tested with the latest tg3.c:v3.58c (May 19, 2006) and the corruption before the hardware checksum happened again.

Are you sure it can't be a driver/firmware issue?
I ask because it happens under heavy load (the server is the tape backup server, and the corruption happens while it sends packets to the clients during the backup process).

Regards,
Diego.
Mark Partain
Occasional Advisor

Re: Tigon3: checksum handoff produces corrupted packets with correct checksum

Thanks for the log file.

Is it possible to either swap eth0 with eth1 or use one of the four ports on the Sundance card to try?

I did not see anything that surprised me in the log file.

Where are you capturing the information about the checksum issue?

Using ethereal on the system under investigation will show checksum problems due to the hardware stripping the checksum off since it is a hardware performed task by default. To alleviate this, a sniffer must be put in line for the capture. A quick way to check that checksum offload is enabled is "ethtool -k eth0".

If swapping the eths shows the problem still present on eth0, then I would have to agree with the earlier statement that it is a hardware issue and that the motherboard Ethernet port has a problem.
If it ain't broke, I'm not done yet!
Mark Partain
Occasional Advisor

Re: Tigon3: checksum handoff produces corrupted packets with correct checksum

Another thing that can be tried is the bcm5700 driver instead of tg3. Here is a link to the bcm5700 driver:

http://h18000.www1.hp.com/support/files/server/us/download/24497.html
If it ain't broke, I'm not done yet!
rick jones
Honored Contributor

Re: Tigon3: checksum handoff produces corrupted packets with correct checksum

Sure, it could still be a driver/firmware issue, but I don't want to discount the possibility of it being hardware. The suggestion to try an alternate port as part of diagnosis is a good one. Trying the bcm5700 driver is also an interesting idea, although IIRC the decision was made some time ago that tg3 was going to be replacing bcm5700.

This may have been covered already, if so please forgive me - when you say that forcing CKO to be off "solved" the issue, do you mean that the corruption stopped entirely, or that it was caught by software checksums and not passed along? Are you able to capture similarly corrupted packets with CKO disabled?

Is the corruption always in the same place(s) in the packet?
there is no rest for the wicked yet the virtuous have no pillows
Diego Liziero
Advisor

Re: Tigon3: checksum handoff produces corrupted packets with correct checksum

Thanks for your replies.
I swapped the IP addresses and cables of the two tg3 Ethernet ports; I'll let you know what happens.

As for the capture process: I start tethereal one minute before the backup process, on the backup server (where the corruption happens) and on all Linux backup clients.
I didn't write a program to check all the packets, because I can usually find the bad ones: the backup agent complains with messages like "invalid data" or "data corruption", or sometimes crashes completely.

This usually happens during the initial dialog of the backup process, and more often during the busy weekend full backups.

At this point it is easy for me to search the packets immediately before the error (or the crash) on both sides of the communication to find the corrupted payload.

Probably with software checksum the corruption is caught by the checksum at the receiving host.

Regards,
Diego.
Diego Liziero
Advisor

Re: Tigon3: checksum handoff produces corrupted packets with correct checksum

Mark Partain> Using ethereal on the system under investigation will show checksum problems due to the hardware stripping the checksum off since it is a hardware performed task by default.

When hardware checksum is enabled (the default with the tg3 driver), the packets captured on the sending host do not yet have the checksum computed, because that is done later by the Ethernet hardware, but their payload is always valid. At the receiving hosts, the captured packets sometimes have a corrupted payload, but always with a correct checksum computed by the hardware of the sending host.
rick jones
Honored Contributor

Re: Tigon3: checksum handoff produces corrupted packets with correct checksum

If packets are still being corrupted with CKO off, and being caught by the software (host) checksum, then you would see checksum errors being incremented in the output of netstat -s -t
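
As a rough sketch of how those counters could be watched across a test run: on Linux, TCP statistics of the kind netstat -s reports are also exposed in /proc/net/snmp, as a line of field names followed by a line of values. The Python below parses that format and reports how the error counters changed between two snapshots. The function names and the sample text are illustrative assumptions, and which fields exist depends on the kernel (InErrs is long-standing, InCsumErrors only appears on newer kernels).

```python
def parse_tcp_counters(snmp_text):
    """Return the Tcp counters from /proc/net/snmp-style text as a dict."""
    rows = [line.split() for line in snmp_text.splitlines()
            if line.startswith("Tcp:")]
    if len(rows) < 2:
        raise ValueError("no Tcp header/value pair found")
    # First Tcp: line holds the field names, the second holds the values.
    names, values = rows[0][1:], rows[1][1:]
    return {name: int(value) for name, value in zip(names, values)}


def error_delta(before, after, fields=("InErrs", "InCsumErrors")):
    """Change in each error counter that is present in both snapshots."""
    return {f: after[f] - before[f]
            for f in fields if f in before and f in after}


# Hypothetical snapshots, shortened to a few fields for illustration.
SAMPLE_BEFORE = "Tcp: InSegs OutSegs RetransSegs InErrs\nTcp: 1000 900 3 0\n"
SAMPLE_AFTER = "Tcp: InSegs OutSegs RetransSegs InErrs\nTcp: 5000 4800 7 2\n"

if __name__ == "__main__":
    delta = error_delta(parse_tcp_counters(SAMPLE_BEFORE),
                        parse_tcp_counters(SAMPLE_AFTER))
    print(delta)
```

On a live receiver the two snapshots would come from reading /proc/net/snmp before and after the transfer. A counter that grows only while CKO is off would suggest the corruption is still happening and is now being caught by the host checksum.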
there is no rest for the wicked yet the virtuous have no pillows
Diego Liziero
Advisor

Re: Tigon3: checksum handoff produces corrupted packets with correct checksum

I found that someone else had similar troubles with tg3.
See the link below (unfortunately no diagnosis).

http://bugs.centos.org/view.php?id=1121

By the way using the tg3 at Gb speed (instead of 100Mb) seems at least to reduce the frequency of the bug.

Does anyone have an idea how I can test the Ethernet port without having to wait for our network software to crash? I was thinking about transferring big checksummed random files, but that way it's hard to find the exact packet where the corruption happens.

Does anyone know of an existing network testing program that checks the data itself, beyond the TCP checksum, so that I don't have to write one myself?

Regards,
Diego.
rick jones
Honored Contributor

Re: Tigon3: checksum handoff produces corrupted packets with correct checksum

I'd offer netperf, but it doesn't actually verify contents. It does though allow you to put a pattern in the data with the -F option. http://www.netperf.org/
there is no rest for the wicked yet the virtuous have no pillows
michael chan_4
Advisor

Re: Tigon3: checksum handoff produces corrupted packets with correct checksum

I think this sounds more like a data corruption problem than a checksum problem.

With software checksum, the checksum is calculated as the socket data is copied into kernel space. With hardware checksum, the checksum is calculated by hardware a lot later when the data is DMA'ed into the NIC buffers. If the data is corrupted sometime in between, the hardware will calculate the "correct" checksum on the corrupted data. That may be why turning off tx checksum offload works.
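
To make the timing argument concrete, here is a minimal Python sketch. It uses the 16-bit one's-complement Internet checksum (RFC 1071) over the payload only, which is a simplification: the real TCP checksum also covers the TCP header and a pseudo-header. The corrupt() helper, the payload contents, and the 64 stale bytes at offset 32 are made up for illustration and are not taken from the captures in this thread.

```python
import struct


def internet_checksum(data):
    """RFC 1071 16-bit one's-complement checksum of a byte string."""
    if len(data) % 2:
        data += b"\x00"          # pad odd-length input with a zero byte
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:           # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def corrupt(payload, offset, stale):
    """Overwrite part of the payload with stale bytes (simulated fault)."""
    return payload[:offset] + stale + payload[offset + len(stale):]


payload = bytes(range(256)) * 2              # 512 bytes of known data
stale = b"\xAA" * 64                         # stand-in for old buffer contents

# Software checksum: computed when the data enters the kernel,
# so corruption that happens afterwards no longer matches it.
early_sum = internet_checksum(payload)
on_wire = corrupt(payload, 32, stale)
software_detects = internet_checksum(on_wire) != early_sum

# Hardware checksum: computed after the corruption, over the bad bytes,
# so the receiver sees a checksum that matches the corrupted payload.
late_sum = internet_checksum(on_wire)
hardware_detects = internet_checksum(on_wire) != late_sum

if __name__ == "__main__":
    print("software checksum detects corruption:", software_detects)
    print("hardware checksum detects corruption:", hardware_detects)
```

In this model the checksum computed before the corruption exposes it, while the one computed afterwards validates the bad data. That matches the symptom reported here: corrupted payloads arriving with correct checksums only when tx checksum offload is on.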

So I suggest copying files with a known data pattern and see if we can spot some trends on how the data is corrupted.
Diego Liziero
Advisor

Re: Tigon3: checksum handoff produces corrupted packets with correct checksum

I couldn't find a program that checks the content of the payload beyond the TCP/IP checksums.
I wrote a really simple echo server and two clients that send 32-bit little-endian integers counting from 0 to INT_MAX.

Unfortunately, this way I have no idea where exactly the corruption happened within the packets (maybe I have to play with TCP_CORK to simulate a SOCK_SEQPACKET).
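
A counter pattern like this at least makes the corruption easy to locate within the byte stream, and the stale values themselves show which earlier part of the stream they came from. The following Python is only a sketch of that analysis, not the test program described above: the function names, the starting value, and the simulated 64-byte overwrite are assumptions for illustration.

```python
import struct

WORD = 4  # bytes per 32-bit little-endian counter value


def make_stream(first, count):
    """Counter pattern: `count` consecutive 32-bit little-endian integers."""
    return struct.pack("<%dI" % count, *range(first, first + count))


def find_corrupt_runs(received, first):
    """Return (byte_offset, byte_length, stale_values) for each bad run."""
    nwords = len(received) // WORD
    words = struct.unpack("<%dI" % nwords, received[:nwords * WORD])
    runs, start, stale = [], None, []
    for index, value in enumerate(words):
        if value != first + index:
            if start is None:            # a new run of wrong words begins
                start, stale = index, []
            stale.append(value)
        elif start is not None:          # back to good data: close the run
            runs.append((start * WORD, len(stale) * WORD, stale))
            start = None
    if start is not None:                # run extends to the end of the data
        runs.append((start * WORD, len(stale) * WORD, stale))
    return runs


if __name__ == "__main__":
    good = make_stream(1000, 256)                 # 1024 bytes of pattern
    old = make_stream(200, 16)                    # 64 bytes of earlier data
    bad = good[:96] + old + good[96 + len(old):]  # simulated corruption
    for offset, length, stale in find_corrupt_runs(bad, 1000):
        print("offset %d, %d bytes, stale values %d..%d"
              % (offset, length, stale[0], stale[-1]))
```

This reports offsets in the stream rather than in a packet, so mapping a run to a position inside a TCP segment still needs a capture or a known segment size. A stale word that happens to equal the expected value would also go unnoticed.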

After one night at 1 Gbit/s, here are two corruption patterns:

--- sent1 2006-10-02 07:37:48.000000000 +0200
+++ received1 2006-10-02 07:38:13.000000000 +0200
@@ -4,8 +4,8 @@
076 154 070 079 077 154 070 079 078 154 070 079 079 154 070 079
080 154 070 079 081 154 070 079 082 154 070 079 083 154 070 079
084 154 070 079 085 154 070 079 086 154 070 079 087 154 070 079
- 088 154 070 079 089 154 070 079 090 154 070 079 091 154 070 079
- 092 154 070 079 093 154 070 079 094 154 070 079 095 154 070 079
- 096 154 070 079 097 154 070 079 098 154 070 079 099 154 070 079
- 100 154 070 079 101 154 070 079 102 154 070 079 103 154 070 079
+ 046 157 230 014 047 157 230 014 026 150 230 014 027 150 230 014
+ 028 150 230 014 029 150 230 014 030 150 230 014 031 150 230 014
+ 032 150 230 014 033 150 230 014 034 150 230 014 035 150 230 014
+ 036 150 230 014 037 150 230 014 038 150 230 014 039 150 230 014
104 154 070 079 105 154 070 079 106 154 070 079 107 154 070 079
108 154 070 079 109 154 070 079 110 154 070 079 111 154 070 079
112 154 070 079 113 154 070 079 114 154 070 079 115 154 070 079
--- sent2 2006-10-02 07:39:43.000000000 +0200
+++ received2 2006-10-02 07:40:06.000000000 +0200
@@ -4,8 +4,8 @@
140 087 007 094 141 087 007 094 142 087 007 094 143 087 007 094
144 087 007 094 145 087 007 094 146 087 007 094 147 087 007 094
148 087 007 094 149 087 007 094 150 087 007 094 151 087 007 094
- 152 087 007 094 153 087 007 094 154 087 007 094 155 087 007 094
- 156 087 007 094 157 087 007 094 158 087 007 094 159 087 007 094
- 160 087 007 094 161 087 007 094 162 087 007 094 163 087 007 094
- 164 087 007 094 165 087 007 094 166 087 007 094 167 087 007 094
+ 174 062 064 007 175 062 064 007 154 055 064 007 155 055 064 007
+ 156 055 064 007 157 055 064 007 158 055 064 007 159 055 064 007
+ 160 055 064 007 161 055 064 007 162 055 064 007 163 055 064 007
+ 164 055 064 007 165 055 064 007 166 055 064 007 167 055 064 007
168 087 007 094 169 087 007 094 170 087 007 094 171 087 007 094
172 087 007 094 173 087 007 094 174 087 007 094 175 087 007 094
176 087 007 094 177 087 007 094 178 087 007 094 179 087 007 094

It seems that at some point old data from other packets is copied to the Ethernet hardware before the hardware checksum is computed.
michael chan_4
Advisor

Re: Tigon3: checksum handoff produces corrupted packets with correct checksum

In both cases, 64 bytes of data are corrupted with some old data. Is your cache line size 64 bytes? This is odd because bugs in the driver or hardware typically do not corrupt just a cache line. Is your system using IOMMUs or things like that?
Diego Liziero
Advisor

Re: Tigon3: checksum handoff produces corrupted packets with correct checksum

The ethernet is the one on the mainboard of a HP Proliant ML 350 G3.

After a week of tests I can say that:
- When I send a payload that fits exactly in one TCP packet, the error never happens.
- When I send a payload that fills one packet plus 32 bits of a second one, 64 bytes of the full packet, starting after its first 32 bytes, are overwritten with data from previous packets.
- When I send a payload that fills one packet plus 192 bits of a second one, 64 bytes of the full packet, starting after its first 192 bytes, are overwritten with data from previous packets.

And so on.

As the error happens only on the mainboard Ethernet port, I'm going to call support to get another mainboard, unless one of you has a better idea.

Thanks for your help.
Regards,
Diego.
michael chan_4
Advisor
Solution

Re: Tigon3: checksum handoff produces corrupted packets with correct checksum

This may be a kernel MMU bug. Have you tried the latest 2.6.18 kernel?
Diego Liziero
Advisor

Re: Tigon3: checksum handoff produces corrupted packets with correct checksum

Ok, just booted with vanilla 2.6.18, compiled with the .config of RHEL5-beta, I'll let you know what happens.

For now there is a strange warning when the cciss module is loaded, but that is a separate, unrelated matter:

HP CISS Driver (v 3.6.10)
ACPI: PCI Interrupt 0000:02:02.0[A] -> GSI 24 (level, low) -> IRQ 193
cciss0: <0x46> at PCI 0000:02:02.0 IRQ 193 using DAC
blocks= 569006235 block_size= 512
heads= 255, sectors= 63, cylinders= 35419

INFO: trying to register non-static key.
the code is fine but needs lockdep annotation.
turning off the locking correctness validator.
Inexact backtrace:
[] show_trace+0xd/0xf
[] dump_stack+0x17/0x19
[] __lock_acquire+0x8a8/0x900
[] lock_acquire+0x5e/0x7e
[] _spin_lock_irq+0x1f/0x2b
[] wait_for_completion+0x2b/0xdd
[] sendcmd_withirq+0x101/0x24e [cciss]
[] cciss_read_capacity+0x32/0xa6 [cciss]
[] cciss_revalidate+0x9a/0x118 [cciss]
[] rescan_partitions+0x15d/0x180
[] do_open+0xae/0x309
[] blkdev_get+0x7c/0x87
[] register_disk+0x114/0x12f
[] add_disk+0x2e/0x3a
[] cciss_init_one+0x498/0x510 [cciss]
[] pci_call_probe+0xd/0x10
[] __pci_device_probe+0x38/0x45
[] pci_device_probe+0x21/0x36
[] driver_probe_device+0x53/0x96
[] __driver_attach+0x72/0x9a
[] bus_for_each_dev+0x48/0x5d
[] driver_attach+0x14/0x16
[] bus_add_driver+0x68/0x98
[] driver_register+0x78/0x7d
[] __pci_register_driver+0x4f/0x5f
[] cciss_init+0x1d/0x1f [cciss]
[] sys_init_module+0x191/0x198
[] sysenter_past_esp+0x56/0x8d
blocks= 569006235 block_size= 512
heads= 255, sectors= 63, cylinders= 35419

cciss/c0d0: p1 p2 p3 p4 < p5 p6 p7 >
rick jones
Honored Contributor

Re: Tigon3: checksum handoff produces corrupted packets with correct checksum

The bit with cciss should be a separate thread and/or an official call to HP support :)
there is no rest for the wicked yet the virtuous have no pillows
Diego Liziero
Advisor

Re: Tigon3: checksum handoff produces corrupted packets with correct checksum

I'm still waiting for the problem to occur with kernel 2.6.18.
Actually, the real test only began this Monday, because I had left software checksumming enabled by mistake.
Next Monday, after the server has been used for all the weekend full backups, I'll tell you whether this could be a real Linux driver/kernel issue.

Regards,
Diego.
Diego Liziero
Advisor

Re: Tigon3: checksum handoff produces corrupted packets with correct checksum

Let's say that I'm unable to reproduce the corruption with kernel 2.6.18 (unpatched).

This means that the latest kernels of both RHEL 3 and RHEL 4 are unable to properly support the Ethernet port of HP Proliant ML 350 G3 servers.

Is there any plan to fix this issue in those kernels?

Regards,
Diego.