09-13-2006 12:35 AM
In our HP-based server farm we had some trouble with network-based software.
After months of debugging, we found the cause:
a correct TCP/IP packet (still correct when captured on the sending host, before the hardware checksum is applied) sometimes reaches the destination host with a correct checksum but a corrupted payload.
Forcing software checksumming seems to solve the problem, but it slows down the network transfer rate.
It looks like a hardware/firmware/driver issue.
Any suggestions?
Regards,
Diego.
Sending system: HP ProLiant ML 350 G3
ethernet: eth0: Tigon3 [partno(N/A) rev 1002 PHY(5703)] (PCI:33MHz:32-bit) 10/100/1000BaseT Ethernet
OS: Linux RHEL 3.x and 4.x (same trouble with both the 2.4 and the 2.6 kernel)
09-13-2006 01:13 PM
Re: Tigon3: checksum handoff produces corrupted packets with correct checksum
I will ask which version of the tg3 driver you are running in each case.
Oh, one other question - is this with the "standard" size or "jumbo" frames?
09-13-2006 07:14 PM
Driver version: tg3.c:v3.27-rh (May 5, 2005) for kernel 2.6
tg3.c:v1.5 (March 21, 2003) for kernel 2.4 (both are Red Hat kernels)
MTU is always 1500 (no jumbo frames)
Here is an example (od -c and a diff -u) of a corrupted payload received with correct checksum (payload size is 556 bytes).
--- payload-original-od-c 2006-09-14 08:30:14.000000000 +0200
+++ payload-received-od-c 2006-09-14 08:30:27.000000000 +0200
@@ -1,9 +1,9 @@
0000000 200 \0 002 ( \0 \0 \0 1 \0 \0 \0 \0 \0 \0 \0 020
0000020 \0 \0 \0 \v \0 \0 \0 026 \0 \0 \0 \t \0 \0 \0 001
-0000040 177 377 377 377 \0 \0 \0 \0 \0 \0 \0 001 \0 \0 \0 \0
-0000060 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 004 \0 \0 \0 004
-0000100 \0 \0 \0 001 \0 \0 002 023 \0 \0 \0 \0 \0 \0 \0 001
-0000120 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0
+0000040 \0 \0 \0 \0 \0 \0 \0 001 \0 \0 \0 001 \0 \0 \0 024
+0000060 \0 \0 \0 \a \0 \0 \0 \0 \0 \0 \0 001 \0 \0 \0 \0
+0000100 \0 \0 \0 002 001 \b 260 001 \b 020 \0 \0 \0 \0
+0000120 \0 \0 \0 001 \0 \0 \0 \v 001 0 \b 020 001 0 017 240
0000140 \0 \0 \0 001 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 006
0000160 \0 \0 001 370 \0 \0 \0 004 \0 \0 \0 001 \0 \0 \0 001
0000200 \0 \0 \0 032 g a l a t e a . c o m u
Regards,
Diego.
09-14-2006 04:32 AM
tg3 has revved a bit since those versions. If the corruption still happens after getting on to the latest versions I'd suggest calling it in as a hardware problem.
09-14-2006 05:58 PM
As stated before, disabling rx hardware checksumming on the receivers can't help, because the checksum really is correct despite the corrupted payload. With hardware tx checksumming disabled on the sending host we are not able to reproduce the issue, but the transfer rate decreases.
> Interesting that it is more than just a single-bit error or the like [..]
Yes, I thought of something like a buffer overflow. I'll have a look at the nearby captured packets to see if I can find the corrupted pattern.
> tg3 has revved a bit since those versions
I'll try the latest from psp-7.60.
09-15-2006 04:36 AM
Please download the script nic.sh to the linux machine and do the following at the command prompt logged in as root:
# dos2unix nic.sh
# sh nic.sh > /tmp/nic.log
Attach the nic.log so I can investigate the network settings and information.
Mark
09-17-2006 10:13 PM
Here it is.
rick jones> tg3 has revved a bit since those versions. If the corruption still happens after getting on to the latest versions I'd suggest calling it in as a hardware problem.
Tested with the latest tg3.c:v3.58c (May 19, 2006), and the corruption before the hardware checksum happened again.
Are you sure it can't be a driver/firmware issue?
I mean, it happens under heavy load (the server is the tape backup server, and the corruption happens while it sends packets to the clients during the backup process).
Regards,
Diego.
09-18-2006 01:57 AM
Is it possible to either swap eth0 with eth1 or try one of the four ports on the sundance card?
I did not see anything that surprised me in the log file.
Where are you capturing the information about the checksum issue?
Using ethereal on the system under investigation will show checksum problems, because checksumming is performed by the hardware by default and locally captured packets therefore do not carry the final checksum. To avoid this, a sniffer must be put in line for the capture. A quick way to see whether checksum offload is enabled is "ethtool -k eth0".
If swapping the ports shows the problem still present on eth0, then I would have to agree with the earlier statement that it is a hardware issue and that the motherboard Ethernet port is faulty.
09-18-2006 02:16 AM
http://h18000.www1.hp.com/support/files/server/us/download/24497.html
09-18-2006 04:43 AM
This may have been covered already, if so please forgive me - when you say that forcing CKO to be off "solved" the issue, do you mean that the corruption stopped entirely, or that it was caught by software checksums and not passed along? Are you able to capture similarly corrupted packets with CKO disabled?
Is the corruption always in the same place(s) in the packet?
09-18-2006 06:20 PM
I swapped the IP addresses and cables of the two tg3 Ethernet ports; I'll let you know what happens.
As for the capturing process, I start tethereal one minute before the backup process, both on the backup server (where the corruption happens) and on all Linux backup clients.
I didn't write a program to check all the packets, because I can usually find the bad ones: the backup agent complains with messages like "invalid data" or "data corruption", or sometimes it crashes completely.
This usually happens during the initial dialog of the backup process, and more often during the busy weekend full backups.
At that point it is easy for me to search the packets immediately before the error (or the crash) on both sides of the communication to find the corrupted payload.
With software checksumming, the corruption is probably caught by the checksum at the receiving host.
Regards,
Diego.
09-18-2006 06:28 PM
When hardware checksumming is enabled (the default with the tg3 driver), the packets captured on the sending host do not yet have the checksum computed, since the Ethernet hardware will do that, but their payload is always valid. On the receiving hosts, the captured packets sometimes have a corrupted payload, but always with a correct checksum computed by the hardware of the sending host.
09-25-2006 06:27 PM
See the link below (unfortunately no diagnosis).
http://bugs.centos.org/view.php?id=1121
By the way, using the tg3 at gigabit speed (instead of 100 Mb) seems at least to reduce the frequency of the bug.
Does anyone have an idea how I can test the Ethernet port without having to wait for a crash of our network software? I was thinking about transferring big checksummed random files, but that way it is hard to find the exact packet where the corruption happens.
Does anyone know whether there are already network testing programs that check the data in addition to the TCP checksum, so that I don't have to write one myself?
Regards,
Diego.
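One way to check the data independently of the TCP checksum, sketched here rather than taken from any existing tool, is to send the test data as small records that each carry their own sequence number and CRC32. A failure then points at a specific record instead of at a whole file. The header layout and the names `HEADER`, `frame` and `check_stream` are assumptions made for this illustration, and the corruption is simulated in memory instead of being sent over a socket.

```python
import struct
import zlib

# Hypothetical record header: sequence number, payload length, CRC32 of payload.
HEADER = struct.Struct("<III")


def frame(seq: int, payload: bytes) -> bytes:
    """Prefix a payload with its sequence number, length and CRC32."""
    return HEADER.pack(seq, len(payload), zlib.crc32(payload)) + payload


def check_stream(stream: bytes) -> list:
    """Return the sequence numbers of records whose payload fails its CRC32."""
    bad = []
    pos = 0
    while pos + HEADER.size <= len(stream):
        seq, length, crc = HEADER.unpack_from(stream, pos)
        start = pos + HEADER.size
        payload = stream[start:start + length]
        if len(payload) < length:
            break  # truncated final record
        if zlib.crc32(payload) != crc:
            bad.append(seq)
        pos = start + length
    return bad


# Simulate a sender, then damage 64 bytes inside record 3 "in transit".
records = [frame(i, bytes([i]) * 512) for i in range(8)]
stream = bytearray(b"".join(records))
offset = 3 * (HEADER.size + 512) + HEADER.size + 100
for i in range(offset, offset + 64):
    stream[i] ^= 0xFF

bad_records = check_stream(bytes(stream))
print(bad_records)
```

Note that this sketch only localizes damage inside a payload. If the corruption hits a header, the length field can no longer be trusted and the parser loses record alignment, so a real test would want fixed-size records or a resynchronization marker.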
09-29-2006 05:17 AM
With software checksum, the checksum is calculated as the socket data is copied into kernel space. With hardware checksum, the checksum is calculated by hardware a lot later when the data is DMA'ed into the NIC buffers. If the data is corrupted sometime in between, the hardware will calculate the "correct" checksum on the corrupted data. That may be why turning off tx checksum offload works.
So I suggest copying files with a known data pattern and seeing whether we can spot some trends in how the data is corrupted.
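The timing argument above can be illustrated with a small sketch. It is a simplification: it applies the 16-bit ones'-complement Internet checksum to the payload only, whereas the real TCP checksum also covers a pseudo-header and the TCP header. The function names and the simulated corruption are assumptions for illustration, not driver code.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones'-complement checksum over 16-bit big-endian words."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def receiver_accepts(payload: bytes, checksum: int) -> bool:
    """The receiver recomputes the checksum over what actually arrived."""
    return internet_checksum(payload) == checksum


original = bytes(range(64))

# Software path: checksum taken while the data is copied into the kernel.
sw_checksum = internet_checksum(original)

# Simulated corruption somewhere between that copy and the wire.
corrupted = bytearray(original)
corrupted[16:32] = b"\xAA" * 16
corrupted = bytes(corrupted)

# Hardware path: the NIC checksums whatever bytes reach it by DMA.
hw_checksum = internet_checksum(corrupted)

sw_detected = not receiver_accepts(corrupted, sw_checksum)
hw_detected = not receiver_accepts(corrupted, hw_checksum)
print("caught with software checksum:", sw_detected)
print("caught with hardware checksum:", hw_detected)
```

In this model the software checksum exposes the damage at the receiver, while the hardware checksum matches the damaged bytes and lets them through, which is consistent with the behaviour reported in this thread.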
10-01-2006 06:08 PM
I wrote a really simple echo server and two clients that send 32-bit little-endian integers counting from 0 to INT_MAX.
Unfortunately, this way I have no idea where exactly in the packets the corruption happened (maybe I have to play with TCP_CORK to simulate a SOCK_SEQPACKET).
After one night at 1 Gbit/s, here are two corruption patterns:
--- sent1 2006-10-02 07:37:48.000000000 +0200
+++ received1 2006-10-02 07:38:13.000000000 +0200
@@ -4,8 +4,8 @@
076 154 070 079 077 154 070 079 078 154 070 079 079 154 070 079
080 154 070 079 081 154 070 079 082 154 070 079 083 154 070 079
084 154 070 079 085 154 070 079 086 154 070 079 087 154 070 079
- 088 154 070 079 089 154 070 079 090 154 070 079 091 154 070 079
- 092 154 070 079 093 154 070 079 094 154 070 079 095 154 070 079
- 096 154 070 079 097 154 070 079 098 154 070 079 099 154 070 079
- 100 154 070 079 101 154 070 079 102 154 070 079 103 154 070 079
+ 046 157 230 014 047 157 230 014 026 150 230 014 027 150 230 014
+ 028 150 230 014 029 150 230 014 030 150 230 014 031 150 230 014
+ 032 150 230 014 033 150 230 014 034 150 230 014 035 150 230 014
+ 036 150 230 014 037 150 230 014 038 150 230 014 039 150 230 014
104 154 070 079 105 154 070 079 106 154 070 079 107 154 070 079
108 154 070 079 109 154 070 079 110 154 070 079 111 154 070 079
112 154 070 079 113 154 070 079 114 154 070 079 115 154 070 079
--- sent2 2006-10-02 07:39:43.000000000 +0200
+++ received2 2006-10-02 07:40:06.000000000 +0200
@@ -4,8 +4,8 @@
140 087 007 094 141 087 007 094 142 087 007 094 143 087 007 094
144 087 007 094 145 087 007 094 146 087 007 094 147 087 007 094
148 087 007 094 149 087 007 094 150 087 007 094 151 087 007 094
- 152 087 007 094 153 087 007 094 154 087 007 094 155 087 007 094
- 156 087 007 094 157 087 007 094 158 087 007 094 159 087 007 094
- 160 087 007 094 161 087 007 094 162 087 007 094 163 087 007 094
- 164 087 007 094 165 087 007 094 166 087 007 094 167 087 007 094
+ 174 062 064 007 175 062 064 007 154 055 064 007 155 055 064 007
+ 156 055 064 007 157 055 064 007 158 055 064 007 159 055 064 007
+ 160 055 064 007 161 055 064 007 162 055 064 007 163 055 064 007
+ 164 055 064 007 165 055 064 007 166 055 064 007 167 055 064 007
168 087 007 094 169 087 007 094 170 087 007 094 171 087 007 094
172 087 007 094 173 087 007 094 174 087 007 094 175 087 007 094
176 087 007 094 177 087 007 094 178 087 007 094 179 087 007 094
It seems that at some point old data from other packets is copied to the Ethernet hardware before the hardware checksum is computed.
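A counter stream like the one used in this test makes every word self-describing, so the damaged byte ranges can be located mechanically instead of by reading dumps. The following is a minimal sketch under that assumption; `counter_pattern` and `find_corrupt_runs` are hypothetical helper names, not the test program used here, and the stale data is injected artificially.

```python
import struct


def counter_pattern(start: int, count: int) -> bytes:
    """Consecutive 32-bit little-endian integers, as in the test above."""
    return struct.pack(f"<{count}I", *range(start, start + count))


def find_corrupt_runs(received: bytes, start: int = 0) -> list:
    """Return (byte_offset, length) for each run of words that is not the expected counter."""
    runs = []
    run_start = None
    nwords = len(received) // 4
    for idx in range(nwords):
        (value,) = struct.unpack_from("<I", received, idx * 4)
        bad = value != start + idx
        if bad and run_start is None:
            run_start = idx * 4
        elif not bad and run_start is not None:
            runs.append((run_start, idx * 4 - run_start))
            run_start = None
    if run_start is not None:
        runs.append((run_start, nwords * 4 - run_start))
    return runs


# 256 bytes of counters, with 64 bytes overwritten by "old" counter values.
sent = counter_pattern(1000, 64)
damaged = bytearray(sent)
damaged[48:112] = counter_pattern(5, 16)

runs = find_corrupt_runs(bytes(damaged), start=1000)
print(runs)
```

Because the stale words are themselves valid counters, their values also indicate which earlier part of the stream the wrong data came from.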
10-08-2006 06:20 PM
After a week of tests I can say that:
- When I send a payload that fits exactly one TCP packet, the error never happens.
- When I send a payload that fills one packet plus 32 bit of a second one, the error affects 64 bytes of the full packet, starting after 32 bytes, which are replaced with data from previous packets.
- When I send a payload that fills one packet plus 192 bit of a second one, the error affects 64 bytes of the full packet, starting after 192 bytes, which are replaced with data from previous packets.
And so on.
As the error happens only on the mainboard Ethernet port, I'm going to call support to get another mainboard, unless one of you has a better idea.
Thanks for your help.
Regards,
Diego.
10-11-2006 04:10 AM
For now there is a strange warning when the cciss module is loaded, but that is a separate, unrelated issue:
HP CISS Driver (v 3.6.10)
ACPI: PCI Interrupt 0000:02:02.0[A] -> GSI 24 (level, low) -> IRQ 193
cciss0: <0x46> at PCI 0000:02:02.0 IRQ 193 using DAC
blocks= 569006235 block_size= 512
heads= 255, sectors= 63, cylinders= 35419
INFO: trying to register non-static key.
the code is fine but needs lockdep annotation.
turning off the locking correctness validator.
Inexact backtrace:
[...]
blocks= 569006235 block_size= 512
heads= 255, sectors= 63, cylinders= 35419
cciss/c0d0: p1 p2 p3 p4 < p5 p6 p7 >
10-19-2006 09:32 PM
Actually, the real test only began this Monday, because I had left software checksumming enabled by mistake.
I'll tell you next Monday, after the server has run all the weekend full backups, whether this could be a real Linux driver/kernel issue.
Regards,
Diego.
10-29-2006 05:52 PM
This means that the latest kernels of both RHEL 3 and RHEL 4 are unable to properly support the Ethernet of HP ProLiant ML 350 G3 servers.
Is there any plan to fix this issue in those kernels?
Regards,
Diego.