Transmit Load Balancing fails on Bl20p G3

Anders_35
Regular Advisor

Not sure whether this is a server or Windows question, but ...

We have an odd problem with our BL20s. If we configure teaming out of the box, they seem to work all right, but on closer inspection, lots of network traffic just disappears.

After some testing, we see that when using transmit load balancing with the method "TCP Connection" or "Destination MAC", we lose a lot of outbound TCP traffic. It just doesn't seem to leave the server, even though no errors are reported anywhere.

Anyone seen this before?

HP just says "latest firmware, latest drivers", but looking through the release notes, there is no mention of any such error being fixed. We have had way too much trouble and downtime with unnecessary upgrades for me to be willing to upgrade just like that.
15 REPLIES
Connery
Trusted Contributor

Re: Transmit Load Balancing fails on Bl20p G3

Anders,
Can you provide more information about how you know that "we lose a lot of outbound TCP traffic. It just doesn't seem to leave the server"?

That will help diagnose the issue.

Have you verified that all NIC ports in the team are connected to switch ports in the same VLAN? Are both blade switches on the same broadcast domain/subnet?

We have a Teaming Whitepaper for reference also:
ftp://ftp.compaq.com/pub/products/servers/networking/TeamingWP.pdf

Best regards,
-sean
Anders_35
Regular Advisor

Re: Transmit Load Balancing fails on Bl20p G3

The servers are set up with teaming on NIC 1 and 3, i.e. one NIC on each GbE2 switch.
There is only one link out of the blade enclosure, on switch A, so all traffic passes that switch.

When sniffing the network on switch A, we should see the traffic coming through on the local server port, or on the port 17/18 interconnect.

But, when it doesn't work, we do not see any traffic from the server on these ports.

If I switch to NFT, I can switch between both NICs, and it works fine, i.e. we have a working connection to the network through both GbE2s.

My first tests indicate that the problem varies with each TLB method. For instance, with "TCP Connection" it works on and off, in uneven periods of 30 to 50 minutes.
Dest. MAC: Doesn't work most of the time.
Dest. IP: Seems to work.
Round-robin: Seems to work.

But after upgrading the firmware and installing ProLiant Support Pack 7.40B (was 7.20), Dest. IP fails occasionally, too.
I'm now going to try PSP 7.51.

The test we perform is simple:
We use an SMTP client to send a small email.
When it doesn't work, the client doesn't get a TCP connection. (And we never see packets on the network, not a single SYN.)
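
The same check can be scripted without a mail client, since the symptom is a TCP handshake that never completes. The following is a minimal sketch, assuming Python is available on the server; the function names `can_connect` and `probe_loop` are made up for illustration and are not part of any HP or Microsoft tool.

```python
import socket
import time


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP handshake to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def probe_loop(host: str, port: int, interval: float, count: int) -> list:
    """Probe repeatedly and record (timestamp, result) pairs.

    Logging the time of each attempt makes on/off windows visible,
    such as the 30 to 50 minute periods described above.
    """
    results = []
    for _ in range(count):
        results.append((time.time(), can_connect(host, port)))
        time.sleep(interval)
    return results
```

Calling `probe_loop("mail.example.com", 25, interval=60.0, count=120)` would probe once a minute for two hours; the host name is a placeholder for the SMTP server on the far side of the router.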

Since the web servers running on these servers are all OK, I believe only outbound connections (initiated on the server) are affected.
Carsten Reinhard
Frequent Advisor

Re: Transmit Load Balancing fails on Bl20p G3

Anders,

I read an HP advisory this morning:

Description: Advisory: ProLiant Servers May Become Unresponsive in Configurations Running HP Network Teaming Software and Running Microsoft Windows Server 2003 SP1 with the Scalable Networking Pack (SNP) and TCP/IP Offload Engine (TOE) (c00747687)

http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?objectID=c00747687&jumpid=em_EL_Alerts/US/Sep06_ALL/Alerts

Maybe you have to install the MS patch mentioned there.


Greetings Carsten
Anders_35
Regular Advisor

Re: Transmit Load Balancing fails on Bl20p G3

Thanks, I'll have a look at that.
Anders_35
Regular Advisor

Re: Transmit Load Balancing fails on Bl20p G3

Some additional info:

I've now updated every firmware and driver I can, but still no change.

Connery
Trusted Contributor

Re: Transmit Load Balancing fails on Bl20p G3

Troubleshooting steps:
1. Disable NIC 1 in the Team via the Microsoft UI (Network Connections). Run tests again.

2. Enable NIC 1 and disable NIC 2. Run tests again.

What's the behavior change? If one NIC works fine and another doesn't, you need to look at the switch configs for that port.

Also, are you running version 8.37 of the Teaming driver?
Anders_35
Regular Advisor

Re: Transmit Load Balancing fails on Bl20p G3

Thanks Sean, but the switch configs are identical, at least within each enclosure; I checked.
The problem also occurs on at least three different blade enclosures (I didn't test the rest we have),
on two different switch firmware versions, and two different switch configs (the switches in one enclosure are different from the switches in the other two).

Also, when using NFT, both NICs work like a charm, so I would think a failure on one NIC but not the other would be more indicative of an error in teaming than in the switch.
But I am testing it, just to be sure.

Tomorrow I'm going to do some network sniffing on all these six switches, just to confirm that I'm seeing the same everywhere.

>Also, are you running version 8.37 of the
>Teaming driver?

Yes, I am now, after upgrading to support pack 7.51. Still no luck, though...
Anders_35
Regular Advisor

Re: Transmit Load Balancing fails on Bl20p G3

I ran the suggested test, with one NIC disabled and one enabled. It works just fine in both configurations.

As soon as I switch back to using both NICs it's failing again...
Connery
Trusted Contributor

Re: Transmit Load Balancing fails on Bl20p G3

What are you using to test connectivity? PING or something else?

Some apps/devices (e.g. IP interfaces on some printers) don't like receiving data frames from the non-Primary NIC in the team because the source MAC address doesn't match the MAC address in their ARP cache for the Team's IP address. A properly implemented TCP/IP stack shouldn't care, but it isn't always implemented properly on all devices.

That being said, make sure you are testing connectivity with a Windows system using a Windows utility (like PING). I know they work.

Another alternative is to use the Dual Channel team type. It does require an Intelligent Networking Pack license, though.
http://h18004.www1.hp.com/products/servers/proliantessentials/inp/index.html

Dual Channel uses separate ARP replies for each channel. Therefore, the other device always sees a data frame's source address that matches the ARP cache entry it received from the team.

If TLB is still causing you a problem and you don't want to use Dual Channel, I'd recommend opening a case with our support team. The team can be reached by calling 800-354-9000 and asking for the call to be sent to the USS_SC NETWORK queue.

Best regards,
-sean
Anders_35
Regular Advisor

Re: Transmit Load Balancing fails on Bl20p G3

Finally, some progress here...

First, I'm happy (well.. not really) to report that I was wrong. The packets are sent out on the network. Our first network scans didn't pick them up. I suspect that the monitor port was tagged to the wrong VLAN; at least, when I deleted it and set it up again, we saw the traffic we expected.

This means that it is not the teaming after all.

It turns out that only traffic that goes across one specific router fails.
(Actually, it's a Checkpoint firewall, not just a router.)
It returns traffic to the same MAC address that it came from, and not to the primary NIC's MAC address.
And that, of course, isn't going to work.

I'm still puzzled as to why the "TCP Connection" method seems to fail in (almost) regular intervals, but I guess that could be a question of how source port numbers vary.
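
The guess about source ports can be made concrete with a toy model. This is only a sketch and not HP's documented algorithm: the method names, the `Flow` type and the modulo-based `pick_nic` function are hypothetical. It shows just one thing: a key built from the destination MAC or destination IP is constant for a given destination, so that destination either always or never lands on the primary NIC (index 0), while a key that includes the source port changes from one connection to the next.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    dst_mac: str   # next-hop MAC (the router or firewall for off-subnet traffic)
    dst_ip: str
    src_port: int
    dst_port: int


def pick_nic(method: str, flow: Flow, nic_count: int, counter: int = 0) -> int:
    """Hypothetical selection: reduce a per-method key modulo the NIC count."""
    if method == "dest_mac":
        key = int(flow.dst_mac.replace(":", ""), 16)
    elif method == "dest_ip":
        key = int(flow.dst_ip.split(".")[-1])
    elif method == "tcp_connection":
        key = flow.src_port ^ flow.dst_port
    elif method == "round_robin":
        key = counter  # per-frame counter, independent of the flow
    else:
        raise ValueError(f"unknown method: {method}")
    return key % nic_count


if __name__ == "__main__":
    # Two SMTP connections to the same server, differing only in source port.
    first = Flow("00:11:22:33:44:55", "192.0.2.10", 1025, 25)
    second = Flow("00:11:22:33:44:55", "192.0.2.10", 1026, 25)
    for method in ("dest_mac", "dest_ip", "tcp_connection"):
        print(method, pick_nic(method, first, 2), pick_nic(method, second, 2))
```

In this model, a peer that replies to the frame's source MAC only works when NIC 0 is chosen, so the outcome for "TCP Connection" depends on which source port the client happens to get.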

Thanks for all the help and suggestions!
Cheers,
Anders :)
Van de Vyvere Carlo
Occasional Advisor

Re: Transmit Load Balancing fails on Bl20p G3

Hello,

You said you solved the problem on your firewall. I've got the same troubles, and maybe my central switch (Extreme Black Diamond) is having the same effect. But how did you solve it?

Thx
Anders_35
Regular Advisor

Re: Transmit Load Balancing fails on Bl20p G3

No, we never really "solved" it.
Since we don't really need TLB, we just fell back to using NFT instead.

Our Checkpoint vendor has confirmed that FW-1 doesn't use an ARP lookup for the replies; instead, it remembers connection information.

Whether this is by accident or design I don't know.

Anders :)
Anders_35
Regular Advisor

Re: Transmit Load Balancing fails on Bl20p G3

By the way.. make sure that your switch will accept receiving the same MAC on several ports.
Connery
Trusted Contributor

Re: Transmit Load Balancing fails on Bl20p G3

Just for clarification:

It is a violation of IEEE standards for a switch to receive the same MAC address on multiple ports unless the switch is specifically configured for Port Trunking/Port Channeling/Port Bonding/802.3ad. All of those terms describe a special feature on some switches that allows multiple ports to be grouped together to operate as a single virtual switch port (thus accepting the same source MAC address on any port in the group).

NIC Teaming does not require the switch to receive the same MAC address on multiple ports unless you are using an 802.3ad Team type (SLB, Dual Channel, 802.3ad dynamic, etc.). If you are using any of these team types, the switch must support 802.3ad and must be configured properly.

Other team types, NFT and TLB, transmit frames using a different source MAC for each NIC in the Team.
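
To illustrate why the source MAC per port matters to the switch, here is a minimal sketch of a learning switch's MAC table. It is a simplified model only: the `LearningSwitch` class is hypothetical, and real switches differ in how they detect and react to a MAC address moving between ports.

```python
class LearningSwitch:
    """Toy MAC learning table: remembers the last port each source MAC was seen on."""

    def __init__(self):
        self.mac_table = {}   # source MAC -> port number
        self.moves = []       # (mac, old_port, new_port) for every relearn

    def receive(self, src_mac: str, port: int) -> None:
        old_port = self.mac_table.get(src_mac)
        if old_port is not None and old_port != port:
            # The same MAC showed up on a different port: the entry "flaps".
            self.moves.append((src_mac, old_port, port))
        self.mac_table[src_mac] = port


if __name__ == "__main__":
    # TLB/NFT style: each NIC transmits with its own source MAC.
    distinct = LearningSwitch()
    for _ in range(3):
        distinct.receive("aa:aa:aa:aa:aa:01", port=1)
        distinct.receive("aa:aa:aa:aa:aa:02", port=2)
    print("moves with one MAC per NIC:", len(distinct.moves))

    # Same MAC on two ports that are not grouped into a trunk.
    shared = LearningSwitch()
    for _ in range(3):
        shared.receive("aa:aa:aa:aa:aa:01", port=1)
        shared.receive("aa:aa:aa:aa:aa:01", port=2)
    print("moves with a shared MAC:", len(shared.moves))
```

With one MAC per NIC, the table stays stable. With a shared MAC on ungrouped ports, the entry is relearned on every alternation, which is the situation that port trunking is meant to handle.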

Please see the Teaming Whitepaper for additional information:
ftp://ftp.compaq.com/pub/products/servers/networking/TeamingWP.pdf

Fundamentally, the issue of this thread is this - some devices do not properly implement ARP. The device should always use ARP data (from ARP Request, ARP Reply, or static ARP entry) to determine the MAC address to use for any IP address the device is talking to. Some devices may use the source MAC address from the received frame as the MAC address to respond to. This behavior will not work with Teaming mode TLB (or Automatic without 802.3ad support on the switch) because non-Primary ports in a TLB team transmit with a MAC address different than the MAC address that is provided in ARP replies. This means the device responds directly to the non-Primary NIC, which drops the frame, instead of to the Primary NIC (used for all receives in a TLB team).
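
The difference between the two behaviors can be shown with a small simulation. This is a sketch under simplifying assumptions, not a model of any specific product: `TlbTeam`, `ArpPeer` and `SourceMacPeer` are hypothetical names, and the team is reduced to "every NIC transmits with its own MAC, only the Primary NIC receives".

```python
from dataclasses import dataclass


@dataclass
class TlbTeam:
    """Toy TLB team; macs[0] belongs to the Primary NIC."""
    ip: str
    macs: list

    def arp_reply(self):
        # ARP replies for the team's IP always advertise the Primary MAC.
        return self.ip, self.macs[0]

    def transmit(self, nic_index: int) -> dict:
        # Each NIC sends with its own source MAC.
        return {"src_ip": self.ip, "src_mac": self.macs[nic_index]}

    def accepts(self, dst_mac: str) -> bool:
        # Only the Primary NIC receives for the team.
        return dst_mac == self.macs[0]


class ArpPeer:
    """Peer that answers to the MAC learned through ARP."""

    def __init__(self):
        self.arp_cache = {}

    def learn(self, ip: str, mac: str) -> None:
        self.arp_cache[ip] = mac

    def reply_mac(self, frame: dict) -> str:
        return self.arp_cache[frame["src_ip"]]


class SourceMacPeer:
    """Peer that answers to whatever source MAC the frame arrived with."""

    def reply_mac(self, frame: dict) -> str:
        return frame["src_mac"]


def reply_delivered(team: TlbTeam, peer, nic_index: int) -> bool:
    """True if the peer's reply to a frame sent from nic_index reaches the team."""
    frame = team.transmit(nic_index)
    return team.accepts(peer.reply_mac(frame))
```

With a two-NIC team, `reply_delivered` is true for both NICs when the peer is an `ArpPeer` that has learned the team's ARP reply, but only for NIC 0 when the peer is a `SourceMacPeer`.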

Solutions for this problem are to use NFT, SLB, Dual Channel, or any 802.3ad team type. Another solution is to ask the manufacturer of the device to implement full ARP functionality in the device.

Best regards,
-sean