Inconsistent TCP connection rejection.

TMabbs
Occasional Visitor

I wonder whether anyone might be able to shed some light on some peculiar behavior we are seeing on some clients' systems?

 

Our software performs Kerberos authentication: it tries to connect to a KDC (TCP port 88); if that fails, it goes to the next listed KDC, and so on until one has replied or all KDCs have been exhausted.  This is fairly straightforward.
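
For illustration, the failover just described can be sketched roughly as follows. This is a minimal Python sketch and not the actual code; the function name first_reachable_kdc, the (host, port) list format and the 5-second timeout are all assumptions made for the example.

```python
import socket

def first_reachable_kdc(kdcs, timeout=5.0):
    """Return the first (host, port) that accepts a TCP connection, else None.

    kdcs is a list of (host, port) pairs, e.g. [("kdc1.example.com", 88)].
    The hosts, and the explicit timeout, are hypothetical.
    """
    for address in kdcs:
        try:
            # A refused connection (RST) raises almost immediately; a silent
            # drop only fails once `timeout` seconds have passed.
            with socket.create_connection(address, timeout=timeout):
                return address
        except OSError:
            continue  # this KDC did not answer, try the next one
    return None
```

With an explicit per-attempt timeout like this, a lost RST would cost at most the timeout value rather than the stack's full connect timeout.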

 

The problem arises where the client has retired some of their KDCs and the configuration listing which KDCs to contact has not yet been updated.  The retired KDCs simply have their Kerberos service disabled, so they are still physically present on the network and have appropriate entries in DNS, but there is no service running on port 88.

 

So I would expect our software to attempt to connect and receive an immediate RST response to the initial SYN packet.  The connection should be rejected immediately, and our code should move on to trying the next KDC.

 

Indeed, that is what happens ... almost all of the time.  Occasionally, however, a connection attempt takes between about 50 and 80 seconds to be rejected (typically at the long end of that range; a few have taken around 50 seconds).  It's as though the TCP connect timeout were being tripped on the initial SYN packet rather than a RST packet being received.  Since at least 2 SYN packets will have been sent during that time, either both SYN packets or both RST responses must have been lost.
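
One way to tell these two outcomes apart without a packet trace would be to time each connection attempt and record how it ended. The following is a minimal Python sketch under that assumption; the function name probe, the outcome labels and the 120-second ceiling are hypothetical and chosen only for the example.

```python
import errno
import socket
import time

def probe(host, port=88, timeout=120.0):
    """Attempt one TCP connect and return (outcome, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            outcome = "connected"
    except ConnectionRefusedError:
        outcome = "refused"  # the peer answered the SYN with a RST
    except OSError as exc:
        if exc.errno is None:
            outcome = "local-timeout"  # our own `timeout` expired first
        else:
            # Error reported by the TCP stack, e.g. ETIMEDOUT or EHOSTUNREACH
            outcome = errno.errorcode.get(exc.errno, str(exc.errno))
    return outcome, time.monotonic() - start
```

Run repeatedly against a retired KDC, a "refused" result within milliseconds would be the expected case, while anything that takes tens of seconds would be one of the slow rejections, and its outcome label would show whether a RST eventually arrived.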

 

What makes the matter harder to diagnose is that when we use "nettl" to trace the network traffic, the problem doesn't occur!  OK, I can't say that absolutely definitively, but in tests the problem would typically have occurred at least once within a couple of minutes of repetitive trying; with network tracing running, the problem didn't occur at all in about 20 minutes of repetitive trying.

However, that could be a complete red herring: our first tests were at about 05:30 local system time and the subsequent ones at about 09:00, so the time of day could also have had an effect (although, as these are development rather than production servers accessed from around the world, that probably shouldn't have made much of a difference ...).

 

So if anyone could shed some light on this inconsistent behavior, or suggest any means by which we can sensibly investigate it further, that would be greatly appreciated.

 

Many thanks folks!

 

Tris Mabbs.