Operating System - Tru64 Unix

Christof Schoeman
Frequent Advisor

No ACK being sent.

Hi, hope you can help me.

On this Tru64 V5.1B-3 (PK5) system, there is one particular connection which gets dropped after running for a while.

In the trace, one can see the packets arrive, but no ACK is being sent.

Why could that be? How/where do I look further?

Attached is an extract of the trace (IP addresses hidden to protect the innocent:-)

13 REPLIES
Ivan Ferreira
Honored Contributor

Re: No ACK being sent.

What about ping and other TCP connections? Can you run tcpdump on both servers? What status does netstat show on both servers? Also, what are the routing tables on the server? Maybe fragments are being dropped?
Por que hacerlo dificil si es posible hacerlo facil? - Why do it the hard way, when you can do it the easy way?
Christof Schoeman
Frequent Advisor

Re: No ACK being sent.

The strange thing about this issue is that ping and other TCP connections are not affected.

I did run a trace on both systems, and they say exactly the same thing - no packets get lost along the way.

I couldn't see anything strange in netstat. Is there anything in particular that you would look out for?

Re: No ACK being sent.

Run tcpdump on both servers at the same time, then check for fragmented packets and for any errors in the logs.

David
Al Licause
Trusted Contributor

Re: No ACK being sent.

What is the application you're running?
Is it a standard Unix app or a home-grown one?

From the dump you've supplied it's a bit hard to tell what occurred just before B stopped acknowledging A's transmissions....

It would be nice to see whether both sides were playing well, in that no packets were dropped or sent out of sequence.....and/or whether duplicates were being sent and not acknowledged, or whether the window size was reduced significantly at any time.

It appears that B simply gives up and A tries to resend the same packet repeatedly until it finally gives up. More data would be needed from both sides in order to further assist.
Christof Schoeman
Frequent Advisor

Re: No ACK being sent.

Hi there

I did indeed run a trace on both sides, but they seem to play along nicely. I've attached the trace for the other side (header information left intact, just changed the IP-addresses for ease of reference).

The application I think is called Shadowbase, and is used to replicate database tables from a Nonstop server to a Tru64 server.

If you say you would need more information to assist, what information would that be? That is basically where I'm stuck. Where does one look further?

Thanks again for your advice and questions.
Al Licause
Trusted Contributor

Re: No ACK being sent.

The last trace you provided is a bit tough to follow. The nice thing about tcpdump or Ethereal output, if that is what this is, is that you can compare it line by line and see just what is happening.

If you need more detail, you can use the other two frames in Ethereal.

This was a good next step though. Essentially, you're looking to see if each packet sent by each sender actually arrived at the receiver in about the same time frame. When one system simply stops responding, as we saw in your first dump file, it's usually that system that is having some sort of problem...either in processing the incoming data or in not receiving all the packets it should have.

If the application has any debugging capabilities, enable them as well, particularly on the side that stopped responding. From the first example it appears that the sender simply aborted the link after a period of no response from the other system.

So look for dropped packets, duplicate transmissions or similar occurrences. If you don't see any, then try to understand why the receiving system stopped responding.
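
If comparing the two dumps by eye gets tedious, the check can be scripted. Below is a minimal Python sketch, assuming you have already pulled (sequence number, length) pairs out of each capture by hand or with your own parser. The function name `compare_captures` and the sample numbers are made up for illustration and do not come from the traces in this thread.

```python
def compare_captures(sent, received):
    """Compare segments seen at the sender with segments seen at the receiver.

    sent / received: lists of (seq, length) tuples, one per data segment,
    in capture order. Returns (missing, duplicates).
    """
    received_set = set(received)
    # Segments the sender put on the wire that never show up at the receiver.
    missing = [seg for seg in sent if seg not in received_set]

    # Segments that appear more than once in the receiver's capture
    # (typically retransmissions).
    seen = set()
    duplicates = []
    for seg in received:
        if seg in seen and seg not in duplicates:
            duplicates.append(seg)
        seen.add(seg)
    return missing, duplicates


# Made-up numbers, for illustration only.
sent = [(1000, 500), (1500, 500), (2000, 500), (1500, 500)]
received = [(1000, 500), (1500, 500), (1500, 500)]

missing, duplicates = compare_captures(sent, received)
print("missing at receiver:", missing)
print("seen more than once:", duplicates)
```

If both lists come back empty and the receiver still stops acknowledging, the network path is probably not the culprit and the receiving host deserves the closer look.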


Mark Poeschl_2
Honored Contributor

Re: No ACK being sent.

It would be interesting to compare the trace (from both sides again) while the communication is proceeding normally.

Re: No ACK being sent.

Any chance this is occurring over a netRAIN interface where each member of the netRAIN is on a different layer 2 switch? I ran into something like this once a long time ago on a 2-node ES47 cluster (each node with 4 DEC602 cards comprising two 4-member netRAIN interfaces).

The problem was that the switches were getting their ARP caches poisoned by an intermittently failing NIC that was chattering and causing failovers REALLY quickly on the netRAIN set. Network traffic that forced an ARP would succeed, but stuff that relied on an ARP cache outside the server would fail until the cache was updated.

I doubt this is your problem based on the trace but it might not hurt to rule it out.

Jack
Christof Schoeman
Frequent Advisor

Re: No ACK being sent.

Al, even though the second trace is a bit tricky to follow, one can still see that every packet sent from B is received by A, in reasonable time, and the same for packets from A to B. This would mean that no packets get lost/dropped and I don't see any duplicate packets either (which should be handled by TCP anyway).

The way I understand TCP is that it should handle all the transmits, receives, acknowledgments, duplicates, timeouts or whatever makes the data go from A to B and back, RELIABLY. And the application relies on TCP for that. What I'm trying to say is that nothing the application can do would prevent TCP from acknowledging a packet that's been received. Correct me if I'm wrong, please.

What we see then, is that a packet is received, but not acknowledged. We can see this packet in both the traces. What on earth could cause this? I am therefore trying to find out what makes the receiving system stop responding. All this, while other sockets appear to be unaffected.

Mark, the traces, while things are running normally, are like watching a rather boring ping pong match. I will post a section if you really want to see it;-)

Jack, this happens over a single interface, LAG'ed interfaces, netRAIN, we've tried them all.

Oh my head hurts.
Ivan Ferreira
Honored Contributor

Re: No ACK being sent.

You may try disabling TCP delayed acknowledgment:

sysconfig -r inet tcpnodelack=1

sysconfig -q inet tcpnodelack
inet:
tcpnodelack = 1
Por que hacerlo dificil si es posible hacerlo facil? - Why do it the hard way, when you can do it the easy way?
Mark Poeschl_2
Honored Contributor

Re: No ACK being sent.

Hi Christof -

I agree - it looks like something in the IP stack of system B is messing up. From what I can see the "conversation" goes like this:

A sends bytes 1518827 through 1520279

A sends bytes 1520287 through 1521739

A sends bytes 1521747 through 1522769 with a flag indicating that all data must be delivered to the application

B says the next byte I'm expecting is 1520287

A promptly complies and resends bytes 1520287 through 1521739
....
again and again with longer gaps in between attempts (expected behavior in TCP)

A finally gives up and resets the socket.

So for some reason B "missed" the last two packets that A sent despite the fact that they fit in B's advertised Window value.
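
To make the arithmetic concrete, here is a small Python sketch of how a cumulative ACK maps onto the segments listed above. It assumes the byte ranges are inclusive, exactly as I wrote them ("X through Y"), and the helper name `unacknowledged` is just something made up for this post.

```python
def unacknowledged(segments, ack):
    """Return the segments not yet covered by a cumulative ACK.

    segments: list of (first_byte, last_byte), both ends inclusive.
    ack: the next byte the receiver says it is expecting.
    """
    # A segment is fully acknowledged only if its last byte is below `ack`.
    return [(first, last) for first, last in segments if last >= ack]


# Byte ranges as quoted in this post.
segments = [
    (1518827, 1520279),
    (1520287, 1521739),
    (1521747, 1522769),
]
ack_from_b = 1520287  # "the next byte I'm expecting"

outstanding = unacknowledged(segments, ack_from_b)
print(outstanding)
# -> [(1520287, 1521739), (1521747, 1522769)]
```

In other words, B's ACK covers only the first segment, which is why A keeps resending from 1520287.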

As previously suggested, application-level tracing would help determine what data is actually entering / leaving the IP stack on system B. A net dump of a period of normal communication might shed some light on what the normal use of the PUSH bit looks like between these two IP stacks/applications.

Are you able to fiddle with things like the TCP_NODELAY flag on the sockets on either end?
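
For reference, this is roughly what setting that flag looks like from the application side. It is a generic Python sketch using the standard `socket` module, not anything specific to Shadowbase, and whether that product lets you set socket options at all is something I don't know. In C the equivalent is a `setsockopt()` call at the `IPPROTO_TCP` level with `TCP_NODELAY`.

```python
import socket


def make_nodelay_socket():
    """Create a TCP socket with TCP_NODELAY set (Nagle's algorithm disabled)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # A non-zero value turns the option on.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


sock = make_nodelay_socket()
# Read the option back to confirm the stack accepted it (non-zero means on).
print("TCP_NODELAY on:", sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0)
sock.close()
```

Note that TCP_NODELAY controls how the sender coalesces small writes. It is a per-socket option and is separate from the system-wide tcpnodelack setting mentioned earlier, which concerns delayed ACKs.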
Al Licause
Trusted Contributor

Re: No ACK being sent.

Ah...LAG....fun stuff....:(

Which algorithm are you using?
Hopefully not round robin?

Round robin is subject to packet loss when using TCP in certain conditions.

We would also have to assume that all NICs in the LAG set are connected to the same switch.
Is this the case?

Are both the sender and the receiver using LAG?
Christof Schoeman
Frequent Advisor

Re: No ACK being sent.

Mark, thanks for confirming that I have not gone completely mad:-)

Ivan, thanks for the tcpnodelack tip. I'll be sure to give it a go.

If that doesn't work, I guess an application-level trace is all that is left.

Al, we didn't use round robin for LAG, but we did have LAG going to two different switches (do not try this at home). We have since gone to NetRAIN, also to different switches.