TCP Zero Window Troubleshooting

Chris Naudé · ‎04-15-2008

Problem background: We have a home-grown application with a client and server configuration. About once a week the server stops accepting inbound traffic from the client.

We currently have our send and receive buffers at their default values. I can understand the buffers filling up while the server catches up on processing. The issue seems to be that the receive buffer fills up and the server then never reads anything from it. It continues to happily send outbound packets all day long.

We're completely perplexed by this. I've been using tusc to trace the syscalls during these outages, but that has provided nothing useful.

I have a feeling the server reaches some sort of limit to cause it to stop processing inbound messages. It gets stuck and refuses to pull data off of the full queue.

Does anyone have any idea what could cause this sort of thing? I've been using tusc and nettl. Are there any other metrics that I should be looking at that might help?

This is running on HP-UX 11.11 PA-RISC.

Matti_Kurkela · ‎04-15-2008

What you'll need is an understanding of the application's internal state at the time it stops/has stopped reading the inbound packets.

This is almost certainly an application-level problem: the buffers are getting full because the server program is not issuing any recv()/recvfrom()/recvmsg() system calls to read the incoming data from the network socket. So get the source code and start finding out how the tusc trace matches up with whatever *should* be happening.
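One way to confirm this from inside the process is to ask the kernel how many bytes are sitting unread on the socket. The following is only a minimal sketch in C, assuming the platform supports the FIONREAD ioctl on sockets; pending_bytes is a hypothetical helper name and not something taken from the application.

```c
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns the number of bytes queued in the
 * socket's receive buffer that the program has not read yet,
 * or -1 if the ioctl fails. */
static int pending_bytes(int fd)
{
    int n = 0;

    if (ioctl(fd, FIONREAD, &n) < 0)
        return -1;
    return n;
}
```

If a value like this keeps growing (or stays pinned near the buffer size) while the process is otherwise running, that would be consistent with data arriving but nobody calling recv() on that socket.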

Your server program probably has a loop of some sort, which makes the program wait until some data arrives, then reads the incoming data, starts any necessary processing and then waits for more data. You should examine what is happening in this loop. If this loop uses a counter variable of any sort, think about what happens when (not if!) it overflows.
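For orientation, such a loop often looks roughly like the sketch below. This is an assumption about the general shape, not the actual code of the application in question: the function names, the 4096-byte buffer and the timeout handling are all made up for illustration, and it uses select() and recv() only.

```c
#include <stddef.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Wait up to timeout_ms for data on fd, then read at most len bytes.
 * Returns bytes read, 0 if the peer closed, -1 on error, -2 on timeout. */
static ssize_t wait_and_read(int fd, char *buf, size_t len, int timeout_ms)
{
    fd_set readable;
    struct timeval tv;
    int ready;

    FD_ZERO(&readable);
    FD_SET(fd, &readable);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    ready = select(fd + 1, &readable, NULL, NULL, &tv);
    if (ready < 0)
        return -1;
    if (ready == 0)
        return -2;
    return recv(fd, buf, len, 0);
}

/* Hypothetical receive loop: keeps reading until the socket goes idle,
 * the peer closes, or an error occurs. Returns the total bytes consumed. */
static long serve_until_idle(int fd, int timeout_ms)
{
    char buf[4096];
    long total = 0;

    for (;;) {
        ssize_t n = wait_and_read(fd, buf, sizeof buf, timeout_ms);
        if (n <= 0)
            break;          /* timeout, close or error ends the loop */
        total += n;         /* a real server would process buf here */
    }
    return total;
}
```

The thing to look for in the real code is any path on which the loop stops reaching the recv() call, for example an exit condition that becomes permanently true or a processing step that never returns.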

If you find a suspicious counter variable, you might add an offset to it, so that instead of starting from zero the counter starts from (MAXVALUE - ). You can subtract this offset out if you need to actually use this value. This change makes the overflow happen very soon after program startup, so if the problem is related to that, it should be easier to diagnose.
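A minimal sketch of that offset trick in C follows. The names and the margin of 3 are hypothetical, chosen only so the wrap-around happens after a handful of increments; a real test would pick whatever margin suits the program.

```c
#include <stdint.h>

/* Hypothetical margin: number of increments left before the
 * counter reaches its maximum value. */
#define WRAP_MARGIN   3u
#define COUNTER_START ((uint32_t)(UINT32_MAX - WRAP_MARGIN))

/* Advance the raw counter; unsigned arithmetic wraps to 0 after
 * UINT32_MAX instead of overflowing. */
static uint32_t counter_next(uint32_t raw)
{
    return (uint32_t)(raw + 1u);
}

/* Subtract the offset again to get the value the counter would
 * have had if it had started from zero. */
static uint32_t counter_logical(uint32_t raw)
{
    return (uint32_t)(raw - COUNTER_START);
}
```

Starting from COUNTER_START, the raw counter wraps to 0 on the fourth increment, while counter_logical() keeps counting 1, 2, 3, 4, 5 across the wrap. The sketch uses an unsigned type on purpose: overflow of a signed integer is undefined behavior in C, so a signed counter in the real program would deserve extra suspicion.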

If your server program is multithreaded, that opens an entirely different can of worms - you must ensure no pair of threads can indefinitely block each other from proceeding, or else you risk a deadlock.
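One common way to rule out that kind of mutual blocking is to always take locks in the same global order. The sketch below assumes POSIX threads; struct resource, lock_pair and move_units are hypothetical names for illustration and say nothing about how the application in this thread is actually written.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical shared object protected by its own mutex. */
struct resource {
    pthread_mutex_t lock;
    long units;
};

/* Lock two resources in a fixed order (lowest address first), so two
 * threads calling with swapped arguments cannot each hold one lock
 * while waiting for the other. */
static void lock_pair(struct resource *a, struct resource *b)
{
    if ((uintptr_t)a > (uintptr_t)b) {
        struct resource *tmp = a;
        a = b;
        b = tmp;
    }
    pthread_mutex_lock(&a->lock);
    if (a != b)
        pthread_mutex_lock(&b->lock);
}

static void unlock_pair(struct resource *a, struct resource *b)
{
    pthread_mutex_unlock(&a->lock);
    if (a != b)
        pthread_mutex_unlock(&b->lock);
}

/* Example operation that needs both locks at once. */
static void move_units(struct resource *from, struct resource *to, long n)
{
    lock_pair(from, to);
    from->units -= n;
    to->units += n;
    unlock_pair(from, to);
}
```

With this in place, one thread calling move_units(&x, &y, ...) and another calling move_units(&y, &x, ...) both try to lock the same mutex first, so neither can end up holding one lock while waiting forever for the other.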

Multiprocessing-oriented programming languages and other programming tools can limit the risk of deadlocks, but won't completely eliminate it unless the tools become smarter than their user :-/

MK

Chris Naudé · ‎04-16-2008

Thanks for the information.

I do have access to the source code, so I'll start poring over it. Unfortunately I'm not a C++ programmer by profession, though I have had formal training in C and C++. I have also given the tusc output to the developers.

The server is multithreaded.

YAQUB_1 · ‎04-16-2008

Hi Chris,

I think you have misunderstood MK; basically he is trying to understand your system's kernel settings. "MAXFILES" sets the soft limit on the number of files a process is allowed to have open simultaneously.
Minimum value: 30; maximum value: 6000; standard value: 512.

Note: To be useful, the value assigned to maxfiles must be less than the value of maxfiles_lim. maxfiles_lim is useful only if it does not exceed the limits imposed by nfile and ninode.

If you got your answer, please assign points...

Thanks--Yaqub
I am a Customer Advocate!!!

Chris Naudé · ‎04-16-2008

I wish it was as simple as MAXFILES. I've actually verified that the application doesn't open very many files at all. It only opens a few files at a time while it runs. The socket connections are minimal as well.

What I'm going to do next is monitor each thread individually. There are only about 30 threads. If one of those threads starts to misbehave, I will hopefully be able to figure out what part of the application is broken. We're also going to add additional logging so that we can match up the TID with a specific function.
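The extra logging could look something like the sketch below. It is only an illustration under some assumptions: format_thread_log, THREAD_LOG and handle_request are hypothetical names, the compiler is assumed to support __func__, and casting pthread_t to unsigned long only works where pthread_t is an integer type (POSIX does not guarantee that). The value printed may also differ from the TID shown by system tools, so it would need to be checked against them once.

```c
#include <pthread.h>
#include <stdio.h>

/* Format "[tid N] function: message" into buf.
 * Assumption: pthread_t can be cast to unsigned long on this platform. */
static int format_thread_log(char *buf, size_t len,
                             const char *func, const char *msg)
{
    return snprintf(buf, len, "[tid %lu] %s: %s",
                    (unsigned long)pthread_self(), func, msg);
}

/* Convenience macro that fills in the calling function's name
 * and writes the line to stderr. */
#define THREAD_LOG(msg)                                              \
    do {                                                             \
        char line_[256];                                             \
        format_thread_log(line_, sizeof line_, __func__, (msg));     \
        fprintf(stderr, "%s\n", line_);                              \
    } while (0)

/* Hypothetical application function showing how the macro is used. */
static void handle_request(void)
{
    THREAD_LOG("entered");
}
```

Logging on entry and exit of the main functions each thread runs would make it possible to see which function a thread was last in when it stopped making progress.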

Lastly, what I would really like to do is run tusc at the exact moment the outage occurs. I just need to roll the tusc log so that I don't fill up the disk. I also need to determine whether tusc has any impact on the attached process.

Chris Naudé · ‎04-21-2008

We have finally found out where the problem is. I was running GlancePlus during the outage and looking at all of the threads. When the event occurs, a new thread is created. That thread's CPU usage goes to 100% and its stop reason stays at PRI. During this time all other threads drop to 0% CPU. The trick now is to figure out what the thread is doing, but the developers now have something to go on. If needed I'll run wdb and gather some stack traces for the problem thread.
