Re: Problem with CLOSE_WAIT

 
Praveen Bezawada
Respected Contributor

Problem with CLOSE_WAIT

Hi,
We are running a Java application on an HP rp5470 running HP-UX 11.11.
After the application has run for a few days, we notice connections left in the CLOSE_WAIT state. We suspect the problem is in the application: it is not acting on the FIN from the peer.
However, we cannot locate the PID of the process responsible (the application consists of several processes).
Strangely, lsof does not list these connections at all, although netstat -an shows them in CLOSE_WAIT.
The lsof binary that we use is:
lsof.hp version information:
revision: 4.65
latest revision: ftp://vic.cc.purdue.edu/pub/tools/unix/lsof/
latest FAQ: ftp://vic.cc.purdue.edu/pub/tools/unix/lsof/FAQ
latest man page: ftp://vic.cc.purdue.edu/pub/tools/unix/lsof/lsof_man
configuration info: PSTAT-based
constructed: Wed Oct 9 11:59:41 PDT 2002
constructed by and on: abe@hpux
compiler: /bin/cc
compiler flags: -DHPUXV=1111 -D_PSTAT64 -Ae +DD32 -DLSOF_VSTR="B.11.11" -O
loader flags: -L./lib -llsof -lnsl
system info: HP-UX hpux B.11.11 U 9000/820 2000287533 unlimited-user license

Any ideas how we can get the PID of the process that is holding these connections?
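
One way to narrow the search while lsof is not cooperating is to group the CLOSE_WAIT entries from netstat -an by local port, since the port often points at one of the application's processes. A rough sketch in Python, assuming the usual netstat column order (proto, recv-q, send-q, local address, foreign address, state) and HP-UX style addr.port addresses. The function name and the sample lines are made up for illustration and are not output from the system described here.

```python
from collections import Counter

def close_wait_by_local_port(netstat_text):
    """Count CLOSE_WAIT sockets per local port from `netstat -an` output."""
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Expected columns: proto recv-q send-q local foreign state
        if len(fields) < 6 or fields[-1] != "CLOSE_WAIT":
            continue
        local = fields[3]
        # HP-UX prints addr.port, Linux prints addr:port; take the last part.
        port = local.replace(":", ".").rsplit(".", 1)[-1]
        counts[port] += 1
    return counts

# Invented sample input, for illustration only.
sample = """\
tcp        0      0  10.0.0.5.8080      10.0.0.9.51234     CLOSE_WAIT
tcp        0      0  10.0.0.5.8080      10.0.0.9.51240     CLOSE_WAIT
tcp        0      0  10.0.0.5.22        10.0.0.7.40022     ESTABLISHED
tcp        0      0  10.0.0.5.9090      10.0.0.9.51300     CLOSE_WAIT
"""

for port, n in sorted(close_wait_by_local_port(sample).items()):
    print(port, n)
```

In practice the text would come from the real netstat -an output rather than the sample string.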

Thanks for the help.

...BPK...
9 REPLIES 9
David Child_1
Honored Contributor

Re: Problem with CLOSE_WAIT

You have probably already looked at this, but just in case, what options are you using with lsof? If you run something like:

lsof | grep

and this particular port is listed in /etc/services, you won't get anything, because lsof prints the service name instead of the port number. In this case try:

lsof -P | grep

David
Uday_S_Ankolekar
Honored Contributor

Re: Problem with CLOSE_WAIT

Try using ndd to get the TCP status:

ndd -get /dev/tcp tcp_status

-USA..
Good Luck..
Gordon  Morrison
Trusted Contributor

Re: Problem with CLOSE_WAIT

I have seen this problem many times on various flavours (TRU-64, HP-UX & Solaris). Admittedly, it is more common on TRU-64.

The system *should* go around every so often and clean up these "dead" connections (by default, 15 minutes on TRU-64), but it doesn't always.

There is a kernel parameter in TRU-64 (I forget which one; I haven't worked with TRU-64 for nearly a year) that lets you specify how long the system waits between these cleanups, but even that doesn't always work. There may be a similar parameter in HP-UX.

The only solution to this problem that I (or anyone I have ever asked) know is to reboot.
What does this button do?
Stephen Keane
Honored Contributor

Re: Problem with CLOSE_WAIT

what does

# ndd -get /dev/tcp tcp_fin_wait_2_timeout

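give? I suspect 0, in which case the connections will stay around forever (or until the next reboot). If it is zero, you could change it to, say, 10 minutes. As far as I know HP-UX ndd timers are in milliseconds, so that is 600000 (ndd -h tcp_fin_wait_2_timeout should confirm the units):

# ndd -set /dev/tcp tcp_fin_wait_2_timeout 600000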


Praveen Bezawada
Respected Contributor

Re: Problem with CLOSE_WAIT

Hi Uday
Thanks for the suggestions.

ndd -get /dev/tcp tcp_status gives me the error:
operation failed, Invalid argument

I think I need to get the patch PHNE_31965 for this.
Praveen Bezawada
Respected Contributor

Re: Problem with CLOSE_WAIT

Hi Stephen
I do not think setting the value for FIN_WAIT_2 will help in this case. A connection goes to the FIN_WAIT_2 state on my machine if my application has sent the FIN and is waiting for the FIN from the peer.
A connection goes to CLOSE_WAIT if the peer has sent the FIN but my application has not acted on it.
In a way CLOSE_WAIT and FIN_WAIT_2 are complementary. If I have CLOSE_WAIT on one side, it will be FIN_WAIT_2 on the other.

Praveen
Stephen Keane
Honored Contributor

Re: Problem with CLOSE_WAIT

In the FIN_WAIT_2 state, we have sent our FIN and the other end has acknowledged it. Unless we have done a half-close, we are waiting for the application on the other end to recognize that it has received an end-of-file notification and close its end of the connection, which sends us FIN. Only when the process at the other end does this close will our end move from FIN_WAIT_2 to the TIME_WAIT state.

This means our end of the connection can remain in this state forever. The other end is still in the CLOSE_WAIT state, and can remain there forever, until the application decides to issue its close.
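
The pairing of FIN_WAIT_2 on one end with CLOSE_WAIT on the other can be traced with a small table of the close transitions. This is only a simplified sketch of the RFC 793 state diagram (simultaneous close and timers are ignored), and the event names are invented for illustration.

```python
# Simplified TCP close transitions; event names are illustrative only.
TRANSITIONS = {
    ("ESTABLISHED", "app_close"): "FIN_WAIT_1",  # we send our FIN
    ("FIN_WAIT_1", "recv_ack"): "FIN_WAIT_2",    # peer ACKed our FIN
    ("FIN_WAIT_2", "recv_fin"): "TIME_WAIT",     # peer finally closed
    ("ESTABLISHED", "recv_fin"): "CLOSE_WAIT",   # peer sent FIN, stack ACKs it
    ("CLOSE_WAIT", "app_close"): "LAST_ACK",     # our application closes
    ("LAST_ACK", "recv_ack"): "CLOSED",          # peer ACKed our FIN
}

def run(state, events):
    """Apply events in order; unknown (state, event) pairs leave the state unchanged."""
    for event in events:
        state = TRANSITIONS.get((state, event), state)
    return state

# End A closes; the application on end B never calls close().
end_a = run("ESTABLISHED", ["app_close", "recv_ack"])
end_b = run("ESTABLISHED", ["recv_fin"])
print(end_a, end_b)  # FIN_WAIT_2 CLOSE_WAIT

# Only an application close on B lets both ends move on.
print(run(end_b, ["app_close", "recv_ack"]), run(end_a, ["recv_fin"]))  # CLOSED TIME_WAIT
```

Nothing in the table moves a connection out of CLOSE_WAIT except the application's own close.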

Many Berkeley-derived implementations prevent this infinite wait in the FIN_WAIT_2 state as follows. If the application that does the active close does a complete close, rather than a half-close indicating that it still expects to receive data, then a timer is set. If the connection is idle for 10 minutes plus 75 seconds, TCP moves the connection into the CLOSED state.

Praveen Bezawada
Respected Contributor

Re: Problem with CLOSE_WAIT

Hi Stephen
OK, I will try out this option and see if it helps.
I think I need to apply the patch anyway.

Thanks
BPK
rick jones
Honored Contributor

Re: Problem with CLOSE_WAIT

Indeed, the tcp_fin_wait_2_timeout kludge will not address connections in CLOSE_WAIT.

As the OP surmised, 99 times out of 10, a connection in CLOSE_WAIT means an application ignored a read/recv return of zero and has not decided to close at its end. This could result from:

1) an application bug

or

2) an application that actually wants a simplex (data flowing in only one direction - in this instance from the CLOSE_WAIT end to what is undoubtedly a FIN_WAIT_2 at the other end) connection
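
To illustrate case 1, here is a minimal loopback sketch of the handling that is usually missing: treat a zero-length read as end-of-file and close the socket. It is written in Python for brevity rather than in the Java the application uses, and the function name is made up; in Java the equivalent signal is read() returning -1.

```python
import socket

def drain_and_close(conn):
    """Read until the peer's FIN shows up as an empty read, then close."""
    received = b""
    while True:
        chunk = conn.recv(4096)
        if chunk == b"":      # zero-length read: the peer has sent its FIN
            break
        received += chunk
    conn.close()              # skipping this leaves the socket in CLOSE_WAIT
    return received

# Loopback demonstration on an ephemeral port.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
client = socket.create_connection(listener.getsockname())
server_side, _ = listener.accept()

client.sendall(b"hello")
client.close()                # active close: the client sends FIN
data = drain_and_close(server_side)
listener.close()
print(data)
```

If drain_and_close returned without calling close(), netstat -an would keep showing the server side in CLOSE_WAIT for as long as the process held the descriptor.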

However, both of those presume there is still an application reference to the connection. I would not expect to see a connection in CLOSE_WAIT without a corresponding process. So, that suggests one of two things:

a) lsof missed a reference somewhere

or

b) there is a small bug in the stack and TCP missed a close.

So, I would suggest a perusal of the ITRC patch database to see if any of them discuss CLOSE_WAIT and consider installing the latest transport patch or TOUR for your release.

Of course, checking for the latest patches to the application would be a good idea as well.

If you can conduct it, an interesting test would be to shut down the application completely. (Just the application, not the system as a whole. I'm assuming that shutting down the application will bring Java down as well.) If the connections go away then, it suggests the application or Java was at fault (which suggests looking for Java patches), and perhaps lsof needs some additional work.
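
To make that test easier to read, the netstat -an output taken before and after the shutdown could be compared. A hedged sketch, assuming the same column layout as usual (local address in the fourth column, foreign address in the fifth, state last); the function names and the sample snapshots are invented for illustration.

```python
def close_wait_pairs(netstat_text):
    """Return the set of (local, foreign) address pairs in CLOSE_WAIT."""
    pairs = set()
    for line in netstat_text.splitlines():
        fields = line.split()
        if len(fields) >= 6 and fields[-1] == "CLOSE_WAIT":
            pairs.add((fields[3], fields[4]))
    return pairs

def compare_snapshots(before_text, after_text):
    """Split the earlier CLOSE_WAIT connections into cleared and remaining."""
    before = close_wait_pairs(before_text)
    after = close_wait_pairs(after_text)
    return {"cleared": before - after, "remaining": before & after}

# Invented snapshots, for illustration only.
before = """\
tcp  0  0  10.0.0.5.8080  10.0.0.9.51234  CLOSE_WAIT
tcp  0  0  10.0.0.5.8080  10.0.0.9.51240  CLOSE_WAIT
"""
after = """\
tcp  0  0  10.0.0.5.8080  10.0.0.9.51240  CLOSE_WAIT
"""

result = compare_snapshots(before, after)
print(sorted(result["cleared"]))
print(sorted(result["remaining"]))
```

Connections that clear point at the application or Java; any that remain after every application process has exited would point more towards the stack.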
there is no rest for the wicked yet the virtuous have no pillows