<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: FIN_WAIT_2 / CLOSE_WAIT in Operating System - HP-UX</title>
    <link>https://community.hpe.com/t5/operating-system-hp-ux/fin-wait-2-close-wait/m-p/3909896#M284364</link>
    <description>I agree that it's probably an application bug, but all too often we end up fixing application bugs with band-aids on the system...&lt;BR /&gt;&lt;BR /&gt;Have you experimented (carefully) with the tcp_fin_wait_2_timeout parameter? It's specific to the FIN_WAIT_2 state, so it probably won't help you with CLOSE_WAIT. But I think both of those could be caused by the same kind of application error on opposite ends of the connection.&lt;BR /&gt;</description>
    <pubDate>Fri, 08 Dec 2006 16:17:38 GMT</pubDate>
    <dc:creator>Heironimus</dc:creator>
    <dc:date>2006-12-08T16:17:38Z</dc:date>
    <item>
      <title>FIN_WAIT_2 / CLOSE_WAIT</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/fin-wait-2-close-wait/m-p/3909895#M284363</link>
      <description>EMC (Legato) Networker backup software is running on an rx4640 under HP-UX 11.23. Every five or six days we see the message "Too many open files" in the Networker logfile and no more backups are possible; we have to restart Networker. &lt;BR /&gt;After a restart we see fewer sockets in CLOSE_WAIT. After five or six days of running we see more than 2000 sockets in CLOSE_WAIT / FIN_WAIT_2. What we found out is that these are more than 1000 socket pairs: one end of each connection is in FIN_WAIT_2, the other in CLOSE_WAIT. All these sockets are held open by a single user process (nsrjobd).&lt;BR /&gt;tcp 0 0 localhost.50002 localhost.50001 FIN_WAIT_2&lt;BR /&gt;tcp 0 0 localhost.50001 localhost.50002 CLOSE_WAIT&lt;BR /&gt;..............&lt;BR /&gt;tcp 0 0 localhost.50621 localhost.50620 FIN_WAIT_2&lt;BR /&gt;tcp 0 0 localhost.50620 localhost.50621 CLOSE_WAIT&lt;BR /&gt;................&lt;BR /&gt;We changed the following parameters:&lt;BR /&gt;tcp_time_wait_interval 60000&lt;BR /&gt;tcp_conn_request_max 4096&lt;BR /&gt;tcp_ip_abort_interval 60000&lt;BR /&gt;tcp_keepalive_interval 900000&lt;BR /&gt;but this did not help.&lt;BR /&gt;We believe that there is an application bug, but Legato support is at a loss.&lt;BR /&gt;Are there any ideas on what we can do to close these sockets?
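&lt;BR /&gt;&lt;BR /&gt;To watch the leak grow between restarts, here is a minimal shell sketch (assuming the lsof port is installed - it is not part of base HP-UX - and that the nsrjobd PID lookup, which just picks the first match, fits your setup):&lt;BR /&gt;&lt;BR /&gt;# count sockets stuck in each state&lt;BR /&gt;netstat -an | grep -c CLOSE_WAIT&lt;BR /&gt;netstat -an | grep -c FIN_WAIT_2&lt;BR /&gt;# count file descriptors held by nsrjobd (lsof assumed installed)&lt;BR /&gt;PID=$(ps -ef | grep nsrjobd | grep -v grep | awk '{ print $2 }' | head -1)&lt;BR /&gt;lsof -p $PID | wc -l&lt;BR /&gt;&lt;BR /&gt;Run from cron, the counts should show whether the pairs accumulate steadily or in bursts.</description>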
      <pubDate>Fri, 08 Dec 2006 06:54:54 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/fin-wait-2-close-wait/m-p/3909895#M284363</guid>
      <dc:creator>Guenter Lehmann</dc:creator>
      <dc:date>2006-12-08T06:54:54Z</dc:date>
    </item>
    <item>
      <title>Re: FIN_WAIT_2 / CLOSE_WAIT</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/fin-wait-2-close-wait/m-p/3909896#M284364</link>
      <description>I agree that it's probably an application bug, but all too often we end up fixing application bugs with band-aids on the system...&lt;BR /&gt;&lt;BR /&gt;Have you experimented (carefully) with the tcp_fin_wait_2_timeout parameter? It's specific to the FIN_WAIT_2 state, so it probably won't help you with CLOSE_WAIT. But I think both of those could be caused by the same kind of application error on opposite ends of the connection.&lt;BR /&gt;
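&lt;BR /&gt;For reference, a hedged sketch of checking and changing it with ndd (the value is in milliseconds; 600000 is only an illustration, and if memory serves the default of 0 means FIN_WAIT_2 never times out - verify with ndd -h):&lt;BR /&gt;&lt;BR /&gt;# show the current setting (milliseconds; 0 assumed to mean no timeout)&lt;BR /&gt;ndd -get /dev/tcp tcp_fin_wait_2_timeout&lt;BR /&gt;# try a 10-minute timeout; easy to revert with another -set&lt;BR /&gt;ndd -set /dev/tcp tcp_fin_wait_2_timeout 600000&lt;BR /&gt;&lt;BR /&gt;To make a value like this survive a reboot it would go into /etc/rc.config.d/nddconf.</description>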
      <pubDate>Fri, 08 Dec 2006 16:17:38 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/fin-wait-2-close-wait/m-p/3909896#M284364</guid>
      <dc:creator>Heironimus</dc:creator>
      <dc:date>2006-12-08T16:17:38Z</dc:date>
    </item>
    <item>
      <title>Re: FIN_WAIT_2 / CLOSE_WAIT</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/fin-wait-2-close-wait/m-p/3909897#M284365</link>
      <description>FIN_WAIT_2 means that this end of the connection has sent a FINished segment, and it has been ACKnowledged by the "remote" TCP. This end of the connection is now waiting for a FIN from the remote, hence FIN_WAIT_2 (FIN_WAIT_1 is when we are waiting for an ACK of our FIN).&lt;BR /&gt;&lt;BR /&gt;When the FINished segment arrived, the socket associated with that end of the connection would have become "readable", and a read/recv against the socket would have returned zero to indicate to the application that the remote had said (at least) it would be sending no more data.&lt;BR /&gt;&lt;BR /&gt;Unless the connection is supposed to remain up as a "simplex" connection (unidirectional toward the end which sent the FIN), the next logical step is for the application to call close(). Hence this side goes into the CLOSE_WAIT state - we are waiting for this side to call close().&lt;BR /&gt;&lt;BR /&gt;So, 99 times out of ten what happens is either the application has "ignored" or "forgotten" the read return of zero, or it has forked and forgotten to clean up a dangling file descriptor reference.&lt;BR /&gt;&lt;BR /&gt;The FIN_WAIT_2 timer is a massive kludge. 99 times out of ten I wish it weren't there, because it is used to cover the backside of fundamentally broken applications with bugs that never should have left the lab.&lt;BR /&gt;&lt;BR /&gt;If you want to close the sockets, kill the processes.&lt;BR /&gt;&lt;BR /&gt;FWIW, none of the original ndd settings in the base post would have any effect on this - tcp_time_wait_interval is just for TIME_WAIT, tcp_conn_request_max controls the maximum depth of a listen queue, tcp_ip_abort_interval is how long we wait for an ACK of data, and tcp_keepalive_interval is just for sockets that set SO_KEEPALIVE. There is tcp_keepalive_detached_interval, but that is for catching situations where we are in FIN_WAIT_2 and the remote connection is just _gone_, not simply sitting in CLOSE_WAIT.&lt;BR /&gt;&lt;BR /&gt;So, hold Legato's feet to the fire and make them find and fix what is 99% likely to be their bug. If you want to try to "catch" it, you could consider starting to take tusc traces - although doing so from startup could result in some rather long trace files... To be complete, there is a &amp;lt; 1% chance it is a bug in the stack failing to notify, but the chances of that are epsilon.
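&lt;BR /&gt;&lt;BR /&gt;If tusc is the route taken, something along these lines should catch the missing close() on the descriptor whose read returned zero (the -f and -o flags are quoted from memory, so check tusc(1); attaching to the running daemon avoids tracing from startup):&lt;BR /&gt;&lt;BR /&gt;# attach to the running nsrjobd, follow any children, log syscalls to a file&lt;BR /&gt;# (flags assumed: -f follows forks, -o names the trace file - verify with tusc(1))&lt;BR /&gt;PID=$(ps -ef | grep nsrjobd | grep -v grep | awk '{ print $2 }' | head -1)&lt;BR /&gt;tusc -f -o /tmp/nsrjobd.tusc $PID&lt;BR /&gt;&lt;BR /&gt;Grepping the trace for a read(...) = 0 and checking whether a close() on the same descriptor ever follows should show the leak directly.</description>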
      <pubDate>Mon, 11 Dec 2006 12:37:32 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/fin-wait-2-close-wait/m-p/3909897#M284365</guid>
      <dc:creator>rick jones</dc:creator>
      <dc:date>2006-12-11T12:37:32Z</dc:date>
    </item>
  </channel>
</rss>