Re: TCP requests dropped due to full queue

D. Toth · ‎04-09-2006

Ok. Hi all. This is my first post so I am sorry for anything I do wrong.

I work for a field operations group that sets up and tears down HP-UX 11i networks by the month. For years now we have had a problem with one application that operates over NFS. The application is just a GUI that reads and writes to a flat DB file. It updates by reading the file every 30 seconds (I believe). So here is what happens:

As more and more computers mount the NFS server, there comes a time when the application on one of the systems hangs. bdf also hangs when trying to access this mount point. When I go to the server and look at the TCP statistics with 'netstat -ptcp' I see many "connection requests dropped due to full queue". I thought I had come up with a hard number of 51 active nfsd sockets, as shown by 'netstat | grep nfsd', when the problem occurs. But recently the problem has come up with fewer nfsd connections. These connections can be "ESTABLISHED", "CLOSE_WAIT" or "FIN_WAIT..."
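Counting the nfsd sockets per TCP state by hand gets tedious, so here is a minimal Python sketch that tallies them from saved netstat output. The column layout it expects (protocol, two queue counters, local address, foreign address, state) and the sample host names are assumptions for illustration; check them against what netstat actually prints on your server.

```python
from collections import Counter


def count_nfsd_states(netstat_output: str) -> Counter:
    """Count TCP states for sockets whose local address ends in '.nfsd'."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed layout: proto recv-q send-q local-addr foreign-addr state
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        if fields[3].endswith(".nfsd"):
            states[fields[5]] += 1
    return states


# Made-up sample lines, not captured from a real system.
SAMPLE = """\
tcp        0      0  server.nfsd            clienta.1021           ESTABLISHED
tcp        0      0  server.nfsd            clientb.1019           CLOSE_WAIT
tcp        0      0  server.nfsd            clientc.1022           ESTABLISHED
udp        0      0  *.nfsd                 *.*
"""

for state, count in sorted(count_nfsd_states(SAMPLE).items()):
    print(state, count)
```

Running this against snapshots taken every few minutes would show whether CLOSE_WAIT sockets keep accumulating before the hang.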

Originally all the mounts were NFSv3, so I tried to make all of them NFSv2 but that did not help either. I have many other servers, all with the same OS load, with many more NFS mounts to them that do not have this problem.

I have read many posts on how the application's call to listen() may not have enough 'backlog' space, but I cannot see how this would relate to an app that is just reading a file. It has no idea it is over NFS.

I tried increasing tcp_conn_request_max, but that just means that the queue takes longer to fill before the stack throws requests away.

I did an experiment last year where I watched the sockets being established, then watched them go through all their states until they were cleared. NFSv2 took about 5 minutes and NFSv3 never cleared. That was why I initially thought it was an NFSv3 issue. But now that I am only using v2 on that system, and still have the problem, I have ruled that out.

I have used many combinations of nfsstat, rpcinfo and netstat and still cannot find anything that tells me why TCP is dropping all these requests. I tried increasing the number of nfsd and biod daemons (although I have read that setting biod to 0 on the client might be better) and neither helped. But I still think it is a server issue.

Today I discovered nfsstat -m. That shows me what version of NFS a mount is using and what protocol is being used. I found that all the nfs mounts that showed statistics from nfsstat -m said they were using udp. All the ones that had no statistics showed tcp. I thought that under NFSv3, tcp was tried first and if a connection could not be made then NFS tried udp. So I am confused on that one.

If anyone has made it this far in the post and thinks they have any ideas, please let me know.
I would really like to know if there is a way to find out what is leaving the server TCP queue. If it is filling up, how do I find out what is in it and what the processes that are supposed to be servicing it are doing?

Anyways, that was many years of ranting. Sorry.

Thanks.

Sung Oh · ‎04-10-2006

Hi Toth,

You can increase the maximum number of kernel threads on the NFS server side.

Increase the "max_thread_proc" value.

Best Regards,

Sung

D. Toth · ‎04-10-2006

Is there a way to find out if the nfsd has run out of threads? Would this be logged anywhere? ps | grep nfsd shows 20 daemons (as set in nfsconf) and netstat | grep nfsd shows 33 active sockets (.nfsd). So how many sockets can each nfsd process service? Are the processes tied to each "ESTABLISHED" connection or do they just accept requests in order from all connections?

Thanks.

rick jones · ‎04-10-2006

Indeed, increasing tcp_conn_request_max is one response to connections dropped due to full queue - the actual queue will be the minimum of what your application passes in its listen() call and what you have set tcp_conn_request_max to.
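To make that rule concrete, here is a minimal Python sketch. The function name `effective_backlog` and the sample numbers are illustrative only; nothing here is read from a real HP-UX system.

```python
def effective_backlog(listen_backlog: int, tcp_conn_request_max: int) -> int:
    """Return the listen queue depth per the rule above: the smaller of the
    backlog passed to listen() and the system-wide tcp_conn_request_max."""
    if listen_backlog < 0 or tcp_conn_request_max < 0:
        raise ValueError("queue sizes must be non-negative")
    return min(listen_backlog, tcp_conn_request_max)


# Raising the system cap alone does not deepen the queue if the
# application still asks for a small backlog, and vice versa.
print(effective_backlog(listen_backlog=20, tcp_conn_request_max=4096))
print(effective_backlog(listen_backlog=1024, tcp_conn_request_max=20))
```

Both calls print 20, which is why bumping only one of the two values may not change the behaviour.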

If you bump it and you still get connection requests being dropped, it suggests that your application is being prevented from calling accept() with sufficient frequency. Perhaps the time it takes to write() to the NFS-mounted DB file is at issue.

There is a "queue" per listen socket. It used to be possible on 10.20 to see which specific listen endpoints were filling but that information isn't in netstat with 11.0 and later. It would require unpublished internals knowledge. Still, probably worth submitting an ER via the RC.

Ah, I just went back and saw where your application is not accepting TCP connections, it is simply writing to a file. Then you can mostly disregard the above :) :( :)

Is there anything else besides NFS happening on the server? NFS mounts _should_ be fairly static - when over TCP the connections shouldn't be coming and going with any great rapidity. I don't know that an NFS server will initiate a connection close, I thought it was the client, but then the server has to deal with dead clients somehow. Just what is the rate of TCP connection establishment and tear-down?
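One rough way to answer the rate question is to take two snapshots of the TCP counters a known interval apart and divide the differences by the interval. A minimal Python sketch, assuming you have already copied the counters into dictionaries by hand; the key names and numbers below are hypothetical labels, not the exact wording netstat prints.

```python
def per_second_rates(before: dict, after: dict, interval_s: float) -> dict:
    """Return the per-second change of every counter present in both snapshots."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return {
        name: (after[name] - before[name]) / interval_s
        for name in before
        if name in after
    }


# Hypothetical counter values taken 60 seconds apart.
SNAP_T0 = {"requests": 1200, "accepts": 1100, "dropped_full_queue": 40}
SNAP_T1 = {"requests": 1320, "accepts": 1160, "dropped_full_queue": 100}

for name, rate in per_second_rates(SNAP_T0, SNAP_T1, 60.0).items():
    print(f"{name}: {rate:.2f}/s")
```

If the request rate is high for what should be a mostly static set of NFS mounts, something is repeatedly opening connections.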

Perhaps the disc(s) serving the filesystem(s) being exported by the NFS server are getting saturated and the nfsd's et al are backing-up on that?

there is no rest for the wicked yet the virtuous have no pillows

D. Toth · ‎04-12-2006

Thank you Sung and Rick for responding (I will figure out this whole point reward thing soon).

I found an old HP document that seems to reference my exact problem, and the date is appropriate also (we are using a 3rd party vendor load which is the Feb. 2001 release of 11i). It is titled "NFS Performance tuning for HPUX 11.0 and 11i systems" and is dated July 2002 (written by Dave Olker). On page 118 it states that there was a bug in the initial release of the 11i NFS server that makes the server stop responding to NFS/TCP requests after the max number of threads has been reached. It also references an "NFS Patch" that was released "in the Summer of 2001" to fix this. I have searched for 'NFS' in the ITRC patch section and cannot find one that directly addresses this problem.

Rick (and I am assuming that the HP logo in the forum means you work for HP), is there any way you can do a search from your side and see if you can find it? There was one in Sept. 2001 that talks about NFS deadlock and one from Apr. 2002 about threads and NFS, but neither references the TCP issue.

To Sung, there was a comment on the same page that a workaround is to increase max_thread_proc, so thank you for that suggestion. I rebuilt the kernel today and we will see what happens.

Another question if I may. Every mount to this server is done as follows:
mount -o vers=2 ... ...
BUT, the mount takes 75 seconds to complete and on the server I see 6 TCP connection requests dropped. Then, once the client times out, the connection is made with UDP. I thought that NFSv2 was only over UDP and only v3 tried TCP first, then dropped down to UDP. Why is "mount -o vers=2" trying TCP first?

Thank you.

Ermin Borovac · ‎04-12-2006

I believe the fix for the NFS/TCP thread exhaustion problem was delivered in PHNE_22642 for 11.0 and PHNE_23502 for 11.11 systems. PHNE_23502 is quite old now, so you are better off installing the latest HP recommended NFS patch, which is PHNE_32477.

( SR:8606167053 CR:JAGad36339 )
An NFS/TCP client operation receives "NFS server not
responding still trying" messages while attempting to access
the server, even though the server system is up. The server
displays "vmunix: WARNING: tcpd_thread_create: thread_create
failed: 11" messages in /var/adm/syslog/syslog.log.

The patch with the fix does two things:

(1) prints the warning "vmunix: WARNING: tcpd_thread_create: thread_create failed: 11" in syslog.log when max_thread_proc limit is reached
(2) nfsktcpd doesn't stop servicing requests

This means that even with the patch installed (on your NFS server and all clients) you still need to adjust max_thread_proc (per-process limit) and possibly nkthread (system-wide limit).
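Since the patched server logs that warning, a quick scan of the syslog file answers the earlier question of whether thread exhaustion is logged anywhere. A minimal Python sketch, assuming the log has been read into a string; the message text comes from the patch description above, while the timestamps and host name in the sample are made up.

```python
WARNING_TEXT = "tcpd_thread_create: thread_create failed"


def thread_create_failures(syslog_text: str) -> list:
    """Return every syslog line that reports a failed NFS/TCP thread creation."""
    return [line for line in syslog_text.splitlines() if WARNING_TEXT in line]


# Made-up sample; on a real server you would read /var/adm/syslog/syslog.log.
SAMPLE_LOG = """\
Apr 12 09:14:02 nfssrv vmunix: WARNING: tcpd_thread_create: thread_create failed: 11
Apr 12 09:14:05 nfssrv inetd[812]: some unrelated message
Apr 12 09:15:40 nfssrv vmunix: WARNING: tcpd_thread_create: thread_create failed: 11
"""

hits = thread_create_failures(SAMPLE_LOG)
print(f"{len(hits)} thread_create failures found")
```

Any hits suggest the per-process or system-wide thread limit is still being reached and needs raising.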

rick jones · ‎04-13-2006

TCP vs UDP for the mounts is independent of NFSv2 vs NFSv3 for the protocol of the NFS messages exchanged, so asking for a version2 mount does not necessarily imply it will also ask for UDP.

It just happens that HP-UX didn't include support for NFS over TCP until it included support for NFS v3. IIRC.

As for the patches, I'd be inclined to simply install whatever you found for latest NFS client and/or server patches - and their dependencies of course :)

there is no rest for the wicked yet the virtuous have no pillows

Jeff Schussele · ‎04-13-2006

Anybody seen Dave Olker?
Must be on vacation - this is his domain - for sure.

NPP,
Jeff

PERSEVERANCE -- Remember, whatever does not kill you only makes you stronger!

D. Toth · ‎04-15-2006

Thank you all.

I am currently examining all the patches that relate to NFS for 11.11 for dependencies.
But why is it that if I search the patch database for HP-UX on 700s with 11.11 I do not get PHNE_23502 in the list? There are other patches in the list that have been superseded.
As a note, since I changed the max_thread_proc kernel value I have not had any problems. I am hoping this is not just because it takes longer to tie up 256 threads than 64.

D. Toth · ‎04-16-2006

Ermin (and all);

I found PHNE_24909 which makes three references to my issues:

1. SR 8606168123, JAGad37405, which talks about sockets stuck in the CLOSE_WAIT state (this I see a lot of),
2. SR 8606167053, JAGad36339, which talks about the inability of nfsktcpd to create new threads, and
3. SR 8606144478, JAGad13818, which references clients sending FIN signals, but nothing happens.

UNFORTUNATELY, this patch is for 11.00!

In PHNE_23502 (which is for 11.11) there is only a reference to number 2 above. How can I search the patch database for these SR numbers? I would really like to find out the number of the patch for 11.11 that references number 1 above. Is there some other place (on the web) where you can search through the full patch details?

Thanks again all.
