D Heindel
Occasional Advisor

NFS - lockd hanging

Twice over the last week our application has locked up while locking files on a remote system accessed via NFS. The NFS client server (running the application) is running 11.23; the NFS host server is running 11.31. When this happens, the lockd process on the NFS host becomes unresponsive. It does not respond to any kill signals (including kill -17 and kill -9), so I am unable to turn on logging to get any information about the cause of the problem, and the only way to resolve the issue is to reboot the NFS host server.

When this occurred yesterday I did turn on logging on the client server. The results are below (there was a lot of repetition in the file; I tried to extract the unique information). Logging was only turned on for a few minutes and the file is huge (~16MB), so I am hesitant to turn on logging full time on the NFS server. But I can if that is the only way to diagnose the problem should it happen again.

This post (http://forums11.itrc.hp.com/service/forums/questionanswer.do?admit=109447626+1273299695512+28353475&threadId=1052530) references using the -C option for lockd to solve interoperability problems between 11.23 and 11.31 systems. I did turn this on yesterday (by stopping and restarting the statd and lockd processes on the client server) when the problem occurred, but it did not fix it (I am hoping that is just because the lockd process on the host server was already hosed). Do the problems we are having sound like they would be caused by this -C option not being turned on? Are there other settings we should look at? Is it impossible to tell without logging from the host server?

Additional Facts:
• We have had this setup between the servers up and running for around 4 months. Everything was fine until a week ago. We did add additional users about 10 days before the issue first occurred. This roughly doubled the number of connections between the servers.
• The NFS client server has two network cards, so the traffic from the client may be coming from two different IP addresses.

Any help would be much appreciated. Let me know if any additional information would be useful. Thanks.
Steven E. Protter
Exalted Contributor

Re: NFS - lockd hanging

Shalom,

HP-UX 11.31, which includes NFSv4, does include a more robust locking mechanism.

This appears to be a user volume issue.

>>
We have had this setup between the servers up and running for around 4 months. Everything was fine until a week ago. We did add additional users about 10 days before the issue first occurred. This roughly doubled the number of connections between the servers.
<<

I would check the back-end server for high-water marks with glance/gpm: make sure there are enough file handles and that maxuprc allows enough processes for a single user (perhaps the application user is hitting this limit).
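If glance/gpm isn't available, a rough command-line sketch of the same checks (these are standard HP-UX kctune tunables, but verify the names on your release; "appuser" is a placeholder for your application user):

# Show the configured values of the tunables mentioned above
kctune nproc
kctune maxuprc
kctune nfile
# Roughly count the application user's processes to compare against maxuprc
# (the output includes one header line)
ps -u appuser | wc -l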

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
D Heindel
Occasional Advisor

Re: NFS - lockd hanging

Unfortunately we do not have glance on the server. Are you concluding that it is a number-of-users issue based on the fact that we recently increased the number of users on the system, or is something in the log file indicating that?
Dave Olker
HPE Pro
Solution

Re: NFS - lockd hanging

The logging data you collected from the 11.23 client is not enough to tell me what the problem is. The only thing I can say with any degree of certainty looking at this data is that the rpc.lockd on the NFS server is either not replying to the client's lock requests or is replying to the wrong port number on the client, because the client's rpc.lockd never gets the reply.

Here's the only real interesting thing in this logfile:
________________________________________

LOCKD QUEUES:
***** granted reclocks *****
***** msg queue *****
(4012a350), oh=ux01 29103, svr=ux05.quadax.net,
fh=400100020000000a000a000000003900673da64f000a000000000002000049e0,
op=2, range=[0, 2199023255552], client=ux01, cookie=5462873b
proc= 7, (msg reply = NULL)

(4012a2a0), oh=ux01 20030, svr=ux05.quadax.net,
fh=400100020000000a000a00000000203924ad37fb000a000000000002000049e0,
op=2, range=[0, 2199023255552], client=ux01, cookie=7d4dd4bf
proc= 7, (msg reply = NULL)

(4012a1f0), oh=ux01 15103, svr=ux05.quadax.net,
fh=400100020000000a000a00000000390119b4ad0a000a000000000002000049e0,
op=2, range=[0, 2199023255552], client=ux01, cookie=6a96ceb6
proc= 7, (msg reply = NULL)

(40129e80), oh=ux01 20967, svr=ux05.quadax.net,
fh=400100020000000a000a000000003c6d37b6ae37000a000000000002000049e0,
op=2, range=[0, 2199023255552], client=ux01, cookie=7a7d7082
proc= 7, (msg reply = NULL)
...


The first couple lines show me there are currently no locks being held by this client because the Granted queue is empty. The Message queue has many entries with a message reply status of NULL. This means the client has sent all these lock requests (and we know they are lock requests because the PROC= 7 maps to NLM_LOCK_MSG in the rpc.lockd protocol) to the server and never received a reply (reply = NULL).

So the client is trying desperately to lock files on the server and the server is either not replying or sending the replies to the wrong place.
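As a quick sanity check (just a sketch, not a substitute for the logging and tracing described below), the server's rpc.lockd can be probed from the client with rpcinfo; 100021 is the standard NLM program number, and a hang or timeout here would line up with the NULL replies:

# Ask the server to call its NLM (rpc.lockd) service, program 100021,
# over UDP and over TCP; an RPC timeout seen from the client side would
# match the "msg reply = NULL" queue entries above.
rpcinfo -u ux05 100021
rpcinfo -t ux05 100021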

The only way to really get a good read on this would be to enable debug rpc.lockd logging on the client and server *AND* collect a network trace during the problem. The lockd logging will tell us what the respective daemons are doing from a locking perspective and the network trace will tell us if the lock requests and replies are being sent to the correct destinations.
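On HP-UX the packet capture is typically collected with nettl/netfmt. A rough sketch follows; the entity and option names are the usual ones, but double-check the man pages before relying on them:

# Start an IP-layer trace on both client and server (file name is only an example)
nettl -traceon pduin pduout -entity ns_ls_ip -file /var/tmp/lockd_trace
# ... reproduce the hang, then stop the trace ...
nettl -traceoff -entity ns_ls_ip
# Format the raw trace file into readable text
netfmt -f /var/tmp/lockd_trace.TRC0 > /var/tmp/lockd_trace.txt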

You have the correct method of turning on rpc.lockd logging on the 11.23 system (obviously since you sent in debug logging data), but on 11.31 systems it is controlled by a kctune parameter:

# kctune klm_log_level=9

This will enable debug rpc.lockd logging to the console buffer and /var/adm/syslog/syslog.log file.
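Once you have collected what you need, the tunable can be set back (0 is assumed here to be the quiet default):

# kctune klm_log_level=0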

As for the -C option, there's no harm in adding that to the LOCKD_OPTIONS string of the /etc/rc.config.d/nfsconf file on the 11.23 system. This option doesn't exist on 11.31, so only modify the 11.23 system. That way, when rpc.lockd is restarted on the 11.23 system, it will have the proper option.
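A sketch of that change on the 11.23 client (the manual restart shown is just one way to bounce the daemons; restarting rpc.lockd discards existing lock state, so do it in a quiet window):

# In /etc/rc.config.d/nfsconf on the 11.23 client set:
#   LOCKD_OPTIONS="-C"
# Then stop rpc.statd and rpc.lockd by your usual method and start them
# again so the option takes effect (statd before lockd):
/usr/sbin/rpc.statd
/usr/sbin/rpc.lockd -C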

Hope this helps,

Dave
I work for HPE

[Any personal opinions expressed are mine, and not official statements on behalf of Hewlett Packard Enterprise]
singh sanjeev
Trusted Contributor

Re: NFS - lockd hanging

Please check your /etc/fstab and how the NFS file system entries are made. If flexible settings are not used and a network problem occurs, the mount can hang.

Can you show us the NFS entries in the fstab on your system?
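For reference, an NFS mount entry in /etc/fstab on HP-UX generally looks something like the line below (server, export, mount point, and options here are purely illustrative):

# Illustrative NFS fstab entry: device  mountpoint  type  options  backup  pass
ux05:/harp  /harp_nfs  nfs  rw,hard,intr  0 0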
Sanjeev Singh
D Heindel
Occasional Advisor

Re: NFS - lockd hanging

Unfortunately this occurred again today. I was able to turn on logging for the client and server. I attempted to use nettl to collect network data, but I am not sure whether it contains useful data, and it does contain some sensitive info. Let me know if seeing this data would be useful and I will try to filter that out, or whether we should be using a different tool to perform the network trace.

The attached log files are from a time period where the application was hanging. In this situation the server ux01 is the client and ux05 the host (although in some cases ux05 acts as the client and ux01 the server; we have not had problems with that scenario). Again, the client (ux01) is on 11.23 and the server (ux05) on 11.31. Is there anything informative to anyone in these files? I didn't have any luck matching up requests and responses.

Below are the fstab files from both the client and server. I don't see anything in there about the NFS-mounted file systems.

Server:
/dev/vg00/lvol3 / vxfs delaylog 0 1
/dev/vg00/lvol1 /stand vxfs tranflush 0 1
/dev/vg00/lvol4 /tmp vxfs delaylog 0 2
/dev/vg00/lvol5 /home vxfs delaylog 0 2
/dev/vg00/lvol6 /opt vxfs delaylog 0 2
/dev/vg00/lvol7 /usr vxfs delaylog 0 2
/dev/vg00/lvol8 /var vxfs delaylog 0 2
/dev/vg01/lvolharp /harp vxfs defaults 0 2
/dev/vg01/lvolux05 /ux05 vxfs defaults 0 2
/dev/vg01/lvolbkp /bkp vxfs defaults 0 2
/dev/vg01/lvoletl /etl vxfs defaults 0 2
/dev/vg01/lvolpas /pas vxfs defaults 0 2

Client:
/dev/vg00/lvol3 / vxfs delaylog 0 1
/dev/vg00/lvol1 /stand vxfs tranflush 0 1
/dev/vg00/lvol4 /tmp vxfs delaylog 0 2
/dev/vg00/lvol5 /home vxfs delaylog 0 2
/dev/vg00/lvol6 /opt vxfs delaylog 0 2
/dev/vg00/lvol7 /usr vxfs delaylog 0 2
/dev/vg00/lvol8 /var vxfs delaylog 0 2
/dev/vg00/lvol9 /harp vxfs delaylog,datainlog,largefiles,rw,suid 0 2
/dev/vg00/lvol10 /pas vxfs delaylog,datainlog,largefiles,rw,suid 0 2
/dev/vg01/lvol1 /san vxfs rw,suid,largefiles,delaylog,datainlog 0 2
/dev/vg00/lvol11 /QDX vxfs rw,suid,largefiles,delaylog,datainlog 0 2
/dev/vg00/lvol12 /doc vxfs rw,suid,nolargefiles,delaylog,datainlog 0 2
/dev/vg00/lvol13 /clock vxfs rw,suid,nolargefiles,delaylog,datainlog 0 2
/dev/vg00/lvswap ... swap pri=0 0 0
/dev/vg03/lvol1 /harptmp vxfs rw,suid,largefiles,delaylog,datainlog 0 2
/dev/vg02/lvol1 /bkp vxfs rw,suid,largefiles,delaylog,datainlog 0 2

D Heindel
Occasional Advisor

Re: NFS - lockd hanging

One thing I do notice, now that NFS locking is working again, is that there are server records in syslog.log on the host server (ux05). An example is below. Am I correct in assuming that the absence of these messages in the log file when there are problems indicates that the requests are not making it to the server?

Jun 4 12:58:57 ux05 vmunix: KLM:ab03 nlm_disp4: **** SERVER REQUEST START ****
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 nlm_disp4: vers= 4, proc= 7
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 savexprt: lm_xprt= e0000001e0c35280 thread= e00000017cdc5280 xprt= e0000001e1376600 next= e0000001e0c35200
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 getconfig: fp= e0000001e02e5780
Jun 4 12:58:57 ux05 vmunix:
Jun 4 12:58:57 ux05 above message repeats 3 times
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 getconfig: ln= e0000001e0812080
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 get_sysid: config= e0000001e0812090 addr= e0000001e1376630 name= ux01 alloc= TRUE
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 block_lck4: exclusive= 1, reclaim= 0, state= 0
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 block_lck4: xid= 1275865999
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 block_lck4: caller= ux01, svid= 10663, offset= 0, len= 0
Jun 4 12:58:57 ux05 vmunix: fh= 40 1 0 2 0 0 0 A 0 A 0 0 0 0 44 89 68 50 79 6D 0 A 0 0 0 0 0 2 0 0 49 E0
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 alk2flk4: type= 2, start= 0, len= 0, pid= 10663
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 get_vnode4: cnt= 1, vp= e0000001e6850d00, v_cnt= 1
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 block_lck4: sysid = 0x2, pid = 10663
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 getxprt: lx= e0000001e0c35280
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 get_me: addr = ux05.
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 get_sysid: config= 9fffffff7f4e76f0 addr= 9fffffff7f4e76e0 name= me alloc= TRUE
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 sm_server: client= ux01, sysid= 2, sm_server= 1, me= e0000001e0c5b180
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 getxprt: lx= e0000001e0c35280
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 block_lck4: before calling VOP_FRLOCK:
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 block_lck4: type= 2, start= 0, len= 0, pid= 10663 sysid= 2
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 block_lck4: lbi: ls= e0000001e0c36180 cb= 12 nr.stat= 0 nr.cookie= 311229987
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 getxprt: lx= e0000001e0c35280
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 send_reply4: ls= e0c36180 cb= 12 xdrres= 17e67c0 nr.stat= 0 nr.cookie= 311229987
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 send_reply4: doing callback 12
Jun 4 12:58:57 ux05 vmunix:
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 getxprt: lx= e0000001e0c35280
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 callrpc: Calling [100021, 4, 12] on 'ux01' (2) via 'udp'
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 get_client: Found lc= e0000001e0c68380, lc_old= e0000001e0c8d280, timout= 0
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 get_client: addr= 0 2 D8 20 A D8 C 3A 0 0 0 0 0 0 0 0 (16)
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 get_client: End: lc= e0000001e0c68380 status= 0, time= 4c0930a2, client= e0000001e0c7a300, clients= 11
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 callrpc: addr= 0 2 D8 20 A D8 C 3A 0 0 0 0 0 0 0 0 (16)
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 callrpc: RPC_TIMEDOUT
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 callrpc: RPC stat= 5 error= 0
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 rel_clien: addr = (e0000001e0c37800, 16 16)
Jun 4 12:58:57 ux05 vmunix:
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 callrpc: End: error= 0, tries= 4, tot= 105875, bad= 0
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 relxprt: lx= e0000001e0c35280
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 nlm_disp4: End: error= 0, tot= 79349, bad= 0
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 nlm_disp4: **** SERVER REQUEST END ****

Dave Olker
HPE Pro

Re: NFS - lockd hanging

Remember from my previous post:

> # kctune klm_log_level=9
>
> This will enable debug rpc.lockd logging
> to the console buffer and
> /var/adm/syslog/syslog.log file.

This is the KLM subsystem logging the lock request processing to syslog as I indicated. When klm_log_level is set to 9 you should see this kind of information whenever the system is dealing with NFS file locks. If you don't see this stuff in the syslog then the system is currently not doing any NFS file locking.
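One quick way to confirm whether the server side is processing incoming lock requests is to search syslog for the dispatcher markers shown in your excerpt, for example:

# Count server-side NLM dispatch entries logged by KLM; zero hits while
# clients are blocked would suggest lock requests are not reaching (or not
# being handled by) this server's lock manager.
grep -c "SERVER REQUEST START" /var/adm/syslog/syslog.log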

Does this mean things are working now?

Dave
I work for HPE

[Any personal opinions expressed are mine, and not official statements on behalf of Hewlett Packard Enterprise]
D Heindel
Occasional Advisor

Re: NFS - lockd hanging

I did use that method for enabling logging and there were messages in syslog, but in my uneducated interpretation of the log file they all appeared to be messages where the server was acting as the client instead of the host.

Rebooting the system did fix the problem. It was only after the reboot that I began seeing messages in syslog where this server appeared to be the host, managing lock requests from the client.

Hope this makes some sense.

Thanks,
Derrick
Dave Olker
HPE Pro

Re: NFS - lockd hanging

Ok, let us know if the problem reoccurs after the reboot. Hopefully if it does you'll have the data needed to troubleshoot it.

Dave
I work for HPE

[Any personal opinions expressed are mine, and not official statements on behalf of Hewlett Packard Enterprise]