
NFS - lockd hanging

SOLVED
D Heindel
Occasional Advisor

NFS - lockd hanging

Twice over the last week our application has locked up while locking files on a remote system accessed via NFS. The NFS client server (running the application) is on 11.23, the NFS host server is on 11.31. When this behavior occurs, the lockd process on the NFS host becomes unresponsive. It does not respond to any kill commands (including kill -17 and kill -9), so I am unable to turn on logging to get any info about the cause of the problem, and the only way to resolve the issue is to reboot the NFS host server.

When this occurred yesterday I did turn on logging on the client server. The results are below (there was a lot of repetition in the file, so I tried to extract the unique information). Logging was only turned on for a few minutes and the file is huge (~16MB), so I am hesitant to turn on logging full time on the NFS server. But I can, if that will be the only way to diagnose the problem should it happen again.

This post (http://forums11.itrc.hp.com/service/forums/questionanswer.do?admit=109447626+1273299695512+28353475&threadId=1052530) references using the -C option for lockd to solve interoperability problems between 11.23 and 11.31 systems. I did turn this on yesterday (by stopping and restarting the statd and lockd processes on the client server) when the problem occurred, but it did not fix it (I am hoping that is just because the lockd process on the host server was already hosed). Do the problems we are having sound like they could be caused by the -C option not being turned on? Are there other settings we should look at? Is it impossible to tell without logging from the host server?

Additional Facts:
• We have had this setup between the servers up and running for around 4 months. Everything was fine until a week ago. We did add additional users about 10 days before the issue first occurred, which roughly doubled the number of connections between the servers.
• The NFS client server has two network cards, so the traffic from the client may be coming from two different IP addresses.

Any help would be much appreciated. Let me know if any additional information would be useful. Thanks.
14 REPLIES
Steven E. Protter
Exalted Contributor

Re: NFS - lockd hanging

Shalom,

HP-UX 11.31, which includes NFSv4, does include a more robust locking mechanism.

This appears to be a user volume issue.

>>
We have had this setup between the servers up and running for around 4 months. Everything was fine until a week ago. We did add additional users a 10 days before the issue first occurred. This roughly doubled the number of connections between the servers.
<<

I would check the back-end server for high-water marks with glance/gpm: make sure there are enough file handles, and that maxuprc allows enough processes for a single user (perhaps the application user is hitting this limit).
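Those limits can be sketched as a rough check without Glance. This is illustrative only: kctune is the 11.31 interface (kmtune on 11.23), tunable names vary by release, and the ps idiom assumes the XPG4 behavior enabled by UNIX95.

```shell
# Illustrative only -- check kctune(1M) and ps(1) on your release.
kctune maxuprc                 # max processes per non-root user
kctune nproc                   # system-wide process table size

# Count processes per user to see who is approaching maxuprc:
UNIX95= ps -e -o user= | sort | uniq -c | sort -rn | head
```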

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
D Heindel
Occasional Advisor

Re: NFS - lockd hanging

Unfortunately we do not have Glance on the server. Are you concluding that this is a user-volume issue based on the fact that we recently increased the number of users on the system, or is something in the log file indicating that?
Dave Olker
HPE Pro
Solution

Re: NFS - lockd hanging

The logging data you collected from the 11.23 client is not enough to tell me what the problem is. The only thing I can say with any degree of certainty looking at this data is that the rpc.lockd on the NFS server is either not replying to the client's lock requests or is replying to the wrong port number on the client because the client's rpc.lockd never gets the reply.

Here's the only real interesting thing in this logfile:
________________________________________

LOCKD QUEUES:
***** granted reclocks *****
***** msg queue *****
(4012a350), oh=ux01 29103, svr=ux05.quadax.net,
fh=400100020000000a000a000000003900673da64f000a000000000002000049e0,
op=2, range=[0, 2199023255552], client=ux01, cookie=5462873b
proc= 7, (msg reply = NULL)

(4012a2a0), oh=ux01 20030, svr=ux05.quadax.net,
fh=400100020000000a000a00000000203924ad37fb000a000000000002000049e0,
op=2, range=[0, 2199023255552], client=ux01, cookie=7d4dd4bf
proc= 7, (msg reply = NULL)

(4012a1f0), oh=ux01 15103, svr=ux05.quadax.net,
fh=400100020000000a000a00000000390119b4ad0a000a000000000002000049e0,
op=2, range=[0, 2199023255552], client=ux01, cookie=6a96ceb6
proc= 7, (msg reply = NULL)

(40129e80), oh=ux01 20967, svr=ux05.quadax.net,
fh=400100020000000a000a000000003c6d37b6ae37000a000000000002000049e0,
op=2, range=[0, 2199023255552], client=ux01, cookie=7a7d7082
proc= 7, (msg reply = NULL)
...


The first couple lines show me there are currently no locks being held by this client because the Granted queue is empty. The Message queue has many entries with a message reply status of NULL. This means the client has sent all these lock requests (and we know they are lock requests because the PROC= 7 maps to NLM_LOCK_MSG in the rpc.lockd protocol) to the server and never received a reply (reply = NULL).

So the client is trying desperately to lock files on the server and the server is either not replying or sending the replies to the wrong place.
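For a large logfile, stalled entries like the ones above can be counted mechanically. Here is a small illustrative Python sketch (not an HP tool); the proc-number table is a subset of the NLM protocol, just enough for this dump:

```python
import re

# Subset of NLM procedure numbers relevant to this dump (7 = NLM_LOCK_MSG).
NLM_PROCS = {7: "NLM_LOCK_MSG"}

# Trimmed example in the same shape as the rpc.lockd msg-queue dump above.
SAMPLE = """\
(4012a350), oh=ux01 29103, svr=ux05.quadax.net,
op=2, range=[0, 2199023255552], client=ux01, cookie=5462873b
proc= 7, (msg reply = NULL)
(4012a2a0), oh=ux01 20030, svr=ux05.quadax.net,
op=2, range=[0, 2199023255552], client=ux01, cookie=7d4dd4bf
proc= 7, (msg reply = NULL)
"""

def unanswered_locks(text):
    """Return (cookie, proc name) for each entry whose msg reply is NULL."""
    entries, cookie = [], None
    for line in text.splitlines():
        m = re.search(r"cookie=([0-9a-f]+)", line)
        if m:
            cookie = m.group(1)
            continue
        m = re.search(r"proc=\s*(\d+),\s*\(msg reply = NULL\)", line)
        if m and cookie is not None:
            entries.append((cookie, NLM_PROCS.get(int(m.group(1)), m.group(1))))
            cookie = None
    return entries

print(unanswered_locks(SAMPLE))
# -> [('5462873b', 'NLM_LOCK_MSG'), ('7d4dd4bf', 'NLM_LOCK_MSG')]
```

Every entry surfacing with a NULL reply is a lock request still waiting on the server, which is exactly the symptom described.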

The only way to really get a good read on this would be to enable debug rpc.lockd logging on the client and server *AND* collect a network trace during the problem. The lockd logging will tell us what the respective daemons are doing from a locking perspective and the network trace will tell us if the lock requests and replies are being sent to the correct destinations.

You have the correct method of turning on rpc.lockd logging on the 11.23 system (obviously since you sent in debug logging data), but on 11.31 systems it is controlled by a kctune parameter:

# kctune klm_log_level=9

This will enable debug rpc.lockd logging to the console buffer and /var/adm/syslog/syslog.log file.

As for the -C option, there's no harm in adding that to the LOCKD_OPTIONS string of the /etc/rc.config.d/nfsconf file on the 11.23 system. This option doesn't exist on 11.31, so only modify the 11.23 system. That way when rpc.lockd is re-started on the 11.23 system it will have the proper option.
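A minimal sketch of that change on the 11.23 side (assuming the stock nfsconf layout; the daemon paths shown are typical HP-UX locations but worth verifying on your system):

```shell
# /etc/rc.config.d/nfsconf on the 11.23 client only:
LOCKD_OPTIONS="-C"

# Then restart the lock daemons so the option takes effect, e.g. after
# stopping rpc.statd and rpc.lockd:
/usr/sbin/rpc.statd
/usr/sbin/rpc.lockd -C
```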

Hope this helps,

Dave
singh sanjeev
Trusted Contributor

Re: NFS - lockd hanging

Please check your /etc/fstab to see how the NFS file system entries are made. If flexible mount options are not used and a network problem occurs, the mount can hang.

Can you show us the NFS entries from the fstab on your system?
Sanjeev Singh
D Heindel
Occasional Advisor

Re: NFS - lockd hanging

Unfortunately this occurred again today. I was able to turn on logging for the client and server. I attempted to use nettl to collect network data, but I am not sure if it contains useful data, and it does contain some sensitive info. Let me know if seeing this data would be useful and I will try to filter it out, or if we should be using a different tool to perform the network trace.

The attached log files are from a time period when the application was hanging. In this situation the server ux01 is the client and ux05 the host (although in some cases ux05 acts as the client and ux01 the server; we have not had problems with that scenario). Again, the client (ux01) is on 11.23 and the host (ux05) on 11.31. Is there anything informative to anyone in these files? I didn't have any luck matching up requests and responses.

Below are the fstab files from both the client and host. I don't see anything in there about the NFS-mounted file systems.

Server:
/dev/vg00/lvol3 / vxfs delaylog 0 1
/dev/vg00/lvol1 /stand vxfs tranflush 0 1
/dev/vg00/lvol4 /tmp vxfs delaylog 0 2
/dev/vg00/lvol5 /home vxfs delaylog 0 2
/dev/vg00/lvol6 /opt vxfs delaylog 0 2
/dev/vg00/lvol7 /usr vxfs delaylog 0 2
/dev/vg00/lvol8 /var vxfs delaylog 0 2
/dev/vg01/lvolharp /harp vxfs defaults 0 2
/dev/vg01/lvolux05 /ux05 vxfs defaults 0 2
/dev/vg01/lvolbkp /bkp vxfs defaults 0 2
/dev/vg01/lvoletl /etl vxfs defaults 0 2
/dev/vg01/lvolpas /pas vxfs defaults 0 2

Client:
/dev/vg00/lvol3 / vxfs delaylog 0 1
/dev/vg00/lvol1 /stand vxfs tranflush 0 1
/dev/vg00/lvol4 /tmp vxfs delaylog 0 2
/dev/vg00/lvol5 /home vxfs delaylog 0 2
/dev/vg00/lvol6 /opt vxfs delaylog 0 2
/dev/vg00/lvol7 /usr vxfs delaylog 0 2
/dev/vg00/lvol8 /var vxfs delaylog 0 2
/dev/vg00/lvol9 /harp vxfs delaylog,datainlog,largefiles,rw,suid 0 2
/dev/vg00/lvol10 /pas vxfs delaylog,datainlog,largefiles,rw,suid 0 2
/dev/vg01/lvol1 /san vxfs rw,suid,largefiles,delaylog,datainlog 0 2
/dev/vg00/lvol11 /QDX vxfs rw,suid,largefiles,delaylog,datainlog 0 2
/dev/vg00/lvol12 /doc vxfs rw,suid,nolargefiles,delaylog,datainlog 0 2
/dev/vg00/lvol13 /clock vxfs rw,suid,nolargefiles,delaylog,datainlog 0 2
/dev/vg00/lvswap ... swap pri=0 0 0
/dev/vg03/lvol1 /harptmp vxfs rw,suid,largefiles,delaylog,datainlog 0 2
/dev/vg02/lvol1 /bkp vxfs rw,suid,largefiles,delaylog,datainlog 0 2

D Heindel
Occasional Advisor

Re: NFS - lockd hanging

One thing I do notice, now that the NFS locking is working again, is that there are server records in syslog.log on the host server (ux05). Example below. Am I correct in assuming that the absence of these messages in the log file when there are problems indicates that the requests are not making it to the server?

Jun 4 12:58:57 ux05 vmunix: KLM:ab03 nlm_disp4: **** SERVER REQUEST START ****
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 nlm_disp4: vers= 4, proc= 7
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 savexprt: lm_xprt= e0000001e0c35280 thread= e00000017cdc5280 xprt= e0000001e1376600 next= e0000001e0c35200
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 getconfig: fp= e0000001e02e5780
Jun 4 12:58:57 ux05 vmunix:
Jun 4 12:58:57 ux05 above message repeats 3 times
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 getconfig: ln= e0000001e0812080
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 get_sysid: config= e0000001e0812090 addr= e0000001e1376630 name= ux01 alloc= TRUE
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 block_lck4: exclusive= 1, reclaim= 0, state= 0
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 block_lck4: xid= 1275865999
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 block_lck4: caller= ux01, svid= 10663, offset= 0, len= 0
Jun 4 12:58:57 ux05 vmunix: fh= 40 1 0 2 0 0 0 A 0 A 0 0 0 0 44 89 68 50 79 6D 0 A 0 0 0 0 0 2 0 0 49 E0
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 alk2flk4: type= 2, start= 0, len= 0, pid= 10663
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 get_vnode4: cnt= 1, vp= e0000001e6850d00, v_cnt= 1
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 block_lck4: sysid = 0x2, pid = 10663
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 getxprt: lx= e0000001e0c35280
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 get_me: addr = ux05.
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 get_sysid: config= 9fffffff7f4e76f0 addr= 9fffffff7f4e76e0 name= me alloc= TRUE
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 sm_server: client= ux01, sysid= 2, sm_server= 1, me= e0000001e0c5b180
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 getxprt: lx= e0000001e0c35280
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 block_lck4: before calling VOP_FRLOCK:
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 block_lck4: type= 2, start= 0, len= 0, pid= 10663 sysid= 2
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 block_lck4: lbi: ls= e0000001e0c36180 cb= 12 nr.stat= 0 nr.cookie= 311229987
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 getxprt: lx= e0000001e0c35280
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 send_reply4: ls= e0c36180 cb= 12 xdrres= 17e67c0 nr.stat= 0 nr.cookie= 311229987
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 send_reply4: doing callback 12
Jun 4 12:58:57 ux05 vmunix:
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 getxprt: lx= e0000001e0c35280
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 callrpc: Calling [100021, 4, 12] on 'ux01' (2) via 'udp'
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 get_client: Found lc= e0000001e0c68380, lc_old= e0000001e0c8d280, timout= 0
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 get_client: addr= 0 2 D8 20 A D8 C 3A 0 0 0 0 0 0 0 0 (16)
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 get_client: End: lc= e0000001e0c68380 status= 0, time= 4c0930a2, client= e0000001e0c7a300, clients= 11
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 callrpc: addr= 0 2 D8 20 A D8 C 3A 0 0 0 0 0 0 0 0 (16)
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 callrpc: RPC_TIMEDOUT
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 callrpc: RPC stat= 5 error= 0
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 rel_clien: addr = (e0000001e0c37800, 16 16)
Jun 4 12:58:57 ux05 vmunix:
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 callrpc: End: error= 0, tries= 4, tot= 105875, bad= 0
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 relxprt: lx= e0000001e0c35280
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 nlm_disp4: End: error= 0, tot= 79349, bad= 0
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 nlm_disp4: **** SERVER REQUEST END ****
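One detail worth flagging in the trace above: the server processes the lock, then its callrpc back to ux01 ends in RPC_TIMEDOUT, i.e. the reply itself is not getting through. A log this size can be scanned for that signature mechanically; a small illustrative Python sketch (not an HP tool):

```python
import re

# Trimmed example in the same shape as the KLM syslog output above.
SAMPLE = """\
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 nlm_disp4: **** SERVER REQUEST START ****
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 callrpc: Calling [100021, 4, 12] on 'ux01' (2) via 'udp'
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 callrpc: RPC_TIMEDOUT
Jun 4 12:58:57 ux05 vmunix: KLM:ab03 nlm_disp4: **** SERVER REQUEST END ****
"""

def timed_out_callbacks(text):
    """Return (target host, '[prog, vers, proc]') for each timed-out callrpc."""
    results, last = [], None
    for line in text.splitlines():
        m = re.search(r"callrpc: Calling \[([^\]]+)\] on '([^']+)'", line)
        if m:
            last = (m.group(2), m.group(1))
        elif "callrpc: RPC_TIMEDOUT" in line and last:
            results.append(last)
            last = None
    return results

print(timed_out_callbacks(SAMPLE))
# -> [('ux01', '100021, 4, 12')]
```

If every SERVER REQUEST block ends in a timed-out callback to the same client address, that points at the reply path (port or interface) rather than the lock processing itself.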

Dave Olker
HPE Pro

Re: NFS - lockd hanging

Remember from my previous post:

> # kctune klm_log_level=9
>
> This will enable debug rpc.lockd logging
> to the console buffer and
> /var/adm/syslog/syslog.log file.

This is the KLM subsystem logging the lock request processing to syslog as I indicated. When klm_log_level is set to 9 you should see this kind of information whenever the system is dealing with NFS file locks. If you don't see this stuff in the syslog then the system is currently not doing any NFS file locking.

Does this mean things are working now?

Dave
D Heindel
Occasional Advisor

Re: NFS - lockd hanging

I did use that method for enabling logging and there were messages in syslog, but in my uneducated interpretation of the log file, they all appeared to be messages where the server was acting as the client rather than the host.

Rebooting the system did fix the problem. It was only after the reboot that I began seeing messages in syslog where this server seemed to be the host, managing lock requests from the client.

Hope this makes some sense.

Thanks,
Derrick
Dave Olker
HPE Pro

Re: NFS - lockd hanging

Ok, let us know if the problem reoccurs after the reboot. Hopefully if it does you'll have the data needed to troubleshoot it.

Dave
D Heindel
Occasional Advisor

Re: NFS - lockd hanging

We did turn on the logging when we began having problems this morning. Are you saying that if logging is already on when the problem first occurs, you would expect to see some event which causes the lockd process to be unresponsive?

I know this is a very difficult (maybe impossible) question to answer, but I would very much like to avoid this issue occurring again. Based on the information shared already do you have any hunch as to what possible causes of the issue could be? Or any ideas for a way to more gracefully recover without needing to reboot?

If the problem does occur again what would be the most useful tool for gathering network trace data? Nettl, tcpdump, other?

Sorry for being so difficult. I VERY much appreciate the help you have given.

Thanks,
Derrick
Dave Olker
HPE Pro

Re: NFS - lockd hanging

Hi Derrick,

> Are you saying that if logging is already
> on when the problem first occurs, you
> would expect to see some event which
> causes the lockd process to be
> unresponsive?

Exactly. The vast majority of the file locking issues I've worked on fall into two buckets:

1) Locking never worked
2) Locking worked for a while then stopped

In cases where locking has never worked, it's pretty easy to figure out by collecting a debug logfile from both sides and reproducing the problem. It's also very easy to reproduce the problem because locking is broken completely.

Cases where locking works for a while then mysteriously stops are much harder to figure out because it's usually not possible to reproduce them at will. In these situations having the historical data from before the problem began is critical as it helps pinpoint when the trouble began and what sequence of events took the system from a working state to a failing state. This helps us know where to focus our investigation.

> I know this is a very difficult (maybe
> impossible) question to answer, but I
> would very much like to avoid this issue
> occurring again. Based on the information
> shared already do you have any hunch as to
> what possible causes of the issue could
> be?

The data you've provided so far is all post-problem, so all I see is the client sending requests and never getting a response. Not enough information for me to determine root cause.

In the 2nd set of data you attached, I do see the client sending some CANCEL messages, so if rpc.lockd on the 11.23 system is not running with the -C option yet, that's definitely the next thing you should enable.

> Or any ideas for a way to more
> gracefully recover without needing to
> reboot?

Everything I've seen so far points to the server not responding. Also, the fact that rebooting the 11.31 system clears the problem up is further justification that this is a server-side issue.

If you enter the following on the 11.31 server:

# /sbin/init.d/lockmgr stop
# /sbin/init.d/lockmgr start

and it doesn't recover gracefully then you may be left with no other recourse than rebooting the server.

> If the problem does occur again what would
> be the most useful tool for gathering
> network trace data? Nettl, tcpdump, other?

Nettl is fine. It captures data in a format that either netfmt or Wireshark can decode.
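For reference, one way to drive a nettl capture around the problem window might look like the following. This is a hedged sketch: the trace-file suffix and entity name are assumptions, so check nettl(1M) and netfmt(1M) on your release before relying on it.

```shell
# Start an inbound/outbound IP-layer trace to a file:
nettl -traceon pduin pduout -entity ns_ls_ip -file /tmp/nfstrace

# ... reproduce the hang, then stop the trace:
nettl -traceoff -entity ns_ls_ip

# Decode with netfmt (the raw .TRC0 file can also be opened in Wireshark):
netfmt -f /tmp/nfstrace.TRC0 > /tmp/nfstrace.txt
```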

Regards,

Dave
D Heindel
Occasional Advisor

Re: NFS - lockd hanging

Makes sense. Thanks for all your help. Hopefully I won't have anything new to post because it won't happen again, but I am not optimistic.

Derrick
D Heindel
Occasional Advisor

Re: NFS - lockd hanging

Because this issue seems to be the result of a buildup of requests, we tried to prevent it from happening by restarting the lockd process on the host server. This was done using the commands:
/sbin/init.d/lockmgr stop
/sbin/init.d/lockmgr start

The statd process died, but the lockd process did not. After this was done we began having the same issue again, where the client could not lock files. The log file did not show anything interesting, other than that SERVER REQUEST messages stopped showing up in the file. It did still contain CLIENT REQUEST messages.

Does the fact that the kill command caused the process to stop functioning as a host (but did not kill the process) provide any insight into what our problem could be?

Last Server request in the log file (the kill command was issued at 22:30):
Jun 8 22:00:27 ux05 vmunix: KLM:a1580 nlm_disp4: **** SERVER REQUEST START ****
Jun 8 22:00:27 ux05 vmunix: KLM:a1580 nlm_disp4: vers= 4, proc= 9
Jun 8 22:00:27 ux05 vmunix: KLM:a1580 getconfig: fp= e0000001e02e5780
Jun 8 22:00:27 ux05 vmunix: KLM:a1580 getconfig: ln= e0000001e0812080
Jun 8 22:00:27 ux05 vmunix:
Jun 8 22:00:27 ux05 above message repeats 3 times
Jun 8 22:00:27 ux05 vmunix: KLM:a1580 get_sysid: config= e0000001e0812090 addr= e0000004861533b0 name= ux01 alloc= TRUE
Jun 8 22:00:27 ux05 vmunix: KLM:a1580 unlock: caller= ux01, svid= 21601, offset= 0, len= 0
Jun 8 22:00:27 ux05 vmunix: fh= 40 1 0 2 0 0 0 A 0 A 0 0 0 0 39 24 26 30 CB D 0 A 0 0 0 0 0 2 0 0 49 E0
Jun 8 22:00:27 ux05 vmunix: KLM:a1580 alk2flk4: type= 3, start= 0, len= 0, pid= 21601
Jun 8 22:00:27 ux05 vmunix: KLM:a1580 get_vnode4: cnt= 1, vp= e0000004590d8200, v_cnt= 1
Jun 8 22:00:27 ux05 vmunix: KLM:a1580 unlock4: nlmid = 0, sysid = 0x2, pid = 21601
Jun 8 22:00:27 ux05 vmunix: KLM:a1580 unlock4: End: stat= 0
Jun 8 22:00:27 ux05 vmunix: KLM:a1580 callrpc: Calling [100021, 4, 14] on 'ux01' (2) via 'udp'
Jun 8 22:00:27 ux05 vmunix: KLM:a1580 get_client: Found lc= e0000001e0c68380, lc_old= e0000001e0c8d280, timout= 0
Jun 8 22:00:27 ux05 vmunix: KLM:a1580 get_client: addr= 0 2 F9 D1 A D8 C 3A 0 0 0 0 0 0 0 0 (16)
Jun 8 22:00:27 ux05 vmunix: KLM:a1580 get_client: End: lc= e0000001e0c68380 status= 0, time= 4c0ef4f7, client= e000000486153980, clients= 14
Jun 8 22:00:27 ux05 vmunix: KLM:a1580 callrpc: addr= 0 2 F9 D1 A D8 C 3A 0 0 0 0 0 0 0 0 (16)
Jun 8 22:00:27 ux05 vmunix: KLM:a1580 callrpc: RPC_TIMEDOUT
Jun 8 22:00:27 ux05 vmunix: KLM:a1580 callrpc: RPC stat= 5 error= 0
Jun 8 22:00:27 ux05 vmunix: KLM:a1580 rel_clien: addr = (e000000120eb7f80, 16 16)
Jun 8 22:00:27 ux05 vmunix:
Jun 8 22:00:27 ux05 vmunix: KLM:a1580 callrpc: End: error= 0, tries= 4, tot= 866334, bad= 0
Jun 8 22:00:27 ux05 vmunix: KLM:a1580 nlm_disp4: End: error= 0, tot= 549893, bad= 0
Jun 8 22:00:27 ux05 vmunix: KLM:a1580 nlm_disp4: **** SERVER REQUEST END ****

Thanks,
Derrick

Dave Olker
HPE Pro

Re: NFS - lockd hanging

> we tried to prevent it by happening by
> restarting the lockd process on the host
> server. This was done using the command:
> /sbin/init.d/lockmgr stop
> /sbin/init.d/lockmgr start
>
> Does the fact that the kill command caused
> the process to stop functioning as a host
> (but did not kill the process) provide any
> insight to what our problem could be to
> anyone?

Not me. As I've said, the best way to figure out why this problem happens is to enable logging on the client and server before the problem occurs and then wait for it to happen - on its own, not by killing daemons. That's your best chance to capture the state of both systems when things are working properly and the event that causes one of them to start misbehaving.

Dave