Locking issue on NFS

 
Rikki hinn Ogurlegi
Frequent Advisor

I've got an 11.23 server and also an 11.23 client.
Both client and server have the latest ARPA, NFS, RPC, LAN, NIS and dependency patches installed.

branda# swlist BUNDLE
# Initializing...
# Contacting target "branda"...
#
# Target: branda:/
#

# BUNDLE B.2009.03.25 Patch Bundle
BUNDLE.PHCO_38273 1.0 libc cumulative patch
BUNDLE.PHSS_39093 1.0 linker + fdp cumulative patch
BUNDLE.PHSS_38526 1.0 Aries cumulative patch
BUNDLE.PHSS_37958 1.0 LIBCL patch
BUNDLE.PHNE_38973 1.0 CacheFS cumulative patch
BUNDLE.PHNE_38906 1.0 libnsl cumulative patch
BUNDLE.PHNE_38905 1.0 Lock Manager cumulative patch
BUNDLE.PHNE_38904 1.0 AutoFS cumulative patch
BUNDLE.PHNE_38503 1.0 Core NFS cumulative patch
BUNDLE.PHNE_38379 1.0 libnss_dns manpage patch
BUNDLE.PHNE_38254 1.0 NFS kernel tunables patch
BUNDLE.PHNE_38252 1.0 NFS cumulative patch
BUNDLE.PHNE_37897 1.0 cumulative ARPA Transport patch
BUNDLE.PHNE_37490 1.0 NIS/NIS+ cumulative patch
BUNDLE.PHNE_37487 1.0 Kernel RPC cumulative patch
BUNDLE.PHNE_36981 1.0 RPC commands and daemons cumulative patch
BUNDLE.PHNE_36839 1.0 LAN cumulative patch
BUNDLE.PHNE_36575 1.0 Cumulative STREAMS Patch
BUNDLE.PHNE_36215 1.0 libnss_dns DNS backend patch
BUNDLE.PHKL_39356 1.0 ttrace and thread cumulative patch
BUNDLE.PHKL_39349 1.0 VM dma32 memory leak fix,scpool fix
BUNDLE.PHKL_39348 1.0 page cache synchronization and pfdats fix
BUNDLE.PHKL_39343 1.0 VM mmap(2) performance fix, memory leak
BUNDLE.PHKL_39283 1.0 pthread_cond_timedwait,hires timers,callout
BUNDLE.PHKL_39277 1.0 Cumulative kernel SCSI patch
BUNDLE.PHKL_39176 1.0 Kernel libsec cumulative patch
BUNDLE.PHKL_39128 1.0 Lockf Patch
BUNDLE.PHKL_38926 1.0 GIO cumulative patch
BUNDLE.PHKL_38915 1.0 VxFS transaction patch
BUNDLE.PHKL_38902 1.0 JFS3.5 DIO performance; extent rollback
BUNDLE.PHKL_38797 1.0 Panic or hang on Integrity systems
BUNDLE.PHKL_38714 1.0 VxFS 3.5 cumulative patch
BUNDLE.PHKL_38702 1.0 SCTP system call;Itanium-2 and PA-64 Support
BUNDLE.PHKL_38599 1.0 Dynamic Buffer Cache patch
BUNDLE.PHKL_38598 1.0 Buffer Cache patch
BUNDLE.PHKL_38508 1.0 vx_vget
BUNDLE.PHKL_38415 1.0 VM vhand fix
BUNDLE.PHKL_38288 1.0 File descriptor management; voncelocked fix
BUNDLE.PHKL_38287 1.0 VxFS 3.5 mount: Quota;logiosize
BUNDLE.PHKL_37272 1.0 hires timers SANOPROCSIG
BUNDLE.PHKL_37263 1.0 Changes in pm-svc for core cpu id
BUNDLE.PHKL_37261 1.0 Changes in vm-asi for core cpu id
BUNDLE.PHKL_37106 1.0 WSIO IO subsystem cumulative patch
BUNDLE.PHKL_36745 1.0 LVM Cumulative Patch
BUNDLE.PHKL_36577 1.0 PM-PSTAT section 2 manpage changes
BUNDLE.PHKL_36103 1.0 wsio.h header file cumulative patch
BUNDLE.PHKL_35243 1.0 VxFS 3.5 VFS destacking support;DIO hang;bdf
BUNDLE.PHKL_35242 1.0 JFS3.5 bmap performance improvement
BUNDLE.PHKL_35240 1.0 s700_800 11.23 rwsleep locks
BUNDLE.PHKL_34357 1.0 timer interval, hires timers
BUNDLE.PHKL_33990 1.0 VxFS 3.5 : Quota metadata corruption
BUNDLE.PHKL_33930 1.0 pstat maxmem fix with CLM and pfdats fix
BUNDLE.PHKL_31500 1.0 Sept04 base patch
BUNDLE.PHCO_38717 1.0 LVM commands patch


This is on top of the December 2008 Quality Pack bundles.


There is one application running on the client that sometimes works and sometimes doesn't. We have the source code for this application, so I was able to track down why it hangs in an unkillable state: it is due to locking.

I wrote a small program to demonstrate:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main (int argc, char *argv[])
{
    int fd;
    struct flock lock;

    if (argc != 2) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        exit(1);
    }

    /* Open read/write: an exclusive (write) lock requires write access. */
    fd = open(argv[1], O_RDWR);
    if (fd == -1) {
        printf("Could not open file: %s, errno: %d\n", argv[1], errno);
        exit(1);
    }

    printf("File successfully opened.\n");

    /* Exclusive lock on the whole file (l_len = 0 means "through end of
       file"). F_SETLKW waits until the lock can be granted. */
    printf("Setting an exclusive lock.\n");
    lock.l_type = F_WRLCK;
    lock.l_whence = SEEK_SET;
    lock.l_start = 0;
    lock.l_len = 0;
    if (fcntl(fd, F_SETLKW, &lock) == -1) {
        printf("Error returned by fcntl [F_SETLKW], errno: %d\n", errno);
        exit(1);
    }
    printf("Exclusive lock OK.\n");

    /* Release the same whole-file range. */
    printf("Releasing lock.\n");
    lock.l_type = F_UNLCK;
    lock.l_whence = SEEK_SET;
    lock.l_start = 0;
    lock.l_len = 0;
    if (fcntl(fd, F_SETLK, &lock) == -1) {
        printf("Error returned by fcntl [F_UNLCK], errno: %d\n", errno);
        exit(1);
    }
    printf("Exclusive lock Released.\n");

    close(fd);
    exit(0);
}

Here is what happens... On local disks:

glokollur# ll /tmp/testfile
/tmp/testfile not found
glokollur# touch /tmp/testfile
glokollur# ./locktest /tmp/testfile
File successfully opened.
Setting an exclusive lock.
Exclusive lock OK.
Releasing lock.
Exclusive lock Released.
glokollur#

Now on NFS:

glokollur# mount | grep sgog
/mnt/og on sgog.orkugardur.is:/export/og soft,nodevs,rsize=32768,wsize=32768,NFSv3,dev=16e on Tue May 5 08:34:49 2009

glokollur# ll /og/HP-UX/nfstest/testfile
/og/HP-UX/nfstest/testfile not found
glokollur# touch /og/HP-UX/nfstest/testfile
glokollur# ./locktest /og/HP-UX/nfstest/testfile
File successfully opened.
Setting an exclusive lock.

Here it hangs. It doesn't respond to kill -9 until after a few minutes.
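While reproducing a hang like this, it can help if the test program gives up instead of blocking forever. Below is a minimal sketch in C, assuming a POSIX system: it arms alarm() before an F_SETLKW request and installs a SIGALRM handler without SA_RESTART, so a request that is still blocked when the alarm fires should return -1 with errno set to EINTR. The names lock_whole_file_timed and unlock_whole_file are hypothetical. Because the hung process above ignored kill -9 for minutes, the kernel may be sleeping uninterruptibly inside the NFS lock call, in which case the alarm would not break the wait either; the sketch only helps when the wait is interruptible.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Empty handler: its only job is to interrupt a blocked fcntl(). */
static void on_alarm(int sig)
{
    (void)sig;
}

/* Try to take an exclusive lock on the whole file, giving up after
 * `seconds`. Returns 0 if the lock was granted, 1 if the wait timed out,
 * -1 on any other error. */
int lock_whole_file_timed(int fd, unsigned int seconds)
{
    struct sigaction sa, old_sa;
    struct flock lk;
    int rc, saved_errno;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;            /* no SA_RESTART, so fcntl() fails with EINTR */
    if (sigaction(SIGALRM, &sa, &old_sa) == -1)
        return -1;

    memset(&lk, 0, sizeof lk);
    lk.l_type = F_WRLCK;
    lk.l_whence = SEEK_SET;
    lk.l_start = 0;
    lk.l_len = 0;               /* 0 = through end of file */

    alarm(seconds);
    rc = fcntl(fd, F_SETLKW, &lk);
    saved_errno = errno;
    alarm(0);                   /* cancel the alarm if it has not fired */
    sigaction(SIGALRM, &old_sa, NULL);

    if (rc != -1)
        return 0;
    return saved_errno == EINTR ? 1 : -1;
}

/* Release a whole-file lock taken with lock_whole_file_timed(). */
int unlock_whole_file(int fd)
{
    struct flock lk;

    memset(&lk, 0, sizeof lk);
    lk.l_type = F_UNLCK;
    lk.l_whence = SEEK_SET;
    lk.l_start = 0;
    lk.l_len = 0;
    return fcntl(fd, F_SETLK, &lk) == -1 ? -1 : 0;
}
```

A small race remains: if the alarm fires before fcntl() actually starts waiting, the call can still block indefinitely. For a diagnostic with a timeout of a few seconds that is usually acceptable.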

The parameter used on the fcntl call is F_SETLKW. The man page states:

F_SETLKW This cmd is the same as F_SETLK except that if a
read or write lock is blocked by other locks, the
process will sleep until the segment is free to be
locked.

The file didn't even exist until seconds before the initial lock attempt, so no previously existing locks could have been held on it.
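One way to test the "no existing locks" assumption from the client is to ask with F_GETLK before requesting the lock. F_GETLK fills in the first lock that would block the requested one, or sets l_type to F_UNLCK if nothing would. A minimal C sketch follows; probe_write_lock is a hypothetical helper name. Two caveats, stated as assumptions: on an NFSv3 mount the query is presumably forwarded to the server's lock manager just like F_SETLKW, so a wedged rpc.lockd could make this call hang too (itself a useful data point), and for a lock held by a process on another host the reported l_pid may not correspond to any local process.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>   /* pid_t */

/* Ask whether an exclusive lock on the whole file would be blocked,
 * without actually taking it.
 * Returns 0 if nothing conflicts, 1 if a conflicting lock is reported
 * (its owner's pid is stored in *holder), -1 if fcntl() fails. */
int probe_write_lock(int fd, pid_t *holder)
{
    struct flock lk;

    memset(&lk, 0, sizeof lk);
    lk.l_type = F_WRLCK;        /* "could I get a write lock?" */
    lk.l_whence = SEEK_SET;
    lk.l_start = 0;
    lk.l_len = 0;               /* 0 = through end of file */

    if (fcntl(fd, F_GETLK, &lk) == -1)
        return -1;
    if (lk.l_type == F_UNLCK)   /* no conflicting lock found */
        return 0;
    if (holder != NULL)
        *holder = lk.l_pid;     /* may be meaningless for a remote owner */
    return 1;
}
```

Note that F_GETLK never reports the calling process's own locks, so the probe has to run from a separate process (for example before launching the real application).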

Sometimes, restarting lockd and statd on the client and/or server works to temporarily fix the problem.

I think I have ruled out all the usual suspects for NFS locking problems. Hostname resolution works for all servers involved.
I've got all servers configured to use /etc/hosts before DNS:

glokollur# grep hosts /etc/nsswitch.conf
hosts: files dns

and the hosts file the servers share has all the hosts listed. All servers are also in DNS.
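Since rpc.lockd and rpc.statd rely partly on hostnames (statd, for instance, keeps track of monitored hosts by name), a forward lookup that succeeds while the reverse lookup returns a different name, or none, is a plausible source of trouble. A minimal C sketch for checking both directions through the system resolver, assuming getaddrinfo()/getnameinfo() follow the "files dns" order from nsswitch.conf on the machine in question; check_host_resolution is a hypothetical name, and restricting it to IPv4 is an assumption.

```c
#define _POSIX_C_SOURCE 200809L
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Forward-resolve `host` (IPv4 only), then reverse-resolve the first
 * address returned, writing the name into `name`.
 * Returns 0 on success, -1 if the forward lookup fails,
 * -2 if the address has no address-to-name entry. */
int check_host_resolution(const char *host, char *name, size_t namelen)
{
    struct addrinfo hints, *res = NULL;
    int rc;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_INET;
    hints.ai_socktype = SOCK_DGRAM;   /* one entry per address, not per protocol */

    if (getaddrinfo(host, NULL, &hints, &res) != 0)
        return -1;

    /* NI_NAMEREQD: fail instead of falling back to the numeric address. */
    rc = getnameinfo(res->ai_addr, res->ai_addrlen,
                     name, (socklen_t)namelen, NULL, 0, NI_NAMEREQD);
    freeaddrinfo(res);
    return rc == 0 ? 0 : -2;
}
```

Run on both client and server for each name involved (for example the server name exactly as it appears in the mount table), the reverse name returned should be the one the other side expects.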

Can anyone suggest why I'm having this problem?
Rikki hinn Ogurlegi
Frequent Advisor

Re: Locking issue on NFS

I forgot to mention that I'm running lockd and statd on the client with -l (logging) and there are no errors in either log.
Steven E. Protter
Exalted Contributor

Re: Locking issue on NFS

Shalom,

NFS below version 4, which is only available for HP-UX 11.31, has a locking mechanism that I'd describe as imperfect.

If you have Glance/GPM installed, I would check the system high-water marks for nfile and other parameters.

Locks are probably not being released properly and the system is running out. That would explain why restarting the lockd daemon seems to help.

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
Rikki hinn Ogurlegi
Frequent Advisor

Re: Locking issue on NFS

Glance is indeed on both servers. nflocks, nfile and so on are far from full (in the single digit percentages).

How reliable is NFSv4 file locking, SEP? Can I count on it to fix my problem? It would be a waste to spend all that effort upgrading only to end up in the same place as before.

Also, wasn't there an ONC+ package available?
Dave Olker
Neighborhood Moderator
Solution

Re: Locking issue on NFS

Before we start jumping to upgrading OSes and NFS versions, let's start with some basic troubleshooting.

The first thing to collect on both systems is a debug rpc.lockd logfile while you're reproducing the problem. On the NFS client and server issue this command:

# ps -ef | grep rpc.lockd
# kill -17 <rpc.lockd PID>

That will turn on debug logging. Then run your test program to reproduce the file lock hang, and send another kill -17 to the same rpc.lockd PIDs to toggle logging off.
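The kill -17 step relies on the daemon treating a signal as an on/off switch for debug logging (signal 17 is SIGUSR2 on HP-UX). The rpc.lockd source isn't available here, so the following C sketch is only a generic illustration of that toggle pattern, not HP's implementation; install_debug_toggle and debug_logging_enabled are hypothetical names. The handler just flips a volatile sig_atomic_t flag, which keeps it async-signal-safe; the daemon's main loop would consult the flag before writing debug output.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Hypothetical illustration of a signal-driven logging toggle; this is not
 * rpc.lockd's code. Only the handler writes this flag. */
static volatile sig_atomic_t debug_logging = 0;

/* Each delivery of the signal flips logging on or off. */
static void toggle_debug(int sig)
{
    (void)sig;
    debug_logging = !debug_logging;
}

/* Install toggle_debug() for `signo` (e.g. SIGUSR2). Returns 0 or -1. */
int install_debug_toggle(int signo)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = toggle_debug;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;   /* don't disturb the daemon's blocking calls */
    return sigaction(signo, &sa, NULL);
}

/* The main loop would check this before each debug log write. */
int debug_logging_enabled(void)
{
    return debug_logging != 0;
}
```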

The logging data is written to the /var/adm/rpc.lockd.log file. You can either post the client and server log files here to the ITRC or you can send them to me (dave.olker@hp.com) and I'll take a quick look at what they indicate.

Regards,

Dave


I work at HPE
HPE Support Center offers support for your HPE services and products when and how you need it. Get started with HPE Support Center today.
[Any personal opinions expressed are mine, and not official statements on behalf of Hewlett Packard Enterprise]
Rikki hinn Ogurlegi
Frequent Advisor

Re: Locking issue on NFS

Dave, I just sent you the logs you requested via email. They were a bit large for the forums, I think.
Dave Olker
Neighborhood Moderator

Re: Locking issue on NFS

Hi Richard,

I looked at the data. This looks to be a server-side problem because the client is merely sending a lock request and never hearing back from the server. Unfortunately, by the time logging is started on the server side things are already in a wedged state.

My suggestion would be to terminate the rpc.lockd and rpc.statd daemons on both systems and re-start the daemons with logging enabled so I can see the entire history of events leading up to the failure. It would also help to have a network trace of the failing lock request.

So, what you'd want to do is:

1. Terminate the running rpc.statd and rpc.lockd

# kill $(ps -e | grep rpc.lockd | awk '{print $1}')
# kill $(ps -e | grep rpc.statd | awk '{print $1}')

2. Restart rpc.statd and rpc.lockd with Debug Logging Enabled

# /usr/sbin/rpc.statd -d3 -l /var/adm/rpc.statd.log &
# /usr/sbin/rpc.lockd -s3 -d3 -l /var/adm/rpc.lockd.log &

3. Wait 50 seconds for the rpc.lockd grace period to expire

4. Enable nettl tracing

# nettl -tn pduin pduout -e ns_ls_ip -s 512 -tm 99999 -f lcktrace

5. Reproduce the problem

6. Once you have successfully reproduced the problem, disable nettl tracing

# nettl -tf -e all

NOTE: This will create at least one file called lcktrace.TRC000. Depending on the network traffic at the time of the reproduction, there could also be a second file called lcktrace.TRC001. Be sure to check for both files.

7. Send the running rpc.lockd/rpc.statd daemons the SIGUSR2 signal to disable debug logging

# kill -SIGUSR2 $(ps -e | grep rpc.statd | awk '{print $1}')
# kill -SIGUSR2 $(ps -e | grep rpc.lockd | awk '{print $1}')


By the end you should have rpc.lockd and rpc.statd debug logfiles from both the NFS client and server, as well as a network trace. If you can collect all of this and send it to me, I'll have a look.

Dave


