Operating System - HP-UX

rpc.lockd and "-C" option

 
Pyers Symon
Advisor

rpc.lockd and "-C" option

In Dave Olker's NFS book, an incompatibility between HP's rpc.lockd and the NLMs derived from Sun's code is discussed. (Broadly speaking, a deadlock could occur due to the misreading of an acknowledgement of a lock-cancel message.) In the book Olker states that in a mixed Unix environment, rpc.lockd should be started with the -C option. This is the ONLY reference to this problem I have seen... What is the current position?
14 REPLIES
Peter Godron
Honored Contributor

Re: rpc.lockd and "-C" option

Pyers,
how good is your Taiwanese?

Seems there is a doc:
http://www1.itrc.hp.com/service/cki/docDisplay.do?docLocale=zh_TW&docId=200000075006795

4000033988A
HP-UX NFS: rpc.lockd -C option required for nlm_cancel requests

Linked from:
http://www.hp.com.tw/ssn/unix/0506/unix050606.asp

Perhaps somebody with Americas/Asia Pacific access can post it?

Dave Olker
Neighborhood Moderator

Re: rpc.lockd and "-C" option

Hi Pyers,

I might know something about this issue...

I'm not sure what you mean by the "current position". Are you asking if this is still a problem? Are you asking for more details about what the -C option does and when you should use it? Are you asking if we have plans to change this behavior?

Please clarify your question and I'll be happy to answer it.

Regards,

Dave


Pyers Symon
Advisor

Re: rpc.lockd and "-C" option

Only the best for this thread! We use a number of 11.11 & 11.23 boxes accessing data from a bunch of NetApp filers. I am in discussion with colleagues who are trying to get information out of NetApp as to the etymology of their NFS. Until that happens, is it better to fail safe and use the -C flag or carry on as we are?

Also, what would be the symptoms of a repeated lock request as you describe? Application hang?

My comment about the current position is whether the -C flag is now the default with HP's rpc.lockd (which, to be fair, I doubt)?

P.S. Great book (although it was tricky to find in the UK!)
Dave Olker
Neighborhood Moderator
Solution

Re: rpc.lockd and "-C" option

Hi Pyers,

> Only the best for this thread!

HA HA HA!! Nice of you to say. :)


> We use a number of 11.11 & 11.23 boxes
> accessing data from a bunch of NetApp
> filers. I am in discussion with
> colleagues who are trying to get
> information out of NetApp as to the
> etymology of their NFS. Until that
> happens, is it better to fail safe and
> use the -C flag or carry on as we are?

I am not certain of the etymology of NetApp's NFS implementation (i.e. whether they got it from Sun or wrote it themselves), but I would be absolutely shocked if they, or any other vendor, had the same bug in their code that we managed to add to ours. :(

To be clear, this is a bug in HP's implementation. We coded the rpc.lockd procedure incorrectly in that we misinterpreted a reply code from the server.

The problem happens when competing processes vie for a lock on the same file. If processA grabs the lock, and processB tries to get the lock and blocks waiting for processA to release it, but then processB gets tired of waiting and tries to cancel the blocked lock request - that's when the trouble starts.

The HP-UX client sends an NLM_CANCEL_MSG to the server, the server correctly cancels the lock and sends back a reply with status of 0 - as it should. The problem is our client misinterprets the 0 reply to mean "I didn't cancel your lock" so it immediately sends another NLM_CANCEL_MSG. The server replies with another 0 saying "I already cancelled your stupid lock, now leave me alone". The client sees the 0 and says "Crap, it still hasn't cancelled my lock. I'd better send another NLM_CANCEL_MSG". This goes on for some time until someone takes pity on the application or rpc.lockd and shoots them in the head.


> Also what would be the symptoms of a
> repeated lock request as you describe?
> Application hang?

Yes. The application that is trying to cancel the blocked lock will hang while rpc.lockd sends NLM_CANCEL_MSG requests in an endless attempt to get rid of the lock - all the time not realizing the lock was cancelled the first time it sent the request.


> My comment about the current position is
> whether the -C flag is now the default
> with HP's rpc.lockd (which to be fair I
> doubt)?

As of the current 11.23 code we are still using the "incorrect" behavior by default and only do the "correct" thing when the -C option is used. We chose to do this so as not to introduce an incompatibility with our own systems by changing behavior mid-release.

We are planning on changing over to the correct behavior at 11.31 (i.e. 11i v3). What this means is that any system running 11.31 and interoperating with pre-11.31 systems will potentially see problems unless those pre-11.31 systems are using the -C option.

Again, we didn't want to introduce a situation where things were working fine before you update to a newer HP-UX release and then suddenly you start seeing application hangs because we fixed a behavioral problem with rpc.lockd. However, enough time has passed that we feel it's better to do the right thing by default and clearly document the fact that we're changing it.
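
If you want to confirm which behavior a given client is currently using, one quick check (just a sketch, not an official procedure) is to look at the arguments the running daemon was started with:

# Look for -C in the rpc.lockd command line; the [r] keeps grep from matching itself
ps -ef | grep '[r]pc.lockd'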


> P.S. Great book (although it was tricky to
> find in the UK!)

I'm glad you found the book helpful. :)

Feel free to contact me if you ever have any other questions about HP's NFS offering. My email is listed prominently in my forum profile.

Regards,

Dave


Mike Nass
Occasional Advisor

Re: rpc.lockd and "-C" option

I believe I am having this problem as well; I'm running the NFS client on HP-UX 11.23 connecting to a Linux NFS server (Red Hat ES3).

Where exactly do you put the -C parameter? In the /etc/rc.config.d/nfsconf file, as a LOCKD_OPTIONS entry?

Thank you for the well-described condition, by the way.
Dave Olker
Neighborhood Moderator

Re: rpc.lockd and "-C" option

Hi Mike,

Yes, you would put this option in the LOCKD_OPTIONS line of the /etc/rc.config.d/nfsconf file, as follows:

LOCKD_OPTIONS="-C"

This will cause rpc.lockd to start with this option the next time the system is booted. However, it won't affect the currently running rpc.lockd daemon. If you want to get the new behavior now, I'd suggest the following:

1) kill rpc.statd and rpc.lockd
2) /usr/sbin/rpc.statd
3) /usr/sbin/rpc.lockd -C

You always want to start rpc.statd before starting rpc.lockd. If you do this quickly you should see minimal disruption of service from rpc.lockd.
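
Putting the whole sequence together, a minimal sketch might look like this (the ps/grep lines are just one illustrative way to find the PIDs; substitute the actual PIDs on your system):

# Find the PIDs of the running daemons ([r] keeps grep from matching itself)
ps -ef | grep '[r]pc.statd'
ps -ef | grep '[r]pc.lockd'

# Stop both, then restart statd first and lockd second with -C
kill <statd_pid> <lockd_pid>
/usr/sbin/rpc.statd
/usr/sbin/rpc.lockd -C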

Hope this helps. Let me know if you have any other questions about rpc.lockd or anything else NFS.

Regards,

Dave


Mike Nass
Occasional Advisor

Re: rpc.lockd and "-C" option

Hi Dave, this really looked like my issue, but the -C option did not seem to help. Let me explain a little further.

I have two HP-UX 11.23 servers using the same NFS mount on a RHEL ES3 server. This had been working fine until recently when, while working with HP support on another issue, they suggested installing the latest 11.23 patches; that went fine but did not solve my issue. Then it was suggested that the PHCO_37228 patch be installed.

Since installing the 37228 patch, and even after uninstalling it, I'm now having problems with the NFS mount, specifically with rpc.lockd.

Whichever server touches it first seems to win, and then the other server can't get any locks at all.

I would appreciate any suggestion at this point.

Thanks.

Mike
Dave Olker
Neighborhood Moderator

Re: rpc.lockd and "-C" option

Hi Mike,

It sounds like you're speaking in code here. Can you be specific about the exact problem you're having and why HP Support suggested a libc patch to resolve it?

Also I don't understand your explanation of the current problem:
_________________________

Whichever server touches it first seems to win, and then the other server can't get any locks at all.
_________________________


What does this mean? Are the two servers running the same application, and whichever system starts the application first gets the locks? I don't quite understand what the problem is.

If you can explain exactly what is happening (actual syntax or error messages are helpful) then I can try to point you in the right direction to find the root cause.

Regards,

Dave


Mike Nass
Occasional Advisor

Re: rpc.lockd and "-C" option

Sorry Dave, I've been fighting this for about 3 days now. I'll try to be more specific.

We are running an ERP system (written with Cognos PowerHouse tools) on the 2 HP-UX servers, which need to share a couple of NFS mounts in order to run, basically like a cluster. The system started having problems with getcwd returning an incorrect response after the latest 11.23 patches were installed. Then the HP tech suggested the 37228 patch. We were not having any NFS problems at the time, but the libc patch was specifically for getcwd problems.

Both 11.23 systems need access to some common files, one of which is temporarily locked when a user logs in. Since it is only a brief lock, this has not been a problem.

I am not seeing any errors in either the syslog or the lockd log file. But whichever of the two ERP systems first touches the common file that is locked during login can continue to lock the file as users come and go, and the other server can no longer get a lock on that file at all.

Some of the specifics:
(files on the HPUX servers)
-- /etc/rc.config.d/nfsconf file --
NFS_CLIENT=1
NFS_SERVER=0
NUM_NFSD=0
NUM_NFSIOD=4
PCNFS_SERVER=0
LOCKD_OPTIONS="-C -l /var/adm/rpc.lockd.log"
STATD_OPTIONS="-l /var/adm/rpc.statd.log"
MOUNTD_OPTIONS="-l /var/adm/rpc.mountd.log"
START_MOUNTD=1
AUTO_MASTER="/etc/auto_master"
AUTOFS=1
AUTOMOUNT_OPTIONS="-t 900 -v"
AUTOMOUNTD_OPTIONS="-Tv"

-- /etc/auto_master file --
/.auto /etc/auto_fs

-- /etc/auto_fs file --
prod -nosuid,soft fileserver:/shared/prod

(File on rhel es3 server)
-- /etc/exports file --
/shared/prod 10.0.20.0/24(rw,insecure,insecure_locks,no_root_squash,sync,no_wdelay,anonuid=998,anongid=651)
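
As a sanity check (a sketch; "fileserver" is the host name from the auto_fs map above), the active export options can be confirmed from both sides:

# On the RHEL server: list the current exports with their options
exportfs -v

# From each HP-UX client: confirm the export is visible
/usr/sbin/showmount -e fileserver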

This has been the same setup for some time now without issue, so I'm a little stumped as to what has stopped working.


Mike
Dave Olker
Neighborhood Moderator

Re: rpc.lockd and "-C" option

Hi Mike,

> Both 11.23 systems need access to some common files,
> one of which is temporarily locked when a user logs in.
> Since it is only a brief lock, this has not been a problem.

How do you know it is only a "brief" lock? Do you know when the lock is released? If so, how?

You obviously have an HP support contract - or at least I assume you do, since you mentioned HP support told you to install the libc patch - so what does HP support tell you to do in this case? Have they made any recommendations to resolve the problem?

Were I troubleshooting the problem, the next data I'd want to see is a debug rpc.lockd logfile collected on both HP-UX clients, as well as some kind of rpc.lockd output from the server. I have no idea if Red Hat is able to do any debug logging of the rpc.lockd daemon, but you can enable debug logging of rpc.lockd on the HP-UX systems by sending a SIGUSR2 signal to the running rpc.lockd daemon.

My suggestion would be as follows:

1) kill -SIGUSR2 on one HP-UX system
2) kill -SIGUSR2 on the other HP-UX system
3) reproduce the problem where one client can get the lock and then release it but the other client can't get the lock
4) kill -SIGUSR2 on one HP-UX system
5) kill -SIGUSR2 on the other HP-UX system

The SIGUSR2 signal is a toggle, so the second time you send this signal it will turn off the debug logging. I'd then examine the two debug log files to see if there is anything in those log files pointing to the problem. I'd especially be curious if the 1st NFS client is really releasing all the locks on the shared file.
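
As a concrete example of the toggle (a sketch; the ps/grep pipeline is just one way to find the PID, and some shells spell the signal USR2 rather than SIGUSR2):

# Find the rpc.lockd PID ([r] keeps grep from matching itself)
PID=$(ps -ef | grep '[r]pc.lockd' | awk '{print $2}')

kill -s USR2 $PID # first signal: debug logging on
# ... reproduce the locking failure ...
kill -s USR2 $PID # second signal: debug logging off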

Regards,

Dave


Mike Nass
Occasional Advisor

Re: rpc.lockd and "-C" option

Hi Dave, I started a new case (3600299574) with HP support about 7 hours ago, but have not heard back yet.

I'm afraid I do not have any evidence of "how brief" the lock is; only that it was working fine with 300-400 users coming and going all day, so I was assuming it was brief.

Toggling the debug trigger to see the rpc.lockd output yielded interesting results, but nothing that means much to me.

Examples:
----- Working Server -----
*********** Toggle Trace on *************
12.09 23:01:44 mpcapp1 pid=742 /usr/sbin/rpc.lockd
LOCKD QUEUES:
***** granted reclocks *****
(4005f790), oh=mpcapp1 2445, svr=mpc6,
fh=010000020008000b81ec2600b62d2700d385ddd0612c2700,
op=6, range=[805308813, 805308814], client=mpcapp1, cookie=775d4351
(4005f6e0), oh=mpcapp1 2673, svr=mpc6,
fh=010000020008000b81ec2600b62d2700d385ddd0612c2700,
op=6, range=[805309041, 805309042], client=mpcapp1, cookie=59f1e385
(4005f840), oh=mpcapp1 3070, svr=mpc6,
fh=010000020008000b81ec2600b62d2700d385ddd0612c2700,
op=6, range=[805309438, 805309439], client=mpcapp1, cookie=7c11834c
(4005f8f0), oh=mpcapp1 4082, svr=mpc6,
fh=010000020008000b81ec2600b62d2700d385ddd0612c2700,
op=6, range=[805310450, 805310451], client=mpcapp1, cookie=56a88e67
(4005fb00), oh=mpcapp1 5102, svr=mpc6,
fh=010000020008000b81ec2600b62d2700d385ddd0612c2700,
op=6, range=[805311470, 805311471], client=mpcapp1, cookie=11808ef6
(4005fdc0), oh=mpcapp1 5305, svr=mpc6,
fh=010000020008000b81ec2600b62d2700d385ddd0612c2700,
op=6, range=[805311673, 805311674], client=mpcapp1, cookie=42154fc3
(4005fd10), oh=mpcapp1 7988, svr=mpc6,
fh=010000020008000b81ec2600b62d2700d385ddd0612c2700,
op=6, range=[805314356, 805314357], client=mpcapp1, cookie=427392c1
(4005ff20), oh=mpcapp1 8247, svr=mpc6,
fh=010000020008000b81ec2600b62d2700d385ddd0612c2700,
op=6, range=[805314615, 805314616], client=mpcapp1, cookie=3265714a
..... lots of this .....
***** blocked reclocks *****
(4005f9a0), oh=mpcapp1 2673, svr=mpc6,
fh=010000020008000b81ec2600b62d2700d385ddd0612c2700,
op=2, range=[804257792, 804323328], client=mpcapp1, cookie=6d550d5a
(4005fa50), oh=mpcapp1 3070, svr=mpc6,
fh=010000020008000b81ec2600b62d2700d385ddd0612c2700,
op=2, range=[804257792, 804323328], client=mpcapp1, cookie=384b9058
(4005fbb0), oh=mpcapp1 4082, svr=mpc6,
fh=010000020008000b81ec2600b62d2700d385ddd0612c2700,
op=2, range=[804257792, 804323328], client=mpcapp1, cookie=50b59949
(4005fe70), oh=mpcapp1 13386, svr=mpc6,
fh=010000020008000b81ec2600b62d2700d385ddd0612c2700,
op=2, range=[804257792, 804323328], client=mpcapp1, cookie=2cd139cf
used_le=36, used_fe=5, used_me=2
12.09 23:01:52 mpcapp1 pid=742 /usr/sbin/rpc.lockd

alarm! enter xtimer:
12.09 23:01:52 mpcapp1 pid=742 /usr/sbin/rpc.lockd
xtimer retransmit:
12.09 23:01:52 mpcapp1 pid=742 /usr/sbin/rpc.lockd
nlm4_call to (mpc6, 7) op=2, (804257792, 65536); retran = 1, valid = 0
12.09 23:01:52 mpcapp1 pid=742 /usr/sbin/rpc.lockd
call_udp[mpc6, 100021, 4, 7] returns 0
12.09 23:01:52 mpcapp1 pid=742 /usr/sbin/rpc.lockd
xtimer retransmit:
12.09 23:01:52 mpcapp1 pid=742 /usr/sbin/rpc.lockd
nlm4_call to (mpc6, 7) op=2, (804257792, 65536); retran = 1, valid = 0
12.09 23:01:52 mpcapp1 pid=742 /usr/sbin/rpc.lockd
call_udp[mpc6, 100021, 4, 7] returns 0
12.09 23:01:52 mpcapp1 pid=742 /usr/sbin/rpc.lockd
klm_msg = 4005fe70
12.09 23:01:52 mpcapp1 pid=742 /usr/sbin/rpc.lockd
xtimer reply to (4005fe70):
12.09 23:01:52 mpcapp1 pid=742 /usr/sbin/rpc.lockd
klm_reply: stat=3

And lots of stuff like that. Any idea how I can make heads or tails of what this detail is telling me?

Thanks,


mike
Dave Olker
Neighborhood Moderator

Re: rpc.lockd and "-C" option

Hi Mike,

Let's take one of the rpc.lockd log entries and I'll give you a crash course on interpreting the contents.

(4005f790), oh=mpcapp1 2445, svr=mpc6,
fh=010000020008000b81ec2600b62d2700d385ddd0612c2700,
op=6, range=[805308813, 805308814], client=mpcapp1, cookie=775d4351


Breaks down as follows:

(4005f790) - Location in memory where lock is held - not really important for this case

oh=mpcapp1 2445 - Owner handle. This consists of the name of the client holding the lock and the process ID of the application on the client holding the lock. So if you were to log into "mpcapp1" and look for process id "2445" that would be the pid holding this lock.

svr=mpc6 - Server. The NFS server holding the lock on this file.

fh=010000020008000b81ec2600b62d2700d385ddd0612c2700 - File Handle. This is the blob of data the server returned to the client when the client asked to access this file. If you break down the individual fields of the file handle you will usually find the device ID of the disk device housing the filesystem as well as the inode number of the file being locked. You can use this information to figure out which file this lock is really for.

op=6 - Lock Flags. This is a combination of the following:

#define LOCK_SH 1
#define LOCK_EX 2
#define LOCK_NB 4

So an op=6 would be LOCK_EX and LOCK_NB, meaning a non-blocking exclusive lock.

range=[805308813, 805308814] - Lock Range. This is the starting and ending offset of the lock in the file. So in this case the starting offset is 805308813 and the range is 1 byte (i.e. ending offset = 805308814).

client=mpcapp1 - Lock Client. The name of the NFS client holding the lock.

cookie=775d4351 - Lock Cookie. This is a unique identifier for this lock.


With this in mind, the log file, or at least the snippet you posted, shows this client has lots of locks in the granted queue for the same file (file handle is the same for all the locks in this snippet) at different offsets and these locks are held by different processes on the client.

There are also several locks in the blocked lock queue, meaning these are locks that cannot be granted because another process is already holding the requested lock. This could mean another process on the local client already has the lock, or a process on a different NFS client has the lock.

In any case, this logging data is very useful if you know how to interpret it, but you need the information from the failing NFS client system and the NFS server in order to piece together why the lock on the failing client is being rejected.
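
For instance, two quick follow-ups based on the fields above (a sketch; the PID comes from the owner-handle example, while the path and inode number are hypothetical placeholders, not values decoded from this log):

# On the client named in the owner handle: identify the process holding the lock
ps -f -p 2445

# On the server, once you've extracted the inode from the file handle:
# search the exported filesystem for the file with that inode
find /exported_fs -xdev -inum 123456 -print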

Regards,

Dave


Mike Nass
Occasional Advisor

Re: rpc.lockd and "-C" option

Dave, thank you for all your good information. It turned out that once I got the "-C" option added to the LOCKD_OPTIONS for both of my HP-UX 11.23 servers that were accessing a Linux NFS server, everything smoothed out. Ahhhh, happy production users again.

Hopefully I'll never have to dig too deep into NFS locking, but at least I'll have a clue where to start if I need to.

Thanks again.

mike
Pete Randall
Outstanding Contributor

Re: rpc.lockd and "-C" option

Mike,

Suggestion: Next time, start a new thread of your own so that you can properly reward Dave with the points he deserves.


Pete