Operating System - HP-UX

rpc.lockd and "-C" option

 
SOLVED
Pyers Symon
Advisor

rpc.lockd and "-C" option

In Dave Olker's NFS book, an incompatibility between HP's rpc.lockd and the NLMs derived from Sun's code is discussed. (Broadly speaking, a deadlock can occur due to the misreading of an acknowledgement of a lock-cancel message.) In the book Olker states that, if present in a mixed Unix environment, rpc.lockd should be started with the -C option. This is the ONLY reference to this problem I have seen. What is the current position?
14 REPLIES
Peter Godron
Honored Contributor

Re: rpc.lockd and "-C" option

Pyers,
how good is your Taiwanese?

Seems there is a doc:
http://www1.itrc.hp.com/service/cki/docDisplay.do?docLocale=zh_TW&docId=200000075006795

4000033988A
HP-UX NFS: rpc.lockd -C option required for nlm_cancel requests

Linked from:
http://www.hp.com.tw/ssn/unix/0506/unix050606.asp

Perhaps somebody with Americas/Asia Pacific access can post it?

Dave Olker
HPE Pro

Re: rpc.lockd and "-C" option

Hi Pyers,

I might know something about this issue...

I'm not sure what you mean by the "current position". Are you asking if this is still a problem? Are you asking for more details about what the -C option does and when you should use it? Are you asking if we have plans to change this behavior?

Please clarify your question and I'll be happy to answer it.

Regards,

Dave
I work for HPE

[Any personal opinions expressed are mine, and not official statements on behalf of Hewlett Packard Enterprise]
Pyers Symon
Advisor

Re: rpc.lockd and "-C" option

Only the best for this thread! We use a number of 11.11 & 11.23 boxes accessing data from a bunch of NetApps filers. I am in discussion with colleagues who are trying to get information out of NetApps as to the etymology of their NFS. Until that happens, is it better to fail safe and use the -C flag, or carry on as we are? Also, what would be the symptoms of a repeated lock request as you describe? An application hang? My comment about the current position was asking whether the -C flag is now the default with HP's rpc.lockd (which, to be fair, I doubt). PS: Great book (although it was tricky to find in the UK!)
Dave Olker
HPE Pro
Solution

Re: rpc.lockd and "-C" option

Hi Pyers,

> Only the best for this thread!

HA HA HA!! Nice of you to say. :)


> We use a number of 11.11 & 11.23 boxes
> accessing data from a bunch of NetApps
> filers. I am in discussion with
> colleagues who are trying to get
> information out of NetApps as to the
> etymology of their NFS. Until that
> happens, is it better to fail-safe and
> use the -C flag or carry on as we are?

I am not certain of the etymology of NetApp's NFS implementation (i.e. did they get it from Sun or write it themselves) but I would be absolutely shocked if they, or any other vendor, had the same bug in their code that we managed to add to ours. :(

To be clear, this is a bug in HP's implementation. We coded the rpc.lockd procedure incorrectly in that we misinterpreted a reply code from the server.

The problem happens when there are competing processes vying for a blocked lock on a file. If processA grabs the lock and processB tries to get the lock and blocks waiting for processA to release it, but then processB gets tired of waiting and tries to cancel the lock - that's when the problem happens.

The HP-UX client sends an NLM_CANCEL_MSG to the server, the server correctly cancels the lock and sends back a reply with status of 0 - as it should. The problem is our client misinterprets the 0 reply to mean "I didn't cancel your lock" so it immediately sends another NLM_CANCEL_MSG. The server replies with another 0 saying "I already cancelled your stupid lock, now leave me alone". The client sees the 0 and says "Crap, it still hasn't cancelled my lock. I'd better send another NLM_CANCEL_MSG". This goes on for some time until someone takes pity on the application or rpc.lockd and shoots them in the head.


> Also what would be the symptoms of a
> repeated lock request as you describe?
> Application hang?

Yes. The application that is trying to cancel the blocked lock will hang while rpc.lockd sends NLM_CANCEL_MSG requests in an endless attempt to get rid of the lock - all the time not realizing the lock was cancelled the first time it sent the request.


> My comment about the current position is
> whether the -C flag is now the default
> with HP's rpc.lockd (which to be fair I
> doubt)?

As of the current 11.23 code we are still using the "incorrect" behavior by default and only do the "correct" thing when the -C option is used. We chose to do this so as not to introduce an incompatibility with our own systems by changing behavior mid-release.

We are planning on changing over to the correct behavior at 11.31 (i.e. 11i v3). What this means is any system running 11.31 and interoperating with pre-11.31 systems will potentially cause problems unless those pre-11.31 systems are using the -C option.

Again, we didn't want to introduce a situation where things were working fine before you update to a newer HP-UX release and then suddenly you start seeing application hangs because we fixed a behavioral problem with rpc.lockd. However, enough time has passed that we feel it's better to do the right thing by default and clearly document the fact that we're changing it.


> Ps Great book (although it was tricky to
> find in the UK!)

I'm glad you found the book helpful. :)

Feel free to contact me if you ever have any other questions about HP's NFS offering. My email is listed prominently in my forum profile.

Regards,

Dave
Mike Nass
Occasional Advisor

Re: rpc.lockd and "-C" option

I believe I am having this problem as well; I'm running the NFS client on HP-UX 11.23 connecting to a Linux NFS server (Red Hat ES3).

Where exactly do you put the -C parameter? In the /etc/rc.config.d/nfsconf file as LOCKD_OPTIONS?

Thank you for the well-described explanation of the condition, by the way.
Dave Olker
HPE Pro

Re: rpc.lockd and "-C" option

Hi Mike,

Yes, you would put this option in the LOCKD_OPTIONS line of /etc/rc.config.d/nfsconf file, as follows:

LOCKD_OPTIONS="-C"

This will cause rpc.lockd to start with this option the next time the system is booted. However, it won't affect the currently running rpc.lockd daemon. If you want to get the new behavior now, I'd suggest the following:

1) kill rpc.statd and rpc.lockd
2) /usr/sbin/rpc.statd
3) /usr/sbin/rpc.lockd -C

You always want to start rpc.statd before starting rpc.lockd. If you do this quickly you should see minimal disruption of service from rpc.lockd.

Hope this helps. Let me know if you have any other questions about rpc.lockd or anything else NFS.

Regards,

Dave
Mike Nass
Occasional Advisor

Re: rpc.lockd and "-C" option

Hi Dave, this really looked like my issue, but the -C option did not seem to help. Let me explain a little further.

I have two HP-UX 11.23 servers using the same NFS mount on a RHEL ES3 server. This process was working fine until recently when, while working with HP support on another issue, they suggested installing the latest 11.23 patches, which went fine but did not solve my issue. Then it was suggested that the PHCO_37228 patch be installed.

Since installing the 37228 patch, and even after uninstalling it, I'm now having problems with the NFS mount, specifically with rpc.lockd.

Whichever server touches it first seems to win, and then the other server can't get any locks at all.

I would appreciate any suggestion at this point.

Thanks.

Mike
Dave Olker
HPE Pro

Re: rpc.lockd and "-C" option

Hi Mike,

It sounds like you're speaking in code here. Can you be specific about the exact problem you're having and why HP Support suggested a libc patch to resolve it?

Also I don't understand your explanation of the current problem:
_________________________

Whichever server touches it first seems to win, and then the other server can't get any locks at all.
_________________________


What does this mean? Are the two servers running the same application, and whichever system starts the application first gets the locks? I don't quite understand exactly what the problem is.

If you can explain exactly what is happening (actual syntax or error messages are helpful) then I can try to point you in the right direction to find the root cause.

Regards,

Dave
Mike Nass
Occasional Advisor

Re: rpc.lockd and "-C" option

Sorry Dave, I've been fighting this for about 3 days now. I'll try to be more specific.

We are running an ERP system (written with Cognos PowerHouse tools) on the 2 HP-UX servers, which need to share a couple of NFS mounts in order to run, basically like a cluster. The system started having problems with getcwd returning an incorrect response after the latest 11.23 patches were installed. Then the HP tech suggested the 37228 patch. We were not having any NFS problems at the time, but the libc patch was specifically for getcwd problems.

Both 11.23 systems need access to some common files, one of which is temporarily locked when a user logs in. Since it is only a brief lock, this has not been a problem.

I am not seeing any errors in either the syslog or the lockd log file. But whichever of the two ERP systems first touches the common file locked during login can continue to lock the file as users come and go, and the other server can no longer get a lock on that file at all.

Some of the specifics:
(files on the HPUX servers)
-- /etc/rc.config.d/nfsconf file --
NFS_CLIENT=1
NFS_SERVER=0
NUM_NFSD=0
NUM_NFSIOD=4
PCNFS_SERVER=0
LOCKD_OPTIONS="-C -l /var/adm/rpc.lockd.log"
STATD_OPTIONS="-l /var/adm/rpc.statd.log"
MOUNTD_OPTIONS="-l /var/adm/rpc.mountd.log"
START_MOUNTD=1
AUTO_MASTER="/etc/auto_master"
AUTOFS=1
AUTOMOUNT_OPTIONS="-t 900 -v"
AUTOMOUNTD_OPTIONS="-Tv"

-- /etc/auto_master file --
/.auto /etc/auto_fs

-- /etc/auto_fs file --
prod -nosuid,soft fileserver:/shared/prod

(File on rhel es3 server)
-- /etc/exports file --
/shared/prod 10.0.20.0/24(rw,insecure,insecure_locks,no_root_squash,sync,no_wdelay,anonuid=998,anongid=651)

This has been the same setup for some time now without issue, so I'm a little stumped as to what has stopped working.


Mike