
Lost in Las NFS

 
Andreas Fassl
Frequent Advisor

Lost in Las NFS

hi,

After several hours I have tracked down a weird problem, but I can't find a final solution.

- Several Tru64 systems (5.1A) use NFS-mounted home directories (via NIS)
- Suddenly the login "hangs" (a recent change of the internal domain name is my current guess at the cause)
- This only happens to users whose login shell is ksh

I browsed all related articles in this forum and did a lot of searching on Google. I tested every hint I found:
- showmount
- rpcinfo
- netstat -ai
- netstat -rn

Finally I checked the login process with ps from a privileged account: the process is hanging in "lockcntl". Using lockcntl as a keyword on Google gives only a few hits, most of them related to NFS problems.

So I created some more local users with their home directories on several different NFS shares (different systems, OSes and versions). Result: the login hangs.
Using a locally attached drive: no problem.

Have you got any hint what to check next? Or has one of you got the magic rabbit?
Any help is greatly appreciated.

With kind regards

Andreas
11 REPLIES
Joris Denayer
Respected Contributor

Re: Lost in Las NFS

Hi Andreas,

First: this occurs when ksh wants to take a lock on your history file. As the Bourne shell doesn't keep a history file, it doesn't happen there.

You should check whether rpc.lockd and rpc.statd are running on the NFS server.

You could also post the output of the following command, executed on the NFS server:
# rpcinfo -p

Joris
To err is human, but to really foul things up requires a computer
Andreas Fassl
Frequent Advisor

Re: Lost in Las NFS

Hi,

thanks for the quick response. rpcinfo -p gives me correct answers for all three NFS servers.

The servers are
- NAN01 (a Network Appliance Alpha-based system)
- NAN03 (a Network Appliance NFS server)
- decbb11 (a Tru64 5.1A NFS server on an ES40)

The weird thing: they all work without any problem, but ksh obviously has a problem.
I think I'll set up tcpdump on the test system and analyze the IP traffic. Anything in particular to look for? Or is there another way to analyze a process hanging in the "lockctl" state?

With kind regards

Andreas
Johan Brusche
Honored Contributor

Re: Lost in Las NFS


Andreas,

When you do a "mount -l -t nfs", do you see only the hostnames of the servers, or their fully qualified names? If fully qualified, is it the old or the new domain name?
Are the NFS servers in /etc/hosts mapped to the new domain?
Is /etc/resolv.conf correctly configured for the transition from the old to the new domain name?

JB.

_JB_
Andreas Fassl
Frequent Advisor

Re: Lost in Las NFS

Hi,

I'll test this (mount -l -t nfs) on Monday.

I'm quite sure that all the hassle is due to the untested domain name change. Yes, the site mostly uses unqualified host names.

Maybe I'll see more on Monday. Have you got some more hints? The file system IS mounted; it can be accessed without any problem.
Ralf Puchner
Honored Contributor

Re: Lost in Las NFS

Have you checked NFS locking? Have you installed the current patch kits?
Help() { FirstReadManual(urgently); Go_to_it;; }
Andreas Fassl
Frequent Advisor

Re: Lost in Las NFS

Hi,

did some more analysis (thanks for the hints so far).

I'm now able to reproduce the problem on any client on-site.

Conditions:
- the user account has its home directory on an NFS share
- ksh is the login shell
Nothing else matters.

The box I used for some more testing is a 5.1B system with the ECO1 patch kit installed.

I installed and configured TCPDUMP for some more testing.

Here is some output:
decn02.b600de6a > nan01.nfs-v3: 116 call getattr fh 1563813.121416221.32.2782664448
nan01.b600de6a > decn02.nfs-v3: 112 reply getattr {dir size 94208 mtime 1079975652.929602000 ctime 1079975652.929602000}
decn02.b700de6a > nan01.nfs-v3: 120 call access fh 1563813.121416221.32.2782664448 want: lookup
nan01.b700de6a > decn02.nfs-v3: 120 reply access {dir size 94208 mtime 1079975652.929602000 ctime 1079975652.929602000} permitted: lookup
decn02.b800de6a > nan01.nfs-v3: 120 call access fh 1563813.121416221.32.2782664448 want: lookup
nan01.b800de6a > decn02.nfs-v3: 120 reply access {dir size 94208 mtime 1079975652.929602000 ctime 1079975652.929602000} permitted: lookup
decn02.b900de6a > nan01.nfs-v3: 128 call lookup { fh 1563813.121416221.32.2782664448 ".profile"}
nan01.b900de6a > decn02.nfs-v3: 116 reply failed, status No such file or directory: lookup dir {dir size 94208 mtime 1079975652.929602000 ctime 1079975652.929602000}

(This happens during login on the share; there is no .profile in the directory, but that isn't the problem.)

I found another interesting message exchange:

decn02.65982e02 > nan01.pmap-v2: 56 call getport prog "nlm" V4 prot UDP port 0 (DF)
nan01.65982e02 > decn02.pmap-v2: 28 reply getport 690
decn02.66982e02 > nan01.nlm-v4: 164 call lock_msg cookie 0x84 noblock,excl lock: {"decn02.xxx.de" svid 396160 l_offset 0 l_len 0}
not-reclaim state 0 (DF)

I translate this as:
- DECN02 asks NAN01 for a port for program nlm (the NFS lock manager)
- NAN01 answers and offers port 690
- DECN02 sends the lock request (cookie 0x84) but gets no answer.

I asked the guys to check these things:
1) Does NAN01 resolve the hostname correctly?
2) Is there a firewall between the two systems?

Have you got another hint?

Thanks in advance.

The output of rpcinfo -p decn02 and rpcinfo -p nan01 looks okay; both systems are running the lock manager. But I'm quite sure the problem isn't in the box itself, because the systems have been running for months without a problem.

Regards

Andreas
Ralf Puchner
Honored Contributor

Re: Lost in Las NFS

The shell in use does NFS locking, and this seems to be the problem... it is a known issue. That is why I asked whether NFS locking really works....

Help() { FirstReadManual(urgently); Go_to_it;; }
Andreas Fassl
Frequent Advisor

Re: Lost in Las NFS

Hi,

I've found a very old article at
http://www.faqs.org/faqs/sgi/faq/apps/section-26.html
from 1995.

Date: 15 Oct 1995 00:00:01 EST

ksh(1) uses a single ~/.sh_history file for all of a given user's ksh
processes, so must be able to lock that file. Locking is robust for
local files but not over NFS. Install patch 547 (or its successor) to
fix some known NFS bugs and be sure lockd is 'chkconfig'ed on and
rpc.lockd and rpc.statd are actually running. If all else fails, set
the HISTFILE environment variable to a file on a local disk.

We'll try the HISTFILE workaround and, in parallel, analyze the locking behaviour.

To check the correct behaviour I'll look into the daemon.log files on the NFS server and compare the rpcinfo output of client and server; correct? Any other logfiles to check?

Regards

Andreas

Relevant Cases I've found are:
http://forums1.itrc.hp.com/service/forums/questionanswer.do?threadId=296907

Will this work, too?

http://www4.itrc.hp.com/service/cki/docDisplay.do?docLocale=en_US&admit=-938907319+1080037159406+28353475&docId=200000063201877

Ralf Puchner
Honored Contributor

Re: Lost in Las NFS

Check whether NFS locking is properly configured on each machine, and monitor daemon.log on those machines for any problems....
Help() { FirstReadManual(urgently); Go_to_it;; }
Andreas Fassl
Frequent Advisor

Re: Lost in Las NFS

Solution found. Ralf's hint wasn't the solution itself, but it pointed me to the real problem. As I guessed, it was the name server.
Correct resolution in both directions (forward lookup and reverse lookup) is very important for the lock manager. In this case the reverse lookup of the NFS server's IP address returned a different name. That easy.

Thanks a lot for all the hints.

Regards

Andreas
Ralf Puchner
Honored Contributor

Re: Lost in Las NFS

Andreas,

Checking locking also means looking into the lock directory and checking the hostnames stored as file names there (in the sm directory). If there are unqualified hostnames in it, you have found the reason for the malfunction.

During that check you will also find any name resolution problems, because name resolution is part of the checks.....

Help() { FirstReadManual(urgently); Go_to_it;; }