

 

NFS monitor script causing package to fail

We have been running version A.11.15.02 of MC/SG with the version A.01.02 nfs/samba toolkit on RH ES 3 release 2 for over 6 months with no problems until recently. Over the last couple of months the nfs.mon script has been unable to detect the rpc.nfsd process (via rpcinfo) to the extent that the monitor script forces the package to halt and switch over. You can see the retry attempts in the hanfs.sh.log file. Does anybody have any idea what is causing this and how we can correct it? We are running our cluster on two HP DL-380 Proliant servers.
Serviceguard for Linux
Honored Contributor

Re: NFS monitor script causing package to fail

I've forwarded your question to the team responsible for the toolkit. In the meantime, can you check the A.01.03 version? I'm not sure whether this applies to the NFS toolkit, but some toolkits that used "ps" and "grep" ran into a problem with the way the "/proc" file system treats the ps command.

If you can look to see if the change between ".02" and ".03" is related to the monitoring, that might speed up resolution of your problem.
John Bigg
Esteemed Contributor

Re: NFS monitor script causing package to fail

If the NFS toolkit or any part of your package uses ps and relies on this to detect processes running then unfortunately it can fail.

On all versions of the Linux kernel I am aware of, the kernel routines that ps uses to traverse the /proc filesystem and return a list of pids can occasionally miss some, so ps may fail to report a process even though it is running. This is especially true on a highly dynamic system where processes start and exit at a high rate.

Therefore, unfortunately, you cannot rely on ps to tell you whether a process is running, even if an individual process is checked using the ps -p option.

The solution is to check for processes by looking at the proc filesystem directly. For example, it is common for toolkits to use code similar to:

pid=`ps $p_pid | grep $PROC | awk '{print $1}'`
if [ -z "$pid" ]; then

or maybe

pid=`ps -p $p_pid | awk '{print $1}'`
if [ -z "$pid" ]; then

These can both fail. Instead this should be replaced by:

grep $PROC /proc/$p_pid/stat >/dev/null
if [ $? -ne 0 ]; then

and you should then not experience any trouble. By going directly to the individual process's proc file entry, we bypass the buggy kernel routines which cause the problem.

I believe the toolkits are being updated to use this method rather than ps, but I do not know whether they have all been done, and I do not know which version you are using.

I suggest checking for ps usage and replacing it with direct proc file checking instead. Be careful how you check the proc filesystem, since you could hit the same kernel defect; however, if you use code similar to that shown above you should be fine.
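The direct check described above could be wrapped in a small helper. This is only a sketch: the function name proc_is_running and the PROC_ROOT variable are made up for illustration and are not part of any toolkit. PROC_ROOT exists only so the check can be pointed at a fake directory tree when testing.

```shell
# Hypothetical helper, not toolkit code. PROC_ROOT defaults to the real /proc.
PROC_ROOT=${PROC_ROOT:-/proc}

# proc_is_running PID NAME
# Returns 0 if /proc/PID/stat exists and names the process as "(NAME)".
proc_is_running() {
    stat_file="$PROC_ROOT/$1/stat"
    # No readable stat entry means the process is gone (or never existed).
    [ -r "$stat_file" ] || return 1
    # The stat line has the form "PID (name) state ..."; match the name literally.
    grep -F -q "($2)" "$stat_file" 2>/dev/null
}
```

A monitor could then call something like "proc_is_running $p_pid $PROC" in place of the ps pipeline and test the return code.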

Re: NFS monitor script causing package to fail

Sorry for the delay in responding. As is typical for most of us, too much work, not enough bodies to do it. Anyway, I installed A.11.15.02 of MC/SG and A.01.03 of the NFS toolkit on a different box, and the nfs.mon script in version A.01.03 doesn't use "ps" anymore. As John mentioned, the newer toolkit uses the following code to return the PID of the named process:

for k in /proc/*
do
    if [ ! -f $k/stat ]; then
        continue
    fi
    pid=`grep "($1)" $k/stat`
    if [ ! -z "$pid" ]; then
        break
    fi
done

I haven't had time to check for any other differences. I'm wondering if it would be safe to just substitute this nfs.mon script (from A.01.03) to see if that corrects the problem. Unfortunately, I don't have the resources/time to set up a test NFS environment.
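For anyone who wants to experiment with the same idea outside the toolkit, here is a rough standalone sketch of a by-name lookup. It is not the A.01.03 code: the function name find_pid_by_name and the PROC_ROOT variable are invented for this example, and limiting the scan to numeric directories is my own choice.

```shell
# Hypothetical sketch, not the toolkit's nfs.mon. PROC_ROOT defaults to /proc.
PROC_ROOT=${PROC_ROOT:-/proc}

# find_pid_by_name NAME
# Prints the PID of the first numeric /proc entry whose stat file contains
# "(NAME)" and returns 0; returns 1 if no entry matches.
find_pid_by_name() {
    for dir in "$PROC_ROOT"/[0-9]*; do
        # Skip entries without a readable stat file (or an unmatched glob).
        [ -r "$dir/stat" ] || continue
        # The process may exit between the test and the grep; ignore errors.
        if grep -F -q "($1)" "$dir/stat" 2>/dev/null; then
            echo "${dir##*/}"
            return 0
        fi
    done
    return 1
}
```

Calling "find_pid_by_name nfsd" prints a PID if one is found and returns non-zero otherwise, so the result can be tested directly in an if statement.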
Serviceguard for Linux
Honored Contributor

Re: NFS monitor script causing package to fail

Response from the team:

In the NFS toolkit, monitoring of the NFS daemons is done at two levels. First it checks the status of the NFS services using the rpcinfo command (e.g. rpcinfo -u 127.0.0.1 100003 2). The ps command checks the status of the NFS daemons only if rpcinfo fails. To find the exact problem, we need to understand why rpcinfo is failing on the production machine, so it would be good to get the output of the command "rpcinfo -u 127.0.0.1 100003 2" on the production machine.

Substituting the nfs.mon script from A.01.03 into A.01.02 will not work, because nfs.mon in A.01.03 makes use of hanfs.conf, which is not present in A.01.02.

Re: NFS monitor script causing package to fail

This is what I got on my production box when I ran "rpcinfo -u 127.0.0.1 100003 2":

program 100003 version 2 ready and waiting
program 100003 version 3 ready and waiting

Since going "live" in May of 2005, we've seen this problem occur 6/15, 12/21, 2/2 and 2/24 with 3 occurrences on the primary and once on the secondary.
Asha Yarangatta
New Member

Re: NFS monitor script causing package to fail

Hi,

We have not been able to reproduce the problem in our test environment, since the NFS monitoring script fails only once in a while and not always. So there may also be some issue with the NFS server that is causing the package to fail over. Can you please confirm that the package failover happens even though all the NFS daemons are running? Please check the NFS log files and let us know.
I feel it would be better to upgrade your toolkit to A.01.03, as it does not use ps anymore.

Thanks,
Asha

Re: NFS monitor script causing package to fail

I'm not really sure what you mean by nfs log. If you mean /var/log/messages, there is no mention of nfsd, which fits the symptom: there probably was nothing wrong with nfsd, but ps couldn't find it.

When we migrate to new NFS servers, we can upgrade to A.01.03 (or whatever version is available) but we can't justify it now unless this problem gets much worse.
Asha Yarangatta
New Member

Re: NFS monitor script causing package to fail

NFS debugging messages can be enabled by using the following commands

echo "217" >|/proc/sys/sunrpc/nfs_debug
echo "217" >|/proc/sys/sunrpc/nfsd_debug

Once enabled, all NFS debugging messages go to /var/log/messages.

NFS is built on the RPC mechanism. If the rpcinfo command fails, NFS will not work even if the NFS daemon (nfsd) is running. So the monitor script nfs.mon works by periodically checking the status of the NFS services using the rpcinfo command. If any service fails to respond, the script exits, causing a switch to an adoptive node.

The monitor script monitors NFS services including:
• portmap
• rpc.statd
• nfsd
• rpc.mountd
• rpc.rquotad
• lockd
If any of these services is dead or hung, nfs.mon will cause the package to fail.

So the monitoring of the NFS daemons is done using the rpcinfo command. The "ps" command is used only to log whether the process is dead or hung.
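To make the rpcinfo level of the check concrete, here is a rough sketch of how a single service could be probed by hand. It is not the toolkit's nfs.mon: the function names and the RPCINFO variable are invented for illustration. The program numbers are the standard ONC RPC assignments for the services listed above; the version number is left to the caller because it differs per service and per system.

```shell
# Hypothetical sketch, not toolkit code. RPCINFO can be overridden for testing.
RPCINFO=${RPCINFO:-rpcinfo}

# rpc_prog_number SERVICE
# Maps a service from the list above to its standard RPC program number.
rpc_prog_number() {
    case "$1" in
        portmap)     echo 100000 ;;
        nfsd)        echo 100003 ;;
        rpc.mountd)  echo 100005 ;;
        rpc.rquotad) echo 100011 ;;
        lockd)       echo 100021 ;;
        rpc.statd)   echo 100024 ;;
        *)           return 1 ;;
    esac
}

# rpc_service_alive SERVICE VERSION
# Returns 0 if the service answers an RPC call over UDP on localhost,
# 2 if the service name is unknown, otherwise rpcinfo's failure status.
rpc_service_alive() {
    prog=$(rpc_prog_number "$1") || return 2
    "$RPCINFO" -u 127.0.0.1 "$prog" "$2" >/dev/null 2>&1
}
```

For example, "rpc_service_alive nfsd 2" runs the same "rpcinfo -u 127.0.0.1 100003 2" probe quoted earlier in this thread and reports only success or failure.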

Can you please post your package log file so that we can investigate further?

Re: NFS monitor script causing package to fail

I've turned on NFS debugging on the primary cluster server (currently hosting the NFS package). What is the command to turn it off?

I've also attached the nfs package log file nfsla1.cntl.log and nfs monitor service log file hanfs.sh.log from the primary cluster server.
Asha Yarangatta
New Member

Re: NFS monitor script causing package to fail

I think NFS debugging messages can be turned off with the following commands:

echo "000" >|/proc/sys/sunrpc/nfs_debug
echo "000" >|/proc/sys/sunrpc/nfsd_debug

I am not sure about these commands, so please verify them.

From the attached log files, I understand that the monitoring script failed for the following reasons:

1) On May 23, rpc.mountd process was not up and running.

2) On Jun 15, rpcinfo failed to find nfsd but nfsd process was running.

3) On Oct 17, rpcinfo failed to find mountd but rpc.mountd process was running.

4) On Dec 21, rpcinfo failed to find mountd but rpc.mountd process was running.

5) On Feb 24, rpcinfo failed to find nfsd but nfsd process was running.

From the log messages, it appears that rpcinfo itself intermittently fails. I suggest increasing RETRY_INTERVAL and RETRY_TIMES[] for mountd and nfsd in nfs.mon.
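To illustrate why larger retry settings help with an intermittent rpcinfo failure, here is a generic retry loop. It is only a sketch and not the logic in nfs.mon: check_with_retry is a made-up name, it takes a single retry count rather than the per-service RETRY_TIMES[] array, and the default interval of 2 seconds is an assumption for the example.

```shell
# Hypothetical sketch, not toolkit code. The default interval is an assumption.
RETRY_INTERVAL=${RETRY_INTERVAL:-2}

# check_with_retry TRIES CMD [ARGS...]
# Runs CMD up to TRIES times, sleeping RETRY_INTERVAL seconds between
# attempts. Returns 0 on the first success, 1 if every attempt fails.
check_with_retry() {
    tries=$1
    shift
    attempt=1
    while [ "$attempt" -le "$tries" ]; do
        if "$@"; then
            return 0
        fi
        # Only wait if another attempt is still to come.
        if [ "$attempt" -lt "$tries" ]; then
            sleep "$RETRY_INTERVAL"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}
```

With this shape, a check such as "check_with_retry 5 rpcinfo -u 127.0.0.1 100003 2" only reports failure if all five probes fail, so a single transient miss no longer triggers a package switch.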