Operating System - HP-UX

rpc.statd looks like a runaway consuming 100% CPU

 
SOLVED
skt_skt
Honored Contributor

rpc.statd looks like a runaway consuming 100% CPU

From GPM, the process's system-call rate is very high, and the two main syscalls are getrlimit and poll.

CPU TTY PID USERNAME PRI NI SIZE RES STATE TIME %WCPU %CPU COMMAND
3 ? 2379 root 152 20 4436K 656K run 32915:13 100.09 99.92 rpc.statd

Here is what I see in "tusc -p <pid>":

[2379] getrlimit(RLIMIT_NOFILE, 0x6d3f4608) ........................................................... = 0
[2379] sigsetmask(NULL) ............................................................................... = 8192
[2379] poll(0x6d3f0590, 4, -1) ........................................................................ = 1
[2379] sigblock(0x2000) ............................................................................... = 0
[2379] getrlimit(RLIMIT_NOFILE, 0x6d3f4608) ........................................................... = 0
[2379] sigsetmask(NULL) ............................................................................... = 8192
[2379] poll(0x6d3f0590, 4, -1) ........................................................................ = 1
[2379] sigblock(0x2000) ............................................................................... = 0
(the same four-call cycle of getrlimit, sigsetmask, poll, and sigblock repeats indefinitely)

Planning to restart nfs.client, as it will restart rpc.statd, rpc.lockd, rpc.mountd, and nfsd.

Any suggestions would be welcome.
Dennis Handly
Acclaimed Contributor

Re: rpc.statd looks like a runaway consuming 100% CPU

>Planning to restart nfs.client, as it will restart rpc.statd, rpc.lockd, rpc.mountd, and nfsd.

It seems reasonable, unless it keeps happening.
skt_skt
Honored Contributor

Re: rpc.statd looks like a runaway consuming 100% CPU

I forgot to mention that this node is part of a CRS cluster, and there are NFS file systems mounted here (CRSVIPip:/mount_point) that will be unmounted during the nfs.client restart. That mount point is used by the running CRS instance, and I'm not sure how CRS will react if the mount point is temporarily unavailable.
Dave Olker
Neighborhood Moderator

Re: rpc.statd looks like a runaway consuming 100% CPU

My first recommendation would be as follows:

1) Get a listing of the nodes in /var/statmon/sm:

# ll /var/statmon/sm

2) Get a listing of nodes in /var/statmon/sm.bak:

# ll /var/statmon/sm.bak

3) Collect a debug logfile from rpc.statd and rpc.lockd:

# ps -ef | grep rpc
# kill -17 <rpc.statd PID>
# kill -17 <rpc.lockd PID>

wait 30 seconds

# kill -17 <rpc.statd PID>
# kill -17 <rpc.lockd PID>

I'd want to examine the debug logfile from rpc.statd to see what it's doing before terminating and restarting it, as just terminating/restarting it might not be enough to clear the race condition.
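The collection steps above could be wrapped in a small script. A sketch only, assuming (per this thread) that signal 17 (SIGUSR2 on HP-UX) toggles debug tracing in both daemons; the log location is also an assumption:

```shell
#!/usr/bin/sh
# Sketch: toggle debug tracing in rpc.statd and rpc.lockd with signal 17,
# let them log for ~30 seconds, then toggle tracing back off.

daemon_pids() {
    # PIDs of rpc.statd and rpc.lockd, if they are running
    ps -e | awk '/rpc\.(statd|lockd)/ { print $1 }'
}

pids=$(daemon_pids)
if [ -n "$pids" ]; then
    kill -17 $pids      # toggle tracing on
    sleep 30            # let the daemons log while the loop is spinning
    kill -17 $pids      # toggle tracing off
    echo "trace toggled; check the daemon logfiles (e.g. under /var/adm)"
else
    echo "rpc.statd/rpc.lockd not running"
fi
```

The same two signals appear twice because the signal is a toggle: the first pair turns tracing on, the second pair turns it off.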

Let me know if you need any help interpreting the data.

Regards,

Dave



I work at HPE
HPE Support Center offers support for your HPE services and products when and how you need it. Get started with HPE Support Center today.
[Any personal opinions expressed are mine, and not official statements on behalf of Hewlett Packard Enterprise]
skt_skt
Honored Contributor

Re: rpc.statd looks like a runaway consuming 100% CPU

I have two node names listed under /var/statmon/sm.bak, but neither is currently active. Is statd looking for these nodes?

I will perform the debug later.
Dave Olker
Neighborhood Moderator

Re: rpc.statd looks like a runaway consuming 100% CPU

When you say both are not currently active, do you mean they are temporarily out of service or permanently out of service? In other words, is there any reason why the local system will ever need to talk to those systems again?

Also, can you cat each of the files in /var/statmon/sm.bak and ensure the contents of the files match the names of the files?

Dave


skt_skt
Honored Contributor

Re: rpc.statd looks like a runaway consuming 100% CPU

Those servers are permanently removed. Also, the filenames and contents do match.

This server is one of the partitions of a two-nPartition physical box, and the names I see under sm.bak are their old console names.

So the partitions currently named X1 and X2 (the node in question) were earlier y1 and y2. What I see in sm.bak is y1 and y2.

Also, under node X1 I am able to see only y2 listed under sm.bak.

I am not sure where we are going with this...
Dave Olker
Neighborhood Moderator

Re: rpc.statd looks like a runaway consuming 100% CPU

If the servers in /var/statmon/sm.bak are permanently removed then here is what you should do:

1) Kill rpc.statd and rpc.lockd

# kill $(ps -e | egrep 'rpc.statd|rpc.lockd' | awk '{print $1}')

2) Remove any entries from /var/statmon/sm.bak for systems that are permanently gone from your environment

3) Restart rpc.statd and rpc.lockd

# /usr/sbin/rpc.statd
# /usr/sbin/rpc.lockd


Let me know if the problem persists after doing this.

Regards,

Dave


skt_skt
Honored Contributor

Re: rpc.statd looks like a runaway consuming 100% CPU

The debug mode shows that rpc.statd is looking for those old hosts, but rpc.lockd is not reporting anything:
aded141p:root [/var/adm] cat rpc.lockd.log
08.06 21:06:39 aded141p pid=2385 rpc.lockd
*********** Toggle Trace on *************
08.06 21:06:39 X2hostname pid=2385 rpc.lockd
LOCKD QUEUES:
***** granted reclocks *****
*****no entry in msg queue *****
***** no blocked reclocks ****
used_le=0, used_fe=0, used_me=0
08.06 21:07:46 X2hostname pid=2385 rpc.lockd
*********** Toggle Trace off *************


Is it mandatory to bounce both services, or is it OK to just remove these host entries from /var/statmon/sm.bak?

Ideally, restarting rpc.statd should not affect anything else? Unfortunately, I don't have a DEV CRS instance to test with, and I'm not sure how CRS will respond to bouncing rpc.statd.
Dave Olker
Neighborhood Moderator
Solution

Re: rpc.statd looks like a runaway consuming 100% CPU

Yes, it is mandatory to bounce both. You should never stop/restart lockd or statd without the other: they talk to each other, and if one of them starts up on a different port, the other may have the old port information cached, which can lead to unexpected results.

I don't believe removing the entries alone will resolve the problem: rpc.statd builds a cache of the entries at initialization time, so it wouldn't notice the entries are gone until it is stopped and restarted.

> ideally it should not affect anything.

Are there any entries in /var/statmon/sm? That would tell you whether this system is doing any NFS file locking as a client or server. If there are no entries in /var/statmon/sm, stopping and restarting these daemons should have no effect.

If you're concerned about how long the services will be unavailable, you can write a small Korn shell script that terminates the running daemons, deletes the /var/statmon/sm.bak entries, and restarts the daemons; the whole thing should take less than a second to run. Also, if there are no entries in /var/statmon/sm, you can start rpc.lockd with the "-g 0" option, which tells the daemon not to use a grace period, so it will accept new lock requests immediately. Again, virtually no downtime for the locking service.
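That script could be sketched roughly as follows. This is an untested illustration, assuming HP-UX paths and that y1/y2 are the stale hostnames to purge; the guards around kill and the restarts are sketch-only defensive checks:

```shell
#!/usr/bin/ksh
# Sketch: bounce rpc.statd/rpc.lockd as a pair and purge stale sm.bak
# records (assumptions: HP-UX paths; y1/y2 are the permanently gone hosts).
SM_BAK=${SM_BAK:-/var/statmon/sm.bak}
STALE_HOSTS=${STALE_HOSTS:-"y1 y2"}

# 1) Stop statd and lockd together -- never bounce one without the other.
pids=$(ps -e | awk '/rpc\.(statd|lockd)/ { print $1 }')
[ -n "$pids" ] && kill $pids

# 2) Purge the monitor records for hosts that are permanently gone.
for h in $STALE_HOSTS; do
    rm -f "$SM_BAK/$h"
done

# 3) Restart the pair. "-g 0" disables lockd's reclaim grace period --
#    only safe when /var/statmon/sm is empty (no outstanding locks).
[ -x /usr/sbin/rpc.statd ] && /usr/sbin/rpc.statd
[ -x /usr/sbin/rpc.lockd ] && /usr/sbin/rpc.lockd -g 0

echo "statd/lockd bounce attempted; verify with: ps -ef | grep rpc"
```

Because everything between the kill and the restart is a couple of file removals, the locking service is only down for a fraction of a second.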

Regards,

Dave



skt_skt
Honored Contributor

Re: rpc.statd looks like a runaway consuming 100% CPU

That did work as expected.