Operating System - HP-UX

Severe degradation of RPC communications

SOLVED
Ralph Grothe
Honored Contributor

Severe degradation of RPC communications

Hi network wizards,

I have a system that is experiencing considerable impairment of its NFS/RPC services.

Boundary conditions:

# uname -srv
HP-UX B.11.11 U

# echo "sc prod mem;info;wait;il"|cstm|grep -i total
Total Configured Memory : 8192 MB
Cell Total (MB): 4096
Cell Total (MB): 4096
System Total (MB): 8192
PDT Total Size: 100

# model
9000/800/rp7410

# swapinfo -ta
             Kb       Kb       Kb   PCT  START/      Kb
TYPE      AVAIL     USED     FREE  USED   LIMIT RESERVE PRI  NAME
dev     8388608  1110752  7277856   13%       0       -    1  /dev/vg00/lvol2
dev    12288000  1105884 11182116    9%       0       -    1  /dev/vg00/lvol12
reserve       -  6898764 -6898764
memory  6463356  1624092  4839264   25%
total  27139964 10739492 16400472   40%       -       0    -


The system is the cluster leader in a two-node MC/SG cluster and primary node for its one and only package, an Oracle/SAP R3 application (hence the 20 GB of swap).

The system load is, and has been, negligible.

# uptime
11:56am up 46 days, 21:39, 4 users, load average: 0.15, 0.14, 0.15

The same goes for CPU utilization.

The system seems only to suffer from memory insufficiencies.

E.g. the current summary from glance's memory report reads:

Total VM : 8.74gb Sys Mem : 1.48gb User Mem: 5.57gb Phys Mem: 7.98gb
Active VM: 2.69gb Buf Cache: 817.6mb Free Mem: 136.4mb

And the current swap report shows:

Swap Available: 26504m Swap Used: 3751mb Swap Util (%): 40 Reserved: 10489m
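One thing I still plan to check is whether the buffer cache is allowed to grow too large next to the scarce free memory; on 11.11 the dynamic buffer cache bounds can be queried with kmtune (whether the cache is actually the culprit here is just a guess on my part):

```shell
# Query the dynamic buffer cache limits (percent of physical RAM).
# glance's "Buf Cache" line above shows the current size (~818 MB).
kmtune -q dbc_min_pct
kmtune -q dbc_max_pct
# If dbc_max_pct is still at its default of 50%, the cache may balloon
# under I/O load and squeeze user memory; lowering it on 11.11 needs a
# kernel rebuild and a reboot.
```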

Yesterday when the RPC problem became apparent I couldn't even start glance for a while, and thus ran "sar -w" which showed swapout activity.

Therefore I'm tempted to believe that our RPC problems are simply a consequence of insufficient memory.
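For completeness, the 40% swap utilization reported above can be recomputed from the totals of the swapinfo output (numbers copied from that output; just a sanity check):

```shell
# Recompute PCT USED from the "total" line of swapinfo -ta above.
avail=27139964   # Kb AVAIL on the total line
used=10739492    # Kb USED on the total line
pct=`awk -v a=$avail -v u=$used 'BEGIN { printf "%.0f", u * 100 / a }'`
echo "swap utilization: ${pct}%"
```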

NFS-wise the same cluster node is NFS server and NFS client.
This is due to SAP's quirk of making heavy use of the automounter and of lots of NFS exports and imports across a whole farm of other servers (e.g. SAP transports).
This automount stuff in a cluster environment is causing quite some grief, but SAP requires it.

So, sometimes RPC tools like showmount time out, while at other times they respond quickly.

I have quite a few entries about failed NFS communication in syslog.log (n.b. alster is the NFS server/exporter and lena, the cluster package's virtual hostname, is the client; both are on the same machine, so I think I can leave the LAN out of consideration).

# grep -i nfs /var/adm/syslog/syslog.log|sed -n '/Jan 21/,$p'|tail
Jan 22 05:00:32 alster vmunix: NFS server (pid716@/usr/sap/trans) ok
Jan 22 05:00:32 alster vmunix: NFS server (pid716@/sapmnt/Z01) ok
Jan 22 05:04:02 alster vmunix: NFS server (pid716@/sapmnt/Z01) ok
Jan 22 05:04:02 alster vmunix: NFS server (pid716@/sapmnt/Z01) not responding still trying
Jan 22 05:30:46 alster vmunix: NFS server (pid716@/sapmnt/Z01) not responding still trying
Jan 22 05:31:01 alster vmunix: NFS server (pid716@/sapmnt/Z01) ok
Jan 22 06:50:28 alster vmunix: NFS server (pid716@/sapmnt/Z01) not responding still trying
Jan 22 06:51:02 alster vmunix: NFS server (pid716@/sapmnt/Z01) ok
Jan 22 07:40:00 alster vmunix: NFS server (pid716@/sapmnt/Z01) not responding still trying
Jan 22 07:40:00 alster vmunix: NFS server (pid716@/sapmnt/Z01) ok


The RPC stats for the server look like this:

# nfsstat -sr

Server rpc:
Connection oriented:
calls      badcalls   nullrecv
57792      0          0
badlen     xdrcall    dupchecks
0          0          20800
dupreqs
0
Connectionless oriented:
calls      badcalls   nullrecv
13668160   0          0
badlen     xdrcall    dupchecks
0          0          4879982
dupreqs
188


which looks sound to me, while the client side probably exhibits too many timeouts and retransmissions:

# nfsstat -cr

Client rpc:
Connection oriented:
calls      badcalls   badxids
0          0          0
timeouts   newcreds   badverfs
0          0          0
timers     cantconn   nomem
0          0          0
interrupts
0
Connectionless oriented:
calls      badcalls   retrans
4011982    139        13067
badxids    timeouts   waits
13186      12027      0
newcreds   badverfs   timers
0          0          65907
toobig     nomem      cantsend
0          0          0
bufulocks
0
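To put the connectionless figures above into proportion (the counters are copied from the nfsstat output; the often-quoted rule of thumb that more than about 5% retransmissions signals trouble comes from the NFS tuning literature, not from my own measurements):

```shell
# Express retrans/badxids/timeouts as a share of total client calls.
awk -v calls=4011982 -v retrans=13067 -v badxids=13186 -v timeouts=12027 'BEGIN {
    printf "retrans : %.2f%% of calls\n", retrans * 100 / calls
    printf "badxids : %.2f%% of calls\n", badxids * 100 / calls
    # badxids roughly equal to timeouts means the server does answer,
    # only after the client has already retransmitted, i.e. the server
    # is slow rather than unreachable.
    printf "badxid/timeout ratio: %.2f\n", badxids / timeouts
}'
```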


Because there are so many services and partners involved in NFS, I fear that a thorough NFS performance analysis is beyond my abilities.

Therefore, I would like to ask you network gurus for some advice on what to look for and where improvements could be made.

Rgds.
Ralph
Madness, thy name is system administration
3 REPLIES
Ron Kinner
Honored Contributor
Solution

Re: Severe degradation of RPC communications

From

http://secu.zzu.edu.cn/book/NetWork/NetworkingBookshelf_2ndEd/nfs/appb_02.htm

"badxids ~ timeout
RPC requests that have been retransmitted are being handled by the server, and the client is receiving duplicate replies. Increase the timeo parameter for this NFS mount to alleviate the request retransmission, or tune the server to reduce the average request service time."
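In practice that suggestion might look like the following; the map entries, hostnames and values below are purely illustrative, not taken from your setup:

```shell
# Illustrative only -- adjust to your own automounter maps or fstab.
# Raise the minor timeout to 3 s (timeo is in tenths of a second for
# UDP mounts) and allow more retries before the kernel starts logging
# "not responding still trying":
#   /sapmnt/Z01  -rw,timeo=30,retrans=5  lena:/sapmnt/Z01
# Or, if your NFS patch level supports NFS over TCP, moving the mounts
# to TCP does away with UDP retransmission altogether:
#   /sapmnt/Z01  -rw,proto=tcp           lena:/sapmnt/Z01
```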

You might also check your patches.

http://www2.itrc.hp.com/service/patch/search.do

shows 29 patches when you just search on nfs.

Ron
Stefan Farrelly
Honored Contributor

Re: Severe degradation of RPC communications

I would say all your problems stem from lack of memory and the consequent massive swapping. Performance degrades so much during heavy swapping that everything can be affected: NFS daemons, network, root processes, etc. I think someone once said performance degrades by around 100 times during swapping.

I would fix the memory problems first and see whether that resolves the issue.

If not, then the best source for looking in depth at NFS issues and fixing them is a book called "Optimizing NFS Performance" by Olker (available on Amazon). Excellent; it covers everything (NFS, RPC, the network) and how to tune and debug.
I'm from Palmerston North, New Zealand, but somehow ended up in London...
Ralph Grothe
Honored Contributor

Re: Severe degradation of RPC communications

Ron,

thanks for your link to the ORA networking books.
Kudos to those who provide this service ;-)

Stefan,

you are probably right that the prevailing memory issue is to blame.
I will do further performance checks to single out the NFS trouble.
Madness, thy name is system administration