
System hanging

 
SOLVED
Evelyn Daroga
Regular Advisor

System hanging


Running HP-UX 11.0 on an N-class server. We've been experiencing system hanging problems recently, where the system seems to hang up for several seconds, then is ok. GPM shows Memory Wait Queue spiking during these periods, jumping from 0 processes waiting to up to 32 processes waiting on mem resources. It remains like that for a number of seconds -- enough to cause user complaints.
Swapinfo shows:
              Kb      Kb       Kb   PCT  START/      Kb
TYPE       AVAIL    USED     FREE  USED   LIMIT RESERVE  PRI  NAME
dev       512000       0   512000    0%       0       -    1  /dev/vgroot/swap
dev      3072000  212884  2859116    7%       0       -    0  /dev/vg01/swap
dev      3072000  215900  2856100    7%       0       -    0  /dev/vg02/swap
dev      3072000  215884  2856116    7%       0       -    0  /dev/vg03/swap
dev      3072000  217864  2854136    7%       0       -    0  /dev/vg05/swap
reserve        - 3163608 -3163608
memory   2786380 1437704  1348676   52%

I don't think we have a shortage of memory; there is almost no paging out at all, and almost no VM writing to disk in GPM. I've used the command UNIX95= ps -e -o "user,vsz,pid,ppid,args" | sort -rnk2 and saved the output to files during "normal" periods and during "wait" periods. Looking at the top memory users, it appears that when a wait period begins, a new process appears at the high-memory-usage end of the listing, but when the wait period ends, those processes still remain. This doesn't seem like unusual behavior that should suddenly cause obvious performance issues. How can I identify the exact cause of the memory queue waits? Then, how do I go about correcting it?
Any help would be appreciated.
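
One way to make the snapshot comparison described above less manual is to compare the two listings by PID. This is only a minimal sketch: it assumes both files came from the same ps command (so PID is the third column), and the function name new_procs and the file names in the comments are made up for illustration.

```shell
# new_procs NORMAL_FILE WAIT_FILE
# Print rows of WAIT_FILE whose PID (3rd column of "user,vsz,pid,ppid,args")
# does not appear in NORMAL_FILE. The header line is present in both files,
# so it drops out on its own.
new_procs() {
    awk 'NR == FNR { seen[$3] = 1; next }   # first file: remember every PID
         !($3 in seen)                      # second file: print unseen PIDs
        ' "$1" "$2"
}

# Example use (file names are hypothetical):
#   UNIX95= ps -e -o user,vsz,pid,ppid,args | sort -rnk2 > normal.txt
#   ... later, during a wait period ...
#   UNIX95= ps -e -o user,vsz,pid,ppid,args | sort -rnk2 > wait.txt
#   new_procs normal.txt wait.txt
```

Because the wait-period file keeps its sort order, the new processes still come out largest first.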
11 REPLIES
Sridhar Bhaskarla
Honored Contributor

Re: System hanging

Hi Evelyn,

Looking at your swapinfo, your system is running low on memory. The reserve is around 3GB, which gives a good idea of how much the processes are occupying, and you can see almost 1 GB of pages sitting on the swap devices. When a new process is forked, some processes have to be paged out to accommodate its private area. What is the cumulative page-out value in your vmstat -s output?
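
The arithmetic behind that reading of swapinfo can be checked directly. The following is a minimal sketch that runs the figures posted above through awk. The function name swap_summary is made up, and the field positions (USED as field 3 on the dev and reserve lines) are taken from the posted output, not from a documented format.

```shell
# swap_summary: read swapinfo output on stdin and print, in KB,
# the total USED on device swap and the reserved amount.
swap_summary() {
    awk '$1 == "dev"     { dev_used += $3 }   # USED column of each device
         $1 == "reserve" { reserved = $3 }    # swap reserved but not written
         END { printf "dev_used_kb=%d reserved_kb=%d\n", dev_used, reserved }'
}

# Figures copied from the swapinfo output in the original post.
swap_summary <<'EOF'
dev 512000 0 512000 0% 0 - 1 /dev/vgroot/swap
dev 3072000 212884 2859116 7% 0 - 0 /dev/vg01/swap
dev 3072000 215900 2856100 7% 0 - 0 /dev/vg02/swap
dev 3072000 215884 2856116 7% 0 - 0 /dev/vg03/swap
dev 3072000 217864 2854136 7% 0 - 0 /dev/vg05/swap
reserve - 3163608 -3163608
memory 2786380 1437704 1348676 52%
EOF
# prints: dev_used_kb=862532 reserved_kb=3163608
```

That is roughly 842 MB written to the swap devices and about 3.0 GB reserved. On a live system the same function could be fed with swapinfo | swap_summary.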

If you are using the default buffer cache of 50%, you may want to lower it by altering dbc_max_pct. Or get the system some more memory, like 2GB extra.

-Sri

You may be disappointed if you fail, but you are doomed if you don't try
Uday_S_Ankolekar
Honored Contributor

Re: System hanging

Evelyn Daroga
Regular Advisor

Re: System hanging

Thanks for the quick reply! vmstat -s shows the following for page stats:
438883 pages swapped out
16647831 page outs
39011422 pages paged out
472405557 pages scanned for page out
But, are these values cumulative since the last reboot? The system has been up for 248 days. Would a reboot free up some mem? Also, the dbc_max_pct is at 50%. Would a reduction to 30% be reasonable? Thanks again for your help -- sys tuning is not my specialty!!!
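
On the cumulative question: vmstat -s reports totals since boot, so a single reading is most useful as an average rate, or as one end of a before/after difference. Here is a minimal sketch that uses only the figures posted above (16647831 page outs, 248 days of uptime). The function name is made up, and a long-run average like this will hide short bursts of the kind being reported.

```shell
# avg_page_outs_per_sec UPTIME_DAYS
# Read "vmstat -s" output on stdin, pick the "page outs" counter and
# divide it by the uptime in seconds.
avg_page_outs_per_sec() {
    awk -v days="$1" '
        $2 == "page" && $3 == "outs" { outs = $1 }
        END { printf "%.2f\n", outs / (days * 86400) }'
}

# Counters copied from the post above; 248 is the stated uptime in days.
avg_page_outs_per_sec 248 <<'EOF'
438883 pages swapped out
16647831 page outs
39011422 pages paged out
472405557 pages scanned for page out
EOF
# prints: 0.78
```

To see what happens during a hang, a more telling approach is to capture vmstat -s twice, a few seconds apart, and subtract the two "page outs" values.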
Evelyn Daroga
Regular Advisor

Re: System hanging

Thanks, also, Uday, for the doc reference. I have printed it out, and will certainly study it. I need all the help I can get!
Joseph C. Denman
Honored Contributor

Re: System hanging

Hi Evelyn,

Check out this thread. James attached several good documents, including the one stated above.

...jcd...
If I had only read the instructions first??
Sridhar Bhaskarla
Honored Contributor

Re: System hanging

Well, since the system has been up and running for 248 days, there might be some memory leaks hanging around. You can use ipcs -mob, look at NATTCH, and look for non-root segments with a value of 0. This might indicate that they were not released properly. You can use ipcrm to remove them. But I would suggest halting the application and checking again to see whether the segments really belong to it. If you don't want to get into trouble, you can reboot the system.
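
The ipcs check described above can be narrowed with a filter. This is a rough sketch: it assumes ipcs -mob prints shared memory rows that start with m, with OWNER in column 5 and NATTCH in column 7, so verify that against the header line on your own system. The sample rows and the function name orphan_candidates are invented for illustration and do not come from the system in this thread.

```shell
# orphan_candidates: read "ipcs -mob" output on stdin and print the ID,
# owner and size of shared memory segments that are not owned by root
# and have no attached processes (NATTCH == 0).
# Assumed columns: T ID KEY MODE OWNER GROUP NATTCH SEGSZ
orphan_candidates() {
    awk '$1 == "m" && $5 != "root" && $7 == 0 { print $2, $5, $8 }'
}

# Hypothetical sample rows, not taken from the system in this thread.
orphan_candidates <<'EOF'
T ID KEY MODE OWNER GROUP NATTCH SEGSZ
m 0 0x2f180002 --rw------- root sys 0 1048576
m 1205 0x00000000 --rw-rw---- appuser users 0 2097152
m 1406 0x0c6629c9 --rw-r----- appuser users 12 4194304
EOF
# prints: 1205 appuser 2097152
```

The output is only a list of candidates. As noted above, it is safer to stop the application and check again before removing anything with ipcrm.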

You certainly need to consider decreasing the dbc_max_pct value. You can make it around 15% and watch %wio in sar -u, and %rcache and %wcache in sar -b.

dbc_max_pct has to be adjusted according to the pattern you see. Typically you want %rcache > 90 and %wcache > 80. If you have %wio > 20, then your system is bottlenecking on IO.
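
Those rule-of-thumb targets are easy to turn into a quick check. This is a minimal sketch that takes the three percentages as arguments. The function name check_sar and the readings in the example are hypothetical, and the thresholds are just the ones quoted in this reply.

```shell
# check_sar RCACHE WCACHE WIO
# Compare sar percentages against the targets quoted above:
# %rcache above 90, %wcache above 80, %wio no more than 20.
# Prints one line per missed target, then "ok" or "check".
check_sar() {
    awk -v r="$1" -v w="$2" -v io="$3" 'BEGIN {
        status = "ok"
        if (r + 0 <= 90)  { print "%rcache low: " r;  status = "check" }
        if (w + 0 <= 80)  { print "%wcache low: " w;  status = "check" }
        if (io + 0 > 20)  { print "%wio high: " io;   status = "check" }
        print status
    }'
}

# Hypothetical readings, for illustration only.
check_sar 95 85 10    # prints: ok
```

The values would be read off sar -b (%rcache, %wcache) and sar -u (%wio) after each change to dbc_max_pct.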

You can start with 15% of dbc_max_pct and observe the above.

This change needs a reboot, so it will also clear any memory leaks you have on the box.

Take a make_recovery tape of the system before rebooting. I have seen systems with this much uptime not come up properly. Also run /sbin/init.d/clean_ex start before rebooting the box, as it will otherwise take a long time during the bootup.

-Sri
You may be disappointed if you fail, but you are doomed if you don't try
Evelyn Daroga
Regular Advisor

Re: System hanging

Ok, ipcs -mob indicates only one non-root segment with a 0 for NATTCH. SEGSIZE is 1771332. I think I know what it is, and will remove it tomorrow (don't want to today, just in case it bothers nightly processing). I will adjust the dbc_max_pct to 15%, and watch the items mentioned. I am unable to reboot until the weekend. I do a make_recovery every night, so will have one on hand, in case. I'll have to let this wait till next week, but will look at anything else anyone might suggest in the meantime. Thanks for all the suggestions.
Roger Baptiste
Honored Contributor

Re: System hanging

hi,

Regarding tracking current pageouts, run vmstat -n and it will list them under the po column.

for eg:
# vmstat -n 5 5
VM
       memory                    page                         faults
     avm    free   re   at    pi   po    fr   de    sr     in     sy    cs
   96439 1426934  140   31     1    0    46    0     0   5642  12263  3923
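
To watch just the po value over time, the column can be pulled out by name instead of by counting fields. This is a minimal sketch that uses the sample line above as input. The function name po_column is made up, and the handling of a separate CPU section is an assumption about the vmstat -n layout, so check it against your own output.

```shell
# po_column: read "vmstat -n" output on stdin, locate the "po" heading
# in the VM section and print that field from each numeric data row.
po_column() {
    awk '
        $1 == "VM"  { vm = 1; next }     # start of a VM section
        $1 == "CPU" { vm = 0; next }     # CPU section: ignore its rows
        vm && col == 0 { for (i = 1; i <= NF; i++) if ($i == "po") col = i; next }
        vm && $1 ~ /^[0-9]+$/ { print $col }'
}

# Sample copied from the vmstat output above.
po_column <<'EOF'
VM
memory page faults
avm free re at pi po fr de sr in sy cs
96439 1426934 140 31 1 0 46 0 0 5642 12263 3923
EOF
# prints: 0
```

On a live system this would be run as vmstat -n 5 5 | po_column. A run of non-zero values during a hang would point at paging.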

***

Also see the usage of vhand and swapper:
ps -ef | egrep 'vhand|swapper'

If they are accumulating CPU time rapidly, then the system is doing too much memory swapping, which is not a good thing.

HTH
raj
Take it easy.
Evelyn Daroga
Regular Advisor

Re: System hanging

Changed the dbc_max_pct to 15 and rebooted Sat night. After about 7 hrs "normal" usage, swapinfo shows:
              Kb      Kb       Kb   PCT  START/      Kb
TYPE       AVAIL    USED     FREE  USED   LIMIT RESERVE  PRI  NAME
dev       512000       0   512000    0%       0       -    1  /dev/vgroot/swap
dev      3072000   58212  3013788    2%       0       -    0  /dev/vg01/swap
dev      3072000   57816  3014184    2%       0       -    0  /dev/vg02/swap
dev      3072000   58444  3013556    2%       0       -    0  /dev/vg03/swap
dev      3072000   58424  3013576    2%       0       -    0  /dev/vg05/swap
reserve        - 3729136 -3729136
memory   2786348  655384  2130964   24%
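
For a quick before/after comparison, the device swap in use can be totalled from both listings. This is a minimal sketch with a made-up function name, using only the dev lines posted in this thread. Note that the second reading was taken about 7 hours after a reboot, so the two are not measured under the same conditions.

```shell
# dev_used_kb: sum the USED column (field 3) of the dev lines of swapinfo.
dev_used_kb() {
    awk '$1 == "dev" { used += $3 } END { print used + 0 }'
}

# dev lines from the first swapinfo listing in this thread
before=$(printf '%s\n' \
    'dev 512000 0 512000 0% 0 - 1 /dev/vgroot/swap' \
    'dev 3072000 212884 2859116 7% 0 - 0 /dev/vg01/swap' \
    'dev 3072000 215900 2856100 7% 0 - 0 /dev/vg02/swap' \
    'dev 3072000 215884 2856116 7% 0 - 0 /dev/vg03/swap' \
    'dev 3072000 217864 2854136 7% 0 - 0 /dev/vg05/swap' | dev_used_kb)

# dev lines from the listing taken after the dbc_max_pct change and reboot
after=$(printf '%s\n' \
    'dev 512000 0 512000 0% 0 - 1 /dev/vgroot/swap' \
    'dev 3072000 58212 3013788 2% 0 - 0 /dev/vg01/swap' \
    'dev 3072000 57816 3014184 2% 0 - 0 /dev/vg02/swap' \
    'dev 3072000 58444 3013556 2% 0 - 0 /dev/vg03/swap' \
    'dev 3072000 58424 3013576 2% 0 - 0 /dev/vg05/swap' | dev_used_kb)

echo "before=$before after=$after difference=$((before - after))"
# prints: before=862532 after=232896 difference=629636
```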


Now I remember why I haven't been using sar...
Since upgrading to a newer system, and to HP-UX 11.0, sar is broken. It wants /var/adm/sa/sa17, which does not exist on the system. Even the directory /var/adm/sa does not exist. Do I need to reinstall something?