Operating System - HP-UX

Running out of some resource... but what?

 
Shane Travis
Frequent Advisor

Running out of some resource... but what?

System: HP-UX 11.0 800-series machine.

Use: Project machine, used by about a dozen people. Many of them use dtterm to access it. Most of them are doing software development or testing, which entails about a dozen tasks running per user. As all s/w developers do, they will re-compile tasks when and as needed.

Problem: Something causes the machine to lock up occasionally; people cannot start any new processes, some of their windows die, makes fail, etc. Users see messages like "Can't allocate memory", "No more processes" and "Cannot fork; process terminated".

Analysis: At this time, if I do "ps -ef | wc -l" the process count is usually around 425-430. Also, "swapinfo" tells me that I'm only at about 50% utilization of swap space. SAM tells me that I've got about 1G used, 1.7G reserved, and 1.3G available.

Attempted fixes:
1) Increased device swap from 1 GB to 2 GB, later adding 2 GB of filesystem swap as well. (Current total = 2 GB device swap + 2 GB filesystem swap.)
2) Mucked with some kernel parameters. Some of the values that might be relevant are:

maxfiles 2048
maxfiles_lim 2048
maxswapchunks 1537
maxuprc 6000
maxusers 400
nfile 12017
nflocks 3220
ninode 7248
nproc 6000
swapmem_on 1

First I thought it was processes, then I thought it was swap... now I don't know what to think. Any help appreciated.
19 REPLIES
Rick Garland
Honored Contributor

Re: Running out of some resource... but what?

Check on nproc and maxuprc.

You may be hitting the limit on the number of processes per system and/or per user. These limits are defined in the kernel.
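
For what it's worth, a quick sketch of how those limits and the current usage could be compared (kmtune output format varies slightly by release, so treat this as an example rather than gospel):

# configured kernel limits
/usr/sbin/kmtune | egrep 'nproc|maxuprc'

# current process count, total and per user
ps -ef | wc -l
ps -ef | awk 'NR > 1 { c[$1]++ } END { for (u in c) print c[u], u }' | sort -rn

# process-table usage right now (proc-sz column shows "in use / table size")
sar -v 1 1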
John Palmer
Honored Contributor

Re: Running out of some resource... but what?

What sort of server is it and how much RAM has it got?

Do you have to reboot to get out of this situation?

The symptoms do indeed sound memory/swap space related.

Do you have glance? If not then installing the trial version will be beneficial.

Otherwise, the output from 'vmstat 5 5' would be useful, together with 'swapinfo -t' and 'ps -el'.

Alan Riggs
Honored Contributor

Re: Running out of some resource... but what?

It certainly sounds like a resource exhaustion problem. My first look in such cases is usually at swap, as was yours. Just to make sure, though: when you say swap was 50% utilized, does that include used and reserved? You can run into problems with swap space even at 0% "used" if you have 100% reserved. The system will not fork a new process if it cannot reserve space to page it out if necessary. You have 4GB total swap now, but what size is your memory?

Other possibilities are nfile, ninode, and nproc. You can get a look into these with sar -v (though ninode should not be an issue if you are using vxfs filesystems). Can you post the results of swapinfo and sar -v for the affected system?
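
To make the used-plus-reserved point concrete, here is a minimal sketch that pulls the percentage off the 'total' line of swapinfo -t and complains above a threshold (90 is just an example value):

#!/usr/bin/sh
# warn when total swap (used + reserved, including pseudo-swap) gets close to full
THRESHOLD=90
PCT=`swapinfo -t | awk '/^total/ { sub("%", "", $5); print $5 }'`
if [ "$PCT" -ge "$THRESHOLD" ]
then
    echo "swap is ${PCT}% committed - forks may start failing soon"
fi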
Alan Riggs
Honored Contributor

Re: Running out of some resource... but what?

Oh, and I forgot to ask, do you have SWAP_MEM activated? It is a good idea in almost all cases.
James R. Ferguson
Acclaimed Contributor

Re: Running out of some resource... but what?

Tim:

Do you have a lot of orphan processes when you do a ps ???

Based on the kernel parameters you've listed, I vote for looking at Glance's statistics too (toggle 't' for table)!

...JRF...
CHRIS_ANORUO
Honored Contributor

Re: Running out of some resource... but what?

Hi Tim,

Increase the following parameters:
maxfiles_lim=4048
maxusers=500
npty=512
dbc_max_pct=25
dbc_min_pct=5
nbuf=0
bufpages=0
Look at your shared memory parameters and increase from the default values.
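
If any of these do get changed on 11.0, the usual route is SAM or kmtune followed by a kernel rebuild. A rough sketch of the command-line flow, using one of the values suggested above (double-check the man pages before running any of it):

/usr/sbin/kmtune -s maxusers=500      # stage the new value in /stand/system
/usr/sbin/mk_kernel -s /stand/system  # build a new kernel as /stand/build/vmunix_test
/usr/sbin/kmupdate                    # install the new kernel at the next shutdown
shutdown -r -y 0                      # reboot to pick it up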

When We Seek To Discover The Best In Others, We Somehow Bring Out The Best In Ourselves.
Shane Travis
Frequent Advisor

Re: Running out of some resource... but what?


Thanks for all the prompt and helpful answers. Here are my responses, in the order the posts were made.

==========
> Rick Garland:
> Check on nproc and maxuprc.

Thanks for replying, and no offense intended, but did you even read what I wrote? Or are you suggesting that 6000 and 2000 (respectively) are too low when I clearly stated that I had about 430 processes going at the time of the trouble?

==========
> John Palmer
> What sort of server is it and how much RAM
> has it got?

I know that this is available on system startup, and I'm sure it's available at other times too, but I cannot for the life of me remember how to get it. Little help? :-(

> Do you have to reboot to get out of this
> situation?

Nope... it seems to go away after a while. That's what makes me think it was swap/proc related -- whoever was bagging out the system finished what they were doing and everything went back to normal. (Of course, everyone denies doing anything that would bag the system...)

> Do you have glance? If not then installing
> the trial version will be beneficial.

This project did not buy a copy of glance plus, and I was unaware that there was a free trial version. URL/search word?

> Otherwise the output from 'vmstat 5 5'
> would be useful together with
> 'swapinfo -t' and 'ps -el'

Swapinfo and ps I already use... vmstat is new to me. Thanks.


==========
> Alan Riggs
> When you say swap was 50% utlized does
> that include used and reserved?

According to SAM, of the 4 GB allocated, ~1 GB is used, 1.7 GB reserved, and 1.3 GB available.

> You have 4GB total swap now, but what size is your memory?

See above befuddlement.

> Can you post the results of swapinfo
> and sar -v for the affected system?

These are going to be ugly, I'm sure... do HTML tags work here?


# swapinfo -t
              Kb       Kb       Kb   PCT  START/       Kb
TYPE       AVAIL     USED     FREE  USED   LIMIT  RESERVE  PRI  NAME
dev      1048576   401172   647404   38%       0        -    1  /dev/vg00/lvol2
dev      1048576   398656   649920   38%       0        -    1  /dev/vg00/lvol10
localfs  2097152        0  2097152    0% 2097152        0    1  /sw/paging
reserve        -  1522872 -1522872
memory    726944   334488   392456   46%
total    4921248  2657188  2264060   54%       -        0    -

# sar -v
sar: Can't open /var/adm/sa/sa29


I'm assuming that sar isn't supposed to do that... :-) (not a tool I've ever used before.)

> Oh, and do you have SWAP_MEM activated?
As printed above in my problem description:
swapmem_on 1

==========
> James R. Ferguson

> Do you have a lot of orphan processes
> when you do a ps ???

Almost none -- certainly no more than at usual times.

> I vote for looking at Glance too

That seems to have been the general answer around here as well as from you guys. I'll have to lobby harder to get a copy on the next project.

> (toggle 't' for table)!

Will do.

==========
> Chris Anoruo

> Increase the following parameters:

A little rationale would be appreciated... why these parameters? Why to these values? (I would rather be taught to fish than handed a fish sandwich... :-)

> maxfiles_lim=4048
(was 2048) Why double this value?

> maxusers=500
(was 400) Is +100 to maxusers going to make that much of a diff? I've only got 12 people working on this box...

> npty=512
(was 400) Again, what will this increase do?

> dbc_max_pct=25
> dbc_min_pct=5
Already 50% and 5% respectively.

> nbuf=0
> bufpages=0
Already there, thanks.

> Look at your shared memory parameters
> and increase from the default values.

And how will this help?

==========

Thanks again; looking forward to the next round.
James R. Ferguson
Acclaimed Contributor

Re: Running out of some resource... but what?

Tim:

A trial version of Glance can be obtained from your Application CD-ROM. Sorry, but I'm not sure which one at the moment. When you get Glance installed, a question mark (?) will trigger the presentation of a help menu. On that menu you will see the offering for "t - system tables". It is that to which I refer. Regards, Jim.

...JRF...
Alan Riggs
Honored Contributor

Re: Running out of some resource... but what?

Sorry, I missed the swapmem_on line in your problem description. The swapinfo indicates heavy usage for a system with 1 GB of memory: you have definitely outgrown your initial device swap and are over-utilizing your RAM (ideally, swap utilization should be 0%, but that's a separate performance issue). If this swapinfo was taken during a time when the system was rejecting forks, then it does not seem that swap is the culprit.

Glance trial version can be installed off of the standard applications CDs.

No, sar should not do that. It means that sar history collection is not enabled on the system. You can still run sar interactively: sar -v 5 20. That will show you the current state and will be useful for future troubleshooting. If you want to enable sar collection for histories, add the following lines to root's crontab and make sure that the /var/adm/sa directory exists and is writable:

#
# Capture system data for sar
#
0 * * * 0,6 /usr/lbin/sa/sa1 1200 3
0 8-17 * * 1-5 /usr/lbin/sa/sa1 900 4
0 18-7 * * 1-5 /usr/lbin/sa/sa1 1200 3
45 23 * * 1-5 /usr/lbin/sa/sa2 -s 0:00 -e 23:30 -i 3600 -A
15 6 * * * find /var/adm/sa -name 'sa*' -mtime +7 -exec rm {} \; > /dev/null 2>&1

Customize the lines to your needs. In my case, the three .../sa/sa1 lines define how often I want readings taken during the day: every 15 minutes (900 seconds) during business hours, every 20 minutes after hours and on weekends. The .../sa2 line writes the daily history reports (formatted and raw) into the /var/adm/sa directory. The find command simply removes files more than a week old, since I do not use them for long-term system documentation.
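
Once the collector has been running for a while, the saved history can be read back for the window when the problem hit; for example (the file name follows the day of the month, so sa29 is the 29th):

# table usage (proc-sz, inod-sz, file-sz) from the saved daily file, 09:00 to 12:00
sar -v -f /var/adm/sa/sa29 -s 09:00 -e 12:00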

Happy fishing.
John Palmer
Honored Contributor

Re: Running out of some resource... but what?

Tim,

I'm pretty certain that all versions of Glance are on Application CD number 2.

You want the software set called 'Trial GlancePlus' or something similar. It doesn't need a reboot and is good for 60 days from when you start using it.

Glance itself will tell you how much RAM you have; the 't' command gives system tables and 'f' gets you the second page. 'h' and '?' are useful for help.

Another easy way to get your RAM (provided the system message buffer hasn't been overwritten since the last reboot) is the command 'dmesg' and as you're on 11 the pertinent messages are also written to /var/adm/syslog/syslog.log.

Tight lines...

Devbinder Singh Marway
Valued Contributor

Re: Running out of some resource... but what?

A thought: even though you have maxuprc set to 6000 (a high value), you are still getting fork failures. This value is the maximum number of processes any ONE user can have running, so it could be that one of the user-created programs is forking too many processes, hence the error and lock-up. Is there a process clocking up a lot of time?
Also, you can cron 'vmstat 5 5', redirect it to a file, and run it every hour or every 20 minutes to see at which times swap and user processing are high, so you can narrow down what was running at that particular time (which may be the cause).
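
A minimal crontab sketch along those lines (log path and interval are just examples):

# every 20 minutes, append a timestamped vmstat and swap snapshot to a log
0,20,40 * * * * (date; swapinfo -t; vmstat 5 5) >> /var/adm/resource.log 2>&1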

Another utility you can look at is 'ipcs'; monitor who is using shared memory and whether one user is grabbing all of it (for more details on ipcs, do man ipcs).

regards Dev
Seek and you shall find
Lynn Calback
New Member

Re: Running out of some resource... but what?

I have seen this problem before on an HP-UX 10.20 box. A runaway cron job ate all the available process slots, and thus the "cannot fork" message occurred. In this scenario, new users cannot sign on and no new processes, cron or otherwise, can be started. Check the cron log in /var/adm/cron/log when you see this problem again. You will see the error messages you mentioned in this log if this is the cause of your problem.
You may modify the queuedefs file in /var/adm/cron to increase the number of processes allowed for at, cron and batch jobs.
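
For reference, a queuedefs entry looks roughly like this, assuming the usual q.[#j][#n][#w] syntax -- this one would let the cron queue run at most 4 jobs at once, at nice value 2, retrying every 90 seconds (check man queuedefs on your release before editing /var/adm/cron/queuedefs):

c.4j2n90w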
Shane Travis
Frequent Advisor

Re: Running out of some resource... but what?

Okay. First of all, Thanks again to everyone who has replied -- and who has bothered to read down this far. I was going to start a new topic and just reference this one, but I figured better to have it all in one place. Still, it makes for a lot of screens of stuff.

First, a quick response to a couple of the later points:

John Palmer
===========
dmesg! That's the command I was trying to remember. The relevant output from that command shows this for system memory:

Memory Information:
physical page size = 4096 bytes, logical page size = 4096 bytes
Physical: 1048576 Kbytes, lockable: 724060 Kbytes, available: 841844 Kbytes

Thus, it looks like I've got about 1 gig of memory.

Devbinder Singh & Lynn Calback
==============================
No, it's not a runaway cron job, nor a runaway user with too many forks. I can guarantee the former, as cron is not activated for anyone but root. I'm almost 100% sure of the latter, as I know what sorts of stuff we're developing and nothing should be doing that sort of forking. Also, it happens too regularly, and under too similar conditions, to believe that it's anything more than a resource problem.

I now have the trial version of Glance Plus operating on the machine with the problem. (Thanks to those who pointed out that it existed.) I've attached a .gif file of the system *while* it was in the 'outage' state.
During this time I was receiving the following messages:

- The fork function failed. Too many processes already exist.
- sort: There is not enough memory available to perform the sort.
- There is not enough memory available now.
- No more processes.

Things I've noticed:
1) The system always starts acting flaky when the swap-space utilized gets up around 73%. I have never seen it go over 75% swap-usage, and I wouldn't expect to be getting this sort of grief unless I was right near 100%.
2) There were even fewer processes running this time than last time we were running into these problems: GP showed only about 345 procs. That leads me to believe that it is NOT a problem with the number of processes, but with memory. Confirmation? Analysis? Suggestions?

Alright, that's all I can think to throw out there just now. I can probably answer any questions better now than I could 2 weeks ago; I've got more experience with the "glitch" and am armed with better stats.

Appreciate the assistance.
john strumila
Occasional Advisor

Re: Running out of some resource... but what?

The first thing I would do is check my:
maxdsiz
maxssiz
max?siz (can't remember, but similar to the previous two)

Total "malloc"s are limited by heap, stack and data/code.
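
A quick way to see what those per-process limits are currently set to (maxtsiz, the text-segment limit, is presumably the third one meant; this is just a sketch):

# per-process segment-size ceilings, in bytes -- malloc lives in the data
# segment, so maxdsiz is usually the one that bites first
/usr/sbin/kmtune | egrep 'maxdsiz|maxssiz|maxtsiz'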
Carlos Fernandez Riera
Honored Contributor

Re: Running out of some resource... but what?


Perhaps there is more than one cause:

No more processes:
This tells you that the proc table is full, so at some moment you had close to 6000 processes running.

Why? Some process is spawning lots of new processes; when that process dies, all of its child processes die too. Then you run ps -ef | wc -l and there are only about 400 processes left.

Run sar -v 5 500 and watch how the process count grows. You need to have sar -v running well beforehand, because once this problem is present you cannot start a new process (No more processes).

Cannot allocate memory:

A process requests a memory allocation but the request fails. There is no more memory -- maybe due to lots of running processes, or a process requesting memory in a loop, or a single request for a huge amount of memory.

Cannot fork:

A new process requests a memory (swap) reservation, but there is no more space to reserve.


A simple script can cause these errors:

while true
do
    sleep 10 &
done

How many processes can this script create?

Up to maxuprc for that user (or nproc for the whole system).
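
If something like that is suspected, counting children per parent PID during an episode points at the looping process quickly (a rough one-liner sketch):

# top 10 parent PIDs by number of children; a runaway forker stands out
ps -ef | awk 'NR > 1 { kids[$3]++ } END { for (p in kids) print kids[p], p }' | sort -rn | head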
unsupported
Suhas_2
Regular Advisor

Re: Running out of some resource... but what?

Tim,

We faced a similar problem. Please check whether any user is running a program that forks child processes in an infinite loop. You need to identify that program and take the person to task.

Regd...
Never say "Die"
Tommy Brown
Respected Contributor

Re: Running out of some resource... but what?

Hi Tim, you asked about how to fish...
I can't show you how, but I think I found a point of confusion in your parameters.
More is not always better, especially in:
> dbc_max_pct=25
> dbc_min_pct=5
Already 50% and 5% respectively.
This parameter sets the buffer cache memory reservation. I don't know why the default is 50%, but in Perf and Tune classes and elsewhere it is recommended to reduce it to 25% or maybe less, depending upon your system's function. Buffer cache is for block I/O and may not be heavily used in some environments. A couple of documents from HP are:
http://hp3.m0.net/m/s.asp?H2271218049X871330
http://hp3.m0.net/m/s.asp?H2271218049X871331
http://hp3.m0.net/m/s.asp?H2271218049X871332
These are performance tuning documents.
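
To put rough numbers on it: with dbc_max_pct=50 on a 1 GB machine, the dynamic buffer cache is allowed to grow to about half of RAM, which is a lot to hand over to block I/O on a compile server. A back-of-the-envelope sketch:

# rough ceiling on the dynamic buffer cache, in MB
RAM_MB=1024
DBC_MAX_PCT=50
echo "buffer cache may grow to about `expr $RAM_MB \* $DBC_MAX_PCT / 100` MB"   # ~512 MB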
Tommy
I may be slow, but I get there !