Re: Temporarily unable to telnet or ssh or console login into a system

Andrew Kaplan · ‎11-18-2009

Hi there --

We are running HP-UX 11.00 on an L2000 class server, which became temporarily inaccessible via telnet, ssh or console login.

The first sign of trouble was when our monitoring system, nagios 3.1.2, started getting a large number of UNKNOWN status messages from the server. When I tried to log in via ssh and telnet, the terminal window would briefly appear and then immediately close. When I tried to log in at the console, I received the following message:

prngd could not fork, not enough space

About five minutes later, I was able to log into the system via telnet. The first thing I checked was the status of the filesystems, with special interest in those in the vg00 volume group. No filesystem, in vg00 or in any other volume group, was at 100 percent capacity. The output of the top command showed idle time above ninety percent for both CPUs. A check of the syslog.log file, aside from the prngd error message, did not indicate any abnormalities in the past hour.

The issue of the nagios server getting the UNKNOWN output has been going on for several days, but today is the first time the login issue has occurred with the system.

The only recent change that has been made to the system is the following: The system in question is the master NIS server of one domain; until recently there were two domains. The master and slave NIS servers of the other domain were added, both as slaves, to the domain of the system in question. This eliminated the other domain. An edited version of the procedure that was followed to complete this task is included as an attachment.

Has anyone seen something like this, and if so, what other steps should I take to investigate this matter?

A Journey In The Quest Of Knowledge

Rita C Workman · ‎11-18-2009

"Could not fork" indicates that the system could not fork (create) another process. Possibly your 'nproc' parameter is set too low. Based on this message, I doubt the issue is a full filesystem.

Or maybe you don't have enough memory to handle the load.

Or maybe some application was spawning some process over and over and over again...thus creating this problem.

Try running some sar commands to see how much is going on.
sar -v 1 50

This will give you used/size figures for the inode, process, and file tables.
It may help you check whether it's a kernel parameter issue.
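
As a rough illustration of how to read those used/size pairs, here is a minimal Python sketch. The helper name table_usage is hypothetical, and the field values are only sample figures in the form sar -v prints; proc-sz is the column that corresponds to the process table limited by nproc.

```python
def table_usage(field):
    """Return used/size as a fraction for a sar -v field such as '170/667'."""
    used, size = field.split("/")
    return int(used) / int(size)

# Sample fields in the used/size form (illustrative values only).
sample = {"proc-sz": "170/667", "inod-sz": "3397/7384", "file-sz": "2958/12063"}

for name, field in sample.items():
    print(f"{name}: {table_usage(field):.0%} of the table in use")
```

A table that sits close to 100 percent, or a non-zero ov (overflow) count next to it, would point toward a kernel parameter that is too small.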

What is vmstat telling you? Have you been paging out ("po")? That is a bad thing. So run:
vmstat -nS 1 50

Look under page for 'po'. If that number is growing or shows too much activity, you need to see whether you should increase memory or maybe add some swap. So how much swap do you have set up? Run:
swapinfo -tam
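
The po check can also be done on captured vmstat output. This is only a sketch in Python: column_values is a hypothetical helper, it assumes a header row that names the columns followed by numeric rows, and the second data row in the sample is made up for illustration.

```python
def column_values(vmstat_text, column="po"):
    """Collect one column from vmstat-style text: a header row, then numeric rows."""
    index = None
    values = []
    for line in vmstat_text.splitlines():
        tokens = line.split()
        if index is None:
            # Still looking for the header row that names the column.
            if column in tokens:
                index = tokens.index(column)
            continue
        # Keep numeric rows only; repeated header rows are skipped.
        if len(tokens) > index and tokens[index].isdigit():
            values.append(int(tokens[index]))
    return values

sample = """\
avm free si so pi po fr de sr in sy cs
43150 33714 0 0 0 0 1 0 0 968 1662 291
43150 33700 0 0 0 0 0 0 0 1010 1500 280
"""
po = column_values(sample)
print("po per interval:", po, "(page-outs seen)" if any(po) else "(no page-out activity)")
```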

Just a few thoughts, I hope they help get you started.

Rgrds,
Rita

Andrew Kaplan · ‎11-18-2009

Hi there --

Thanks for your reply. I ran the commands you suggested, and here are the results:

The output from the sar command indicated the text-sz, proc-sz, inod-sz, and file-sz columns had readings that did not indicate resources being heavily used. The average usage is shown below:

text-sz  ov   proc-sz  ov  inod-sz    ov  file-sz     ov
N/A      N/A  170/667  0   3397/7384  0   2958/12063  0

The vmstat output, with special attention paid to the page po column, showed little or no activity there. Listed below is a sample:

VM
     memory                page               faults
   avm   free   si  so  pi  po  fr  de  sr    in    sy   cs
 43150  33714    0   0   0   0   1   0   0   968  1662  291

The faults section did show some activity. The numbers there tended to yo-yo in size for the duration of the command. I am not familiar with this command so I don't know if the numbers shown there are normal or indicate a problem.

The swapinfo command returned the output shown:

             Mb      Mb      Mb   PCT  START/      Mb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE  PRI  NAME
dev        1024       0    1024    0%       0       -    1  /dev/vg00/lvol2b
reserve       -    1012   -1012
total      1024    1012      12   99%       -       0    -

The output indicates the reserve is using all its available memory while the dev is not using any.

What are your thoughts? Thanks.


Matti_Kurkela · ‎11-18-2009

The "not enough space" part in the "could not fork, not enough space" error message refers to a critical lack of free RAM. The fact that your swap is 99% reserved suggests this too.

You don't seem to have the swapmem_on kernel parameter enabled: how much RAM does your system have? If swapmem_on is not enabled and there is less swap than RAM, your system is restricted from using all the RAM. This is because HP-UX always reserves a place in swap for all the RAM it allocates, so that the kernel always has a place to put the data if it needs the memory for something else. In other words, unlike some other operating systems, HP-UX does not over-commit memory by default.
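
To make the reservation arithmetic concrete, here is a small Python sketch. It is an illustration under stated assumptions, not a description of the kernel's exact accounting: the RAM size is hypothetical (the thread does not give it), and pseudo_fraction, the share of RAM that counts as extra reservable space when swapmem_on is enabled, is an assumed value. Only the 1024 MB of device swap and the 1012 MB reserved come from the swapinfo output posted above.

```python
def reservable_mb(device_swap_mb, ram_mb, swapmem_on, pseudo_fraction=0.75):
    """Total space the kernel can reserve against, in MB.

    pseudo_fraction is an assumption for illustration: the share of RAM
    that counts as pseudo-swap when swapmem_on is enabled.
    """
    pseudo = ram_mb * pseudo_fraction if swapmem_on else 0
    return device_swap_mb + pseudo

device_swap, reserved = 1024, 1012   # from the swapinfo output in this thread
ram = 2048                           # hypothetical: RAM size is not given in the thread

for flag in (False, True):
    headroom = reservable_mb(device_swap, ram, flag) - reserved
    print(f"swapmem_on={flag}: {headroom:.0f} MB left to reserve")
```

With swapmem_on off, only 12 MB remains to reserve no matter how much RAM is idle, which matches a fork that fails with "not enough space" on a system that otherwise looks unloaded.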

The swapmem_on parameter enables a memory management trick that allows the use of RAM even if there is not enough swap for all of it. In modern systems, this parameter should usually be enabled. (I think it was enabled by default in 11.11; in 11.31, the parameter was removed altogether.)

If there is no significant page-out activity (the po column in vmstat output), I might suspect that some long-running process has been slowly leaking memory and has now consumed almost all free space in the system.

If your monitoring system produced long-term memory usage graphs, a memory leak would be easy to see: if the free memory level of the system creeps slowly and steadily downwards while the system load remains the same, the most common cause is a memory leak in an application.
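
If the monitoring data is available as numbers rather than a graph, the same "creeping downwards" test can be approximated with a least-squares slope. A minimal Python sketch follows; the daily readings are hypothetical, and slope is an illustrative helper, not something a monitoring tool provides.

```python
def slope(samples):
    """Least-squares slope of evenly spaced samples (units per sample interval).

    Needs at least two samples.
    """
    n = len(samples)
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

# Hypothetical daily free-memory readings in MB, taken under a steady load.
free_mb = [900, 880, 861, 842, 820, 799, 781]
print(f"free memory trend: {slope(free_mb):.1f} MB per day")
```

A clearly negative slope that persists while the workload stays the same is the pattern to look for; a single dip or a sawtooth that recovers is not.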

The next step would be trying to find out where all the RAM (and swap) has gone. Use Glance, top or ps with suitable options so that you can see the Virtual Segment Size of each process. Sort the list by VSS and look at the long-running processes with the largest VSS value.
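
One way to do the sorting, sketched in Python on captured ps output. The listing below is hypothetical, and the sketch assumes three columns (size in KB, PID, command), which is the kind of output commonly obtained on HP-UX with something like UNIX95= ps -e -o vsz,pid,args. The helper name largest_by_vsz is made up for this example.

```python
def largest_by_vsz(ps_text, top=3):
    """Return (vsz_kb, pid, command) tuples sorted by vsz, largest first."""
    rows = []
    for line in ps_text.splitlines()[1:]:          # skip the header row
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[0].isdigit() and parts[1].isdigit():
            rows.append((int(parts[0]), int(parts[1]), parts[2]))
    return sorted(rows, reverse=True)[:top]

# Hypothetical listing; process names and sizes are invented.
sample = """\
  VSZ   PID COMMAND
 5120   812 /usr/sbin/syslogd
98304  1440 /opt/app/bin/server
 2048     1 init
"""
for vsz, pid, command in largest_by_vsz(sample, top=2):
    print(f"{vsz:>8} KB  pid {pid:<6} {command}")
```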

Stop and restart the process (or the application that the process belongs to) and wait a while, then check the VSS usage of that process/application again. Did it stabilize to roughly the same value as before, or something much lower?

If it's lower, it is possible that this application has been leaking, and the restart just freed the leaked memory and the system can again run happily for a while. If the VSS usage of this process again starts to slowly increase over time and does not level off, you've located a memory leak.

In this case, it's usually time to make a bug report to the developer or vendor of that application. But as you're using HP-UX 11.00, your software is probably already out of support, as is your HP-UX version.

As a work-around, you can restart the process/application periodically to recover the leaked memory. If the leak is slow, one restart per month or even one per quarter might be sufficient to keep the accumulation of leaked memory small enough to be harmless.

A program that uses shared memory resources might also leak them. Shared memory resources are not reclaimed automatically when a program exits, unless the program explicitly removes them. A shared memory leak may be identified by using the "ipcs" command and seeing an ever-increasing number of shared memory resources listed. You can use the "ipcrm" command to recover the leaked shared memory resources, but be sure to stop the application(s) first before starting to remove their shared memory resources.
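
Counting the segments in two ipcs snapshots taken some time apart is a simple way to spot that trend. The Python sketch below assumes the listing layout in which each shared memory row starts with the type tag "m"; the rows themselves are made up, and count_shm_segments is a hypothetical helper.

```python
def count_shm_segments(ipcs_text):
    """Count rows whose first field is 'm', the type tag for shared memory."""
    return sum(1 for line in ipcs_text.splitlines() if line.split()[:1] == ["m"])

# Two made-up snapshots of "ipcs -m" style output, taken at different times.
before = """\
T ID KEY MODE OWNER GROUP
Shared Memory:
m 0 0x411c0209 --rw-rw-rw- root root
m 1 0x4e0c0002 --rw-rw-rw- root root
"""
after = before + "m 2 0x00000000 --rw------- appuser users\n"

growth = count_shm_segments(after) - count_shm_segments(before)
print(f"shared memory segments: {count_shm_segments(after)} now, {growth:+d} since last snapshot")
```

A count that only ever goes up between snapshots, with the application's workload unchanged, is the sign to investigate before reaching for ipcrm.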

MK

Dennis Handly · ‎11-21-2009

>total 1024 1012 12 99%
>The output indicates the reserve is using all its available memory while the dev is not using any.

(Yes, you have interpreted reserve correctly; you have at least 1 GB of memory.)
1 GB isn't enough swap to do anything useful nowadays.

Besides enabling swapmem_on as MK mentioned, you should also add more swap.
