Operating System - HP-UX

All Processes Get Killed In HP

 
A P Mohanty
Occasional Contributor


Hello,
At one of our customer sites we are facing a peculiar problem. We are testing a set of server and client programs that read from and write to sockets and message queues. The OS is HP-UX 11i running on D-Class and N-Class servers. At times, all processes, including other login sessions, get terminated. We have not been able to find any logs that give a clue as to what causes the processes and login sessions to terminate.

The problem occurs only when we test the servers mentioned above. We have other server applications running (both TCP/IP ones and others using message queues and shared memory), and they cause no problems. But when this problem occurs, even applications that are in no way linked to this set of servers get killed.

The new servers make heavy use of message queues and shared memory. The only difference I could identify is that they continuously read and write large numbers of messages to and from sockets and message queues. If resource usage were too high, I would expect the system to hang. But why do all the processes across all login sessions simply disappear? Any clue?

Regards
A P Mohanty
E-Mail : apm@tc4hq.cmcltd.com

3 REPLIES
Michael Tully
Honored Contributor

Re: All Processes Get Killed In HP

If these sessions are actual logins, have you checked whether the user logins have a timeout value? You can set a shell variable, for example 'TMOUT=3600'; if the session is idle for that many seconds, the shell logs the user out. Check both the user's .profile and /etc/profile.
Anyone for a Mutiny ?
Ravi_8
Honored Contributor

Re: All Processes Get Killed In HP

Hi,

Are the kernel parameters set to the values required by the message queue application?
never give up
Amit Kureel
Advisor

Re: All Processes Get Killed In HP

Mohantyji,

The most probable reason, I feel, is that one of the application processes is executing the kill(pid, signalNo) system call. If you check the man page of kill(), you will see the following behaviour (I am reproducing a portion of it):

======== man page of kill() ===

The value KILL_ALL_OTHERS is defined in the file and is guaranteed not to be the ID of any process in the system or the negation of the ID of any process in the system.

If pid is greater than zero and not equal to KILL_ALL_OTHERS, sig is sent to the process whose process ID is equal to pid. pid can equal 1 unless sig is SIGKILL or SIGSTOP.

If pid is 0, sig is sent to all processes excluding special system processes whose process group ID is equal to the process group ID of the sender.

If pid is -1 and the effective user ID of the sender is not a user who has appropriate privileges, sig is sent to all processes excluding special system processes whose real or saved user ID is equal to the real or effective user ID of the sender.

If pid is -1 and the effective user ID of the sender is a user who has appropriate privileges, sig is sent to all processes excluding special system processes.

If pid is KILL_ALL_OTHERS, kill() behaves much as when pid is equal to -1, except that sig is not sent to the calling process.

If pid is negative but not -1 or KILL_ALL_OTHERS, sig is sent to all processes (excluding special system processes) whose process group ID is equal to the absolute value of pid, and whose real and/or effective user ID meets the constraints described above for matching user IDs.

================================
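
To make the rules in the excerpt above easier to scan, here is a minimal C sketch that maps a pid argument to the set of processes kill() would address. The enum and the function name classify_kill_target are hypothetical names chosen for illustration; the function sends no signal. KILL_ALL_OTHERS is HP-UX specific, so the sketch only tests for it when the system headers define it.

```c
#include <signal.h>
#include <sys/types.h>

/* Which processes kill(pid, sig) would address, per the man page excerpt. */
enum kill_target {
    TARGET_SINGLE_PROCESS,    /* pid > 0: exactly one process */
    TARGET_OWN_PROCESS_GROUP, /* pid == 0: the sender's process group */
    TARGET_ALL_PERMITTED,     /* pid == -1: every process the sender may signal */
    TARGET_ALL_BUT_CALLER,    /* pid == KILL_ALL_OTHERS (HP-UX only) */
    TARGET_PROCESS_GROUP      /* pid < -1: the process group with ID -pid */
};

/* Hypothetical helper: classifies the pid argument without signalling anyone. */
enum kill_target classify_kill_target(pid_t pid)
{
#ifdef KILL_ALL_OTHERS
    /* Checked first because the excerpt treats it separately from pid > 0. */
    if (pid == KILL_ALL_OTHERS)
        return TARGET_ALL_BUT_CALLER;
#endif
    if (pid > 0)
        return TARGET_SINGLE_PROCESS;
    if (pid == 0)
        return TARGET_OWN_PROCESS_GROUP;
    if (pid == -1)
        return TARGET_ALL_PERMITTED;
    return TARGET_PROCESS_GROUP;
}
```

The case to watch is pid == -1: an ordinary user signals every process they own, and a privileged user signals every process except the special system processes.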

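From this you should be able to guess what is happening: because of some error condition, an application variable that is passed as the pid argument is taking a negative value (-1 in particular), and kill() ends up signalling all the processes of that user or, if the caller has appropriate privileges, all processes on the system. Now the question arises: how to find the culprit process?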
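
One common way a pid variable ends up as -1 is a failed fork(), which returns -1; if that value is later passed to kill() unchecked, the signal is broadcast. This is only an assumption about how such a bug could arise, not something known about the application in question. A minimal defensive sketch in C follows; checked_kill is a hypothetical wrapper, not an HP-UX API.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>

/*
 * Hypothetical wrapper around kill(): it forwards the call only when pid
 * names one ordinary process. It refuses pid 0 and every negative pid
 * (which address groups or all processes), pid 1 (init), and, where the
 * headers define it, the HP-UX value KILL_ALL_OTHERS.
 * On refusal it returns -1 with errno set to EINVAL and sends nothing.
 */
int checked_kill(pid_t pid, int sig)
{
    if (pid <= 1) {
        errno = EINVAL;
        return -1;
    }
#ifdef KILL_ALL_OTHERS
    if (pid == KILL_ALL_OTHERS) {
        errno = EINVAL;
        return -1;
    }
#endif
    return kill(pid, sig);
}
```

Replacing direct kill() calls in the test servers with such a wrapper, and logging every refusal together with the caller's own process ID, would at least show whether a bad pid value is reaching kill() at all.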

Well, the most straightforward approach that comes to mind is to set up auditing of the kill() system call on the test system. This can be done by following the attached Word document.

Hope it helps.

Amit Kureel
amit_kureel@yahoo.com