Operating System - HP-UX

Colin Carwardine_1
New Member

auto kill of runaway processes

I have an N class and an L class with many heavy SAS users. I have seen that when users exit without closing the SAS application properly, 'runaway' processes are created. A few of these and it drops the 6 CPU box to its knees. These "always" also show up as defunct, and I was wondering if anyone had a script to clean these up? Only restriction: it cannot be one that simply looks at processing times. We have processes that run for 4 days, so a time limit cannot be used as the criterion for killing. I'm wondering about one that checks for defunct processes, logs them by PID and user, checks again in 5 minutes (strike 2), and again in 5 minutes; if it's still there, strike three!! It is killed.

Anyone have thoughts or script out there that might fit the bill? Getting sick of killing these by hand. Many users rotate through our platform here and I try to explain exit protocol but these happen every day... if left too long...they really drop the performance level.

Thanks.

It's ALL pensionable....
17 REPLIES
Charles McCary
Valued Contributor

Re: auto kill of runaway processes

defunct or "zombie" processes can only be cleared via reboot.

Colin Carwardine_1
New Member

Re: auto kill of runaway processes

Not sure what you mean ... I've been killing these with kill -9 PID# and that has been solving our problem for 2 mths now. performance returns to normal.
It's ALL pensionable....
Charles McCary
Valued Contributor

Re: auto kill of runaway processes

Well, most of the time you cannot kill defunct processes, so there must be something different about the parent process of your "defunct" child processes.

You should be able to kill these based on your criteria then, here's some pseudocode:

1) I'd build a list of processes, call it 5_min_list.

sleep 5 minutes
2) check to see if the processes in your five minute list are still running, if they are kill them.

3) restart at step 1.

Hope this helps
Charles McCary
Valued Contributor

Re: auto kill of runaway processes

Here's one way to do it:


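while true
do
    # Record the PIDs currently listed as defunct.
    # The [d] keeps grep from matching its own entry in the ps output.
    ps -eaf | grep '[d]efunct' | awk '{print $2}' > PIDS

    sleep 300

    for PID in `cat PIDS`
    do
        # Is this exact PID still listed as defunct five minutes later?
        IS_IT=`ps -eaf | grep '[d]efunct' | awk -v pid="$PID" '$2 == pid'`
        if [ "${IS_IT}" = '' ]
        then
            echo "$PID no longer running"
        else
            kill $PID
        fi
    done
done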
someone_4
Honored Contributor

Re: auto kill of runaway processes

hi,
can you post the ps -ef|grep output of what you get killing each time?

If you want to kill by program name you can do:

#!/bin/ksh
pid_var=`ps -ef | grep program_name | grep -v grep | awk '{print $2}'`

kill -9 $pid_var



Richard
Paula J Frazer-Campbell
Honored Contributor

Re: auto kill of runaway processes

Hi Colin

This may help:-

-----------------cut-------------

#!/bin/sh
# Get info on the users
# Change the vdx to suit the particular user

ps -ef | grep " vdx" | grep -v grep | awk '{print $2, $5, $7}' | while read pid time cpu
do
# Strip the ':'
cpu00=`echo $cpu | sed 's/://'`
time00=`echo $time | sed 's/://'`
time000=`echo $time00 | sed 's/://'`
currenttime=`date "+%H%M%S"`
ddate=`date`
#
# Calculate total time connected
timeon=`print $currenttime - $time000|bc`
#
# CPU usage
# 180 seconds
if [ $cpu00 -gt 180 ]
then
echo $ddate
echo EXCESSIVE CPU USER ON : $pid
kill $pid # Graceful kill
sleep 5
kill -9 $pid
fi
done
-------------------cut-----------------


Paula
If you can spell SysAdmin then you is one - anon
Jeffrey Davis_1
Frequent Advisor

Re: auto kill of runaway processes

Hi Colin. We've had this difficulty in the past. There is a utility with SAS that you can run to reset/remove all rogue SAS processes. I believe it is called 'cleanwork' or close to that. It should be under the bin directory for SAS. Check it out.
Colin Carwardine_1
New Member

Re: auto kill of runaway processes

Hi all... thanks for the response ... will have to look at these suggestions and get back to you. It's a real pain to have to get off our internal secure network .. switch to the external network and then login.

Paula and Charles... thanks for the script .. will have a look. should help us.

Richard .. can't kill all processes running SAS .. they'd kill me here. Need to identify individual PIDs that are defunct but still consuming CPU and time. They look and act like a valid production job except they never end... and come up as defunct.

Jeffrey .. SAS is gonna drive me nuts some day.. had them involved.. no response yet. Will use your info to ride them a little harder.

This is a GREAT site. Thanks to all ... will let you know more.
It's ALL pensionable....
Jeff Schussele
Honored Contributor

Re: auto kill of runaway processes

Hi Colin,

Welcome to the forum.
We're all here to help each other.
Contribute as much as you can - in both directions.
And don't forget to award points to those who help you - as those whom you help will award you.

Rgds,
Jeff
PERSEVERANCE -- Remember, whatever does not kill you only makes you stronger!
Woo Kim Chye
Occasional Advisor

Re: auto kill of runaway processes

We also encountered this problem with our N-class server a while ago. The users just clicked the [X] button to close the application window, and the processes became strays.

What we did is that we wrote a cronjob to grep those processes and check their PPID. The stray processes have PPID set to 1. We then kill these processes.
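
A minimal sketch of that PPID check, not the script Woo Kim Chye actually used. It assumes the usual `ps -ef` column order (UID PID PPID C STIME TTY TIME CMD); the function name `list_orphans` and the search string `sas` are hypothetical. It only prints candidate PIDs, because daemons legitimately have a PPID of 1 and the list should be reviewed before any kill is added.

```shell
#!/bin/sh
# list_orphans NAME (hypothetical helper)
# Reads `ps -ef` output on stdin and prints the PID of every process whose
# PPID is 1 and whose ps line contains NAME.
# Note: NAME is matched against the whole line, so a user name containing
# NAME would match as well.
list_orphans() {
    awk -v name="$1" '
        NR > 1 && $3 == 1 && index($0, name) > 0 { print $2 }
    '
}

# Dry run from cron or the command line (prints PIDs, kills nothing):
# ps -ef | list_orphans sas
```

Keeping the filter separate from the kill makes it easy to log the candidates for a few days first and confirm that nothing legitimate is being picked up.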
Yogeeraj_1
Honored Contributor

Re: auto kill of runaway processes

hello,
just to share my experience.

Oracle 8i (8.1.7) has "Dead Connection Detection" (DCD). It seems to solve many of the lost connection problems that can occur.

Regards
Yogeeraj
No person was ever honoured for what he received. Honour has been the reward for what he gave (Calvin Coolidge)
Frank Slootweg
Honored Contributor

Re: auto kill of runaway processes

> Need to identify individual PIDs that are defunct but still consuming CPU and time.

As others have mentioned, you should post an example ps(1) output and indicate which processes you want to identify/kill.

You use the word "defunct", but not with its normal/defined meaning:

A (*real*) *defunct* process *can not* "consume CPU", so you must be referring to some other kind of process.

For a description of what defunct (zombie) processes are, see "zombie process" in the glossary(9) manual page ("man glossary").
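
One way to tell the two cases apart is to look at the process state rather than the command name. The sketch below assumes the common `ps -el` column order (F S UID PID PPID ...), where a state of Z marks a true zombie; the function name `list_zombies` is hypothetical, and the layout should be checked against ps(1) on your release.

```shell
#!/bin/sh
# list_zombies (hypothetical helper)
# Reads `ps -el` output on stdin and prints "PID PPID" for every process
# whose state column (S, the second field) is Z.
list_zombies() {
    awk 'NR > 1 && $2 == "Z" { print $4, $5 }'
}

# Example:
# ps -el | list_zombies
```

A process that is in this list is already dead and uses no CPU; the PPID column shows which parent has not yet collected it. A process that is burning CPU will show a different state and is the one worth investigating.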
Juan Manuel López
Valued Contributor

Re: auto kill of runaway processes

My experience with these problems is that a defunct process depends on its parent process: if you stop and restart the parent process, the defunct process will die too.
That means it is NOT necessary to reboot the machine to kill the defunct process.
There is no other way to kill a defunct process.
Try it and tell me how it goes.
I hope this helps you.
Juanma.
I would like to lie on a beautiful beach spending my life doing nothing, so somebody has to do this job.
Peter Kloetgen
Esteemed Contributor

Re: auto kill of runaway processes

Hi Colin,

First you have to check whether any of your SAS parent processes need the signal (SIGCHLD) that the users' child processes send when they end. If not, you could start the parent processes using the trap command so that they ignore the signals sent by the child processes.

--> trap "" SIGCHLD

This command has the effect that no SIGCHLD signal sent by child processes will ever reach the parent process. Normally child processes stay something like "zombies" for some milliseconds, until this signal reaches their parent process. When this signal is trapped, they change their behaviour and die immediately.

kill -l --> output is all defined signals with names and numbers

man trap --> how to handle the trap command

Perhaps you could check with some SAS guys whether this is OK for SAS.

Hope this helps you!

Always stay on the bright side of life!

Peter
I'm learning here as well as helping
Colin Carwardine_1
New Member

Re: auto kill of runaway processes

WE GOT IT !!

took some of your ideas and one of the guys here ran with.... tested and it works. In place now and kills runaway SAS processes that are flagged as 'defunct' after they've been there for 15 minutes. At first we only checked on "defunct" and found out quickly that cron was killed. Oh...OH.
Cron submits on the half hour and sends msg to log when SAS "defunct" process is found... checks 15 min. later - if still there, it's killed and msg sent to a log.

we may still have the occasional runaway with elm or xterm but we may add those conditions later. Already.. it's made my life a lot easier.

Thanks to all for your ideas and suggestions - have attached the script below.

Cheers.

Colin.

-------------------------------------
#!/bin/ksh

LOG="/var/adm/syslog/defunct_kill.log"

LogMsg () {

echo "$1"
echo "`date` --- $1" >> $LOG

}

find_defunct () {

AllPIDS=`ps -eaf | grep -i defunct | grep -v grep | awk '{print $3}'`

UniqueParentPIDS=""
for PID in $AllPIDS; do
# Compare whole PIDs (space-delimited) so that 12 does not match 123
if [[ " $UniqueParentPIDS " = *" $PID "* ]];then
# We've already recorded this PID (many defunct could have same parent)
continue
fi
UniqueParentPIDS="$UniqueParentPIDS $PID"
done

SasPIDS=""
for PID in $UniqueParentPIDS; do
# Look up this exact PID only; grepping the full listing for $PID also matches other lines containing those digits
ParentOwner=`ps -fp $PID | grep sas | awk '{print $1}'`

# We don't kill anything owned by root
# We don't keep anything that has no Owner (PID wasn't SAS or was shut down)
if [[ "$ParentOwner" = "root" || "$ParentOwner" = "" ]];then
continue
fi

SasPIDS="$SasPIDS $PID"
done

echo $SasPIDS
}


kill_defunct () {

OriginalDefunctSasPIDS=`find_defunct`

if [ "$OriginalDefunctSasPIDS" = "" ]; then
LogMsg "No Defunct Processes"
exit
fi

LogMsg "Found the following parents with defunct children: $OriginalDefunctSasPIDS"

sleep 900

CurrentDefunctSasPIDS=`find_defunct`
LogMsg "The following parents with defunct children are present after 15 minutes: $CurrentDefunctSasPIDS"

for PID in $OriginalDefunctSasPIDS; do
# Compare whole PIDs (space-delimited) so that 12 does not match 123
if [[ " $CurrentDefunctSasPIDS " != *" $PID "* ]];then
LogMsg "$PID is no longer defunct or has been killed"
continue
fi

LogMsg "Killing $PID"
kill $PID
sleep 5
kill -9 $PID
done

}

kill_defunct


It's ALL pensionable....
A. Clay Stephenson
Acclaimed Contributor

Re: auto kill of runaway processes

I see one glaring problem with your code. You are using kill -9. This should be your weapon of absolute last resort. Kill -9 does not let the process clean up, so shared memory segments can be left attached, temp files left behind, and so on.

You should really send a set of escalating signals to a pid in this order: 15 1 2 3 11 9.
Kill -11 is almost as sure a kill as kill -9 and does cleanup.

The best procedure would be to send a kill signal ${pid} then sleep a few seconds and send a kill -0 ${pid}. A zero return indicates that the process is still active so send the next signal and repeat the process. Only when all signals are used should you finally send the kill -9.
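
The procedure above can be sketched as a small shell function. This is only an illustration under assumptions: the name `escalate_kill` and the 5-second default wait are not from the thread, and the variables are global because plain sh has no local variables.

```shell
#!/bin/sh
# escalate_kill PID [WAIT] (hypothetical helper)
# Sends signals 15 1 2 3 11 in turn and uses 9 only as the last resort.
# WAIT is the number of seconds to pause after each signal (default 5,
# an assumed value).
escalate_kill() {
    pid=$1
    wait_secs=${2:-5}
    for sig in 15 1 2 3 11 9
    do
        # kill -0 sends no signal; it succeeds only while the PID exists
        kill -0 "$pid" 2>/dev/null || return 0
        kill -"$sig" "$pid" 2>/dev/null
        sleep "$wait_secs"
    done
    # Report failure if the process survived even signal 9
    if kill -0 "$pid" 2>/dev/null
    then
        return 1
    fi
    return 0
}

# Example, replacing a kill / sleep 5 / kill -9 sequence:
# escalate_kill "$PID" || echo "could not kill $PID"
```

The function returns as soon as kill -0 fails, so a process that exits on the first signal never receives the harsher ones.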
If it ain't broke, I can fix that.
Colin Carwardine_1
New Member

Re: auto kill of runaway processes

Thanks Clay. You're right. It wasn't a big deal because we reboot regularly each week but this will definitely keep things cleaner. Thanks again for the modification.

Colin.
It's ALL pensionable....