Re: Script Help! Trap and Kill those run-away processes!!

Yogeeraj_1 · ‎10-27-2005

Dear Experts!

My mission is to trap all the "run-away" application processes and "kill -9" them.

In fact, we have observed processes like the following in the top output during certain time intervals:

CPU TTY PID USERNAME PRI NI SIZE RES STATE TIME %WCPU %CPU COMMAND
0 ? 17750 ias122 154 20 227M 183M sleep 54:58 83.47 83.32 f60webm

Processes (f60webm) whose SIZE and RES are in the megabyte range (M) should be terminated! They are run-away processes left behind by a client crash, which is a bug in the software.

Any idea how to best do that?

[sorry i have little scripting knowledge]

thanking you all in advance for a reply.

kind regards
yogeeraj

No person was ever honoured for what he received. Honour has been the reward for what he gave (Calvin Coolidge)

Eric Antunes · ‎10-27-2005

Hi Yogeeraj,

To start, maybe you should test whether the following command returns the top output correctly:

top -d 12 -n 12 -f top_out.txt

PS: I also have little scripting experience, so don't expect a fast resolution. Meanwhile, I will be learning too.

Best Regards,

Eric

Each and every day is a good day to learn.

Ralph Grothe · ‎10-27-2005

I am not sure what criterion to use to determine if a process is a runaway,
but I'm sure you will know.

For example, you could filter for those procs whose vsize is bigger than some threshold (50 MB in the example below) and whose command string matches "web", and then process them like this (but be careful with the kill, especially SIGKILL):

UNIX95= ps -e -o vsz= -o pid= -o ppid= -o comm= | awk '50 < $1/2^10 && $4~/web/' | while read vsz pid ppid comm; do
    # do further filtering or signalling here
    :
done
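
As a minimal sketch of the filtering step, the selection logic can be put into a small function and tried on canned input before it is pointed at live ps output. The function name filter_big_procs, the 50 MB threshold and the sample lines are assumptions made for illustration; they are not part of the original script.

```shell
#!/bin/sh
# filter_big_procs MIN_MB PATTERN   (hypothetical helper name)
# Reads lines of the form "vsz_kb pid ppid comm" on stdin, i.e. the layout
# produced by: UNIX95= ps -o vsz= -o pid= -o ppid= -o comm=
# Prints the PIDs whose virtual size exceeds MIN_MB megabytes and whose
# command name matches PATTERN.
filter_big_procs() {
    awk -v min_mb="$1" -v pat="$2" '$1 / 1024 > min_mb && $4 ~ pat { print $2 }'
}

# Made-up sample data standing in for live ps output
printf '%s\n' \
    '232448 17750 1 f60webm' \
    '  2048 17751 1 f60webm' \
    '409600   812 1 oracle' |
filter_big_procs 50 'f60webm'
```

Only the first sample line is both larger than 50 MB and named f60webm, so only its PID is printed. Feeding the function from real ps output is then a matter of replacing the printf with the ps command.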

HTH

Madness, thy name is system administration

James R. Ferguson · ‎10-27-2005

Hi Yogeeraj:

Ralph's use of the UNIX95 (XPG4) variant of the 'ps' command is most appropriate and his skeletal script should provide what you need to get started.

PLEASE do not do a 'kill -9' without attempting a simple 'kill' first.

A 'kill -9' cannot be caught, and thus a process has no (programmatic) chance to clean up temporary files and/or shared memory segments.

Instead, do something like:

kill "$mypid" > /dev/null 2>&1
sleep 3
kill -9 "$mypid" > /dev/null 2>&1

If the first 'kill' works "mypid" will no longer be valid and the second 'kill' will be a "no-op". If the first 'kill' fails, then the second will terminate the process as desired (unless it is waiting on an I/O to complete or in some other kernel state).

You may wish to issue a simple 'kill' and then escalate to 'kill -1' (SIGHUP) and then if that fails, a 'kill -9' (SIGKILL). I have found this useful too.

Regards!

...JRF...

Bill Hassell · ‎10-27-2005

And whatever you do, do *NOT* use -e as your selection. Otherwise, you may accidentally kill a kernel process or a backup program or a network daemon, etc. Instead, you know the name of the program (f60webm), so let ps do what it does best: find the program for you. Use the option -C f60webm so you only look at the problem program. Then use a 3-level kill: kill -15, then if that doesn't work, kill -1, and finally kill -9 if all else fails.

Bill Hassell, sysadmin

Eric Antunes · ‎10-27-2005

Yogeeraj,

Indeed, Ralph's command is a better way to go (I knew I was going to learn more than help...)! You can also try to show more columns from ps, like state and flags, for example:

UNIX95= ps -e -o vsz= -o pid= -o ppid= -o time= -o state= -o flags= -o comm|grep 'f60'|awk '20 < $1/1024'

Ralph,

Any reason for 2^10 instead of 1024 besides the binary one?

Best Regards,

Eric

Each and every day is a good day to learn.

Hanwant Verma_1 · ‎10-27-2005

Hi

The following snippet should get you started on the 'guts' of your script to determine if your processes are out of control

#top -d 1 -h -u -f /tmp/top.tmp
#awk '$13~/BAD_PROCESS_NAME/ {print $12}' /tmp/top.tmp

this will print out the percentage of cpu time used by a process named BAD_PROCESS_NAME.

Cheers
Hanwant

Ralph Grothe · ‎10-27-2005

@ Eric,

no, it was a silly choice, since exponentiation is probably the most expensive operation processing-wise (I think some sort of Taylor series approximation is used).
Actually, exponentiation to base 2 is probably achieved quickest by bitwise shifting (right shifting for negative powers), but I didn't know how this is done in awk (in Perl there's the >> operator).
So division by 1024 is much better.

Apart from that, I strongly agree with Bill's statement not to scan all processes with -e,
but instead to restrict the scan to the procs of the user who is running the notorious *web procs (i.e. -u user|uid).
And be extra cautious with sending SIGKILLs (i.e. -9), as this could result in some orphans.
Use the suggested three-step kill.
You can check whether the process survived a signal by sending it signal 0 afterwards (kill -0).

e.g.

sleep 10
if kill -0 $pid 2>/dev/null; then
    # process is still alive: probably next kill level
    :
fi
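
The signal 0 check can be wrapped in a small polling helper, sketched below. The name wait_gone and the default timeout of 10 seconds are assumptions made for this example. Note that kill -0 also fails when the caller has no permission to signal the process, so the check is only reliable when run as root or as the process owner.

```shell
#!/bin/sh
# wait_gone PID [TIMEOUT_SECONDS]   (hypothetical helper name)
# Polls the process with signal 0 once per second.
# Returns 0 as soon as the process is gone, 1 if it is still there
# after the timeout. Plain POSIX sh, so the variables are global.
wait_gone() {
    pid=$1
    timeout=${2:-10}
    while [ "$timeout" -gt 0 ]; do
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        timeout=$((timeout - 1))
    done
    ! kill -0 "$pid" 2>/dev/null
}

# Demo on a harmless short-lived process
sleep 2 &
wait_gone $! 5 && echo "process has exited"
```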

Madness, thy name is system administration

Eric Antunes · ‎10-28-2005

Ralph,

Thanks for the explanation.

Each and every day is a good day to learn.

Yogeeraj_1 · ‎10-30-2005

hi all,

thank you for your precious replies.

Special thanks to Ralph for his insight and great explanations.

Can you please clarify: does your script include the part that checks whether SIZE and RES are in the megabyte range (e.g. 227M or 183M)?

Also, I would be grateful if you could explain the "3-step kill process".

thanking you in advance

kind regards
yogeeraj

No person was ever honoured for what he received. Honour has been the reward for what he gave (Calvin Coolidge)

RAC_1 · ‎10-30-2005

My 2 cents.

What you see in the RSS column in glance is not what you get from UNIX95 ps. RSS includes the shared memory used by the program; the UNIX95 ps output does not. If the RSS column shows 3.5GB, you would never see that figure in the UNIX95 output.

You would do better to use alarmdef to detect the runaway process. An example is in /opt/perf/examples/adviser

Also, kill -9 is not a good idea.
I escalate as follows:

kill -1 (SIGHUP)
kill -2 (SIGINT)
kill -3 (SIGQUIT)
kill -11 (SIGSEGV) and, last, kill -9 (SIGKILL)

In my experience kill -11 is equally effective and does cleanup work which -9 does not.

There is no substitute to HARDWORK

Yogeeraj_1 · ‎10-30-2005

hi,

thank you again for your reply.

I usually use top to identify the processes to be "killed"...

I am attaching a snapshot in which the following processes can be identified as run-aways that should be killed: 14203, 13399 and 1102

More guidance would be most appreciated!

kind regards
yogeeraj

No person was ever honoured for what he received. Honour has been the reward for what he gave (Calvin Coolidge)

Ralph Grothe · ‎10-30-2005

That really sounds eccentric, tearing a process down with a SIGSEGV as a last resort,
as suggested by RAC.
On the contrary, I would rather expect it to mess up the scene with a core dump, as memory reference violations usually do, than to clean up.
But I know the cleanup was meant differently here, and if it works better in that respect,
why not?

I know that the vsz that ps lists doesn't include shared memory or memory mapped pages.
According to the ps manpage it comprises the text, data, and stack segments in KB, unlike the sz option, which reports memory pages.
Thus division by 1024 gives the sum in MB and saves the extra multiplication by page size that sz would need.
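
The arithmetic can be sketched in shell as follows. The helper names kb_to_mb and pages_to_mb and the 4 KB default page size are assumptions for illustration; the actual page size should be checked on the system in question.

```shell
#!/bin/sh
# vsz is reported in KB, so one integer division gives MB.
kb_to_mb() {
    echo $(( $1 / 1024 ))
}

# sz is reported in pages, so it needs the page size (in KB) as well.
# The 4 KB default here is an assumption, not a value from the thread.
pages_to_mb() {
    echo $(( $1 * ${2:-4} / 1024 ))
}

kb_to_mb 232448       # 227, matching the 227M SIZE of the f60webm example
pages_to_mb 58112 4   # the same 227 MB expressed as 4 KB pages
```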

By three-step kill I was referring to Bill's three-level kill,
i.e. signal in this sequence, checking in between whether the process is still alive:

1. SIGTERM (-15)
2. SIGHUP (-1)
3. SIGKILL (-9)
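
The sequence above could be sketched as a shell function like the one below. The name graceful_kill and the default pause of 5 seconds between levels are assumptions made for this example, not something prescribed in the thread. The function stops escalating as soon as the signal 0 check shows that the process is gone.

```shell
#!/bin/sh
# graceful_kill PID [PAUSE_SECONDS]   (hypothetical helper name)
# Sends SIGTERM, then SIGHUP, then SIGKILL, pausing after each signal
# and checking with signal 0 whether the process is still alive.
# Returns 0 if the process is gone, 1 if it survived even SIGKILL
# (e.g. because it is stuck in an uninterruptible kernel state).
graceful_kill() {
    pid=$1
    pause=${2:-5}
    for sig in TERM HUP KILL; do
        kill -0 "$pid" 2>/dev/null || return 0   # already gone
        kill -s "$sig" "$pid" 2>/dev/null
        sleep "$pause"
    done
    ! kill -0 "$pid" 2>/dev/null
}

# Demo on a harmless background process
sleep 300 &
graceful_kill $! 1 && echo "process is gone"
```

A longer pause gives a well-behaved process more time to clean up before the next level is tried.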

Madness, thy name is system administration

RAC_1 · ‎10-30-2005

I think the real work is to decide what a runaway process is. In your case, it may be that the parent process is waiting for a child process, but the child process has died and the parent needs to be killed. In other cases it may be a process that is taking more CPU/memory than it should.

Once you decide that, it becomes easy.

glance is a very good tool for this: you can define an adviser which, when certain limits (as decided by you) are crossed, can email/page you about a culprit process.

There is no substitute to HARDWORK

Yogeeraj_1 · ‎10-30-2005

hi both,

i agree that Glance may be a better tool.

what i really want is an algorithm/script which can be safely used to kill those run-away processes.

The main reason is that we operate 24x7 and we don't have staff available to sit in front of the screen and monitor such alerts!

I am attaching a snapshot of a graph plotted from data extracted from MeasureWare! You will see what happened that Sunday!!

please guide me further!

kind regards
yogeeraj

No person was ever honoured for what he received. Honour has been the reward for what he gave (Calvin Coolidge)

Ralph Grothe · ‎10-31-2005

Hi Yogeeraj,

I didn't know you were using MWA already.
Then it's probably best you use this excellent tool.
I think you could write an appropriate event handler script that would safely kill those runaway procs of yours.
But it requires a little reading and experimenting until things are triggered by MWA correctly.
Have a look at /opt/perf/paperdocs/mwa/C.
There you will find the MWA User's Guide as PDF and PS, as well as PDF and plain ASCII files describing the system metrics that MWA can monitor
(e.g. PROC_MEM_RES).
Get acquainted with these docs.
Then you can edit /var/opt/perf/alarmdef
(make a backup copy first).
In that file you will find a well commented set of predefined general purpose alarms.
The comments also contain an example of how to trigger an email notification as a very rudimentary "event handler".
Instead of this you could trigger your kill script.
For a quick reference of the so-called adviser syntax you could also start glance and tab to its help screen -> adviser information.
MWA uses the same syntax for its alarmdef file.
Before you restart MWA with your new definitions you should check them for syntax errors with the "utility -xa" command.
During the trial phase, before your handler does any actual killing, it may be a good idea to have it email you an echoed kill line with the PIDs it would kill, so that you can check that the right procs would be killed.
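
The trial phase could look like the sketch below, where the handler only prints the kill lines unless a switch is flipped. The function name maybe_kill and the DRY_RUN variable are assumptions made for this example; the PIDs are the ones mentioned earlier in this thread.

```shell
#!/bin/sh
# maybe_kill SIGNAL PID...   (hypothetical helper name)
# With DRY_RUN=1 (the default) it only prints what it would do;
# with DRY_RUN=0 it really sends the signal.
maybe_kill() {
    sig=$1
    shift
    for pid in "$@"; do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "would run: kill -s $sig $pid"
        else
            kill -s "$sig" "$pid"
        fi
    done
}

# Trial run: nothing is killed, the lines are only printed
maybe_kill TERM 14203 13399 1102
```

The printed lines can be piped to a mail command so that they arrive as the notification described above. Once the reported PIDs have proven correct for a while, setting DRY_RUN=0 arms the handler.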

Alternatively, you could have your processes' memory consumption monitored by freely available tools such as Mon or Nagios.
Nagios in particular is a great system monitor.
For instance, it offers a ready-made check_procs plug-in that can be passed "-m RSS" or "-m VSZ" as arguments, and that will trigger alerts whenever the Warning (-w) or Critical (-c) state thresholds are exceeded.
The Nagios documentation also has an example of how to write an event handler script that restarts a failed webserver.
Best of all, Nagios maintains a very active mailing list with many users and developers participating and willing to help.

But I'm afraid the first setup usually involves some work, no matter what solution you prefer.

Madness, thy name is system administration
