multiplying defunct processes

 
Evelyn Daroga
Regular Advisor

Thought I posted this, but can't find it, so reposting; sorry if it's a duplicate:

We patched the system (HPUX B.11.00 U 9000/800) last weekend (after several years) and the process we have always used to kill users still logged in is suddenly causing rapidly multiplying defunct processes.

The shell script that runs nightly basically does this:
1. Find all "ksh" processes and kill the ppid and pid (kill -9 ppid pid)
2. Find any remaining processes locally attached to the Oracle db and kill the ppid and pid.
3. Find any remaining processes non-locally attached to the Oracle db and kill the ppid and pid.

In June, we stopped executing the first step, and only looked for processes actually attached to the db and killed them. No problems. After hitting this problem this week, the first step was reinstated, and then altered to only kill the ppid (not the pid) of the "ksh" process, but that did not help. Here is a summary of what happens:

"Normal" user session:
root 4095 955 0 08:59:28 pts/tad 0:00 telnetd
upmay 4096 4095 0 08:59:30 pts/tad 0:05 -ksh
upmay 17366 4096 0 12:08:27 pts/tad 0:07 quick subdict=search auto=/uk_home/jervis/v63yoln/MENUGO.qkg
upmay 17382 17366 0 12:08:29 ? 0:00 oracleUK (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))

Now the kill script kicks in:
upmay 4096 1 1 08:59:30 ? 0:05 -ksh
upmay 1918 1 0 19:01:44 ? 0:00
upmay 2362 4096 0 19:01:45 ? 0:00 -ksh

And 10 seconds later:
upmay 4096 1 1 08:59:30 ? 0:05 -ksh
upmay 7649 1 0 19:02:06 ? 0:00

And about 20 seconds later:
upmay 4096 1 0 08:59:30 ? 0:05 -ksh
upmay 7651 1 1 19:02:06 ? 0:00
upmay 7649 1 0 19:02:06 ? 0:00
upmay 9016 4096 0 19:02:11 ? 0:00
upmay 9493 1 0 19:02:13 ? 0:00
upmay 9492 1 0 19:02:13 ? 0:00
upmay 11412 1 1 19:02:21 ? 0:00
upmay 11411 1 0 19:02:21 ? 0:00
upmay 13228 1 1 19:02:28 ? 0:00
upmay 16543 1 0 19:02:41 ? 0:00
upmay 13156 1 0 19:02:27 ? 0:00
upmay 13624 1 2 19:02:29 ? 0:00
upmay 13623 1 0 19:02:29 ? 0:00
upmay 17020 1 1 19:02:42 ? 0:00
upmay 15600 1 2 19:02:37 ? 0:00
upmay 17935 1 1 19:02:46 ? 0:00
upmay 15527 1 1 19:02:37 ? 0:00
upmay 16946 1 0 19:02:42 ? 0:00
upmay 18311 1 0 19:02:47 ? 0:00
upmay 18415 1 3 19:02:48 ? 0:00
upmay 16616 1 0 19:02:41 ? 0:00
upmay 20387 1 0 19:02:55 ? 0:00
upmay 20472 1 1 19:02:56 ? 0:00
upmay 23532 1 0 19:03:07 ? 0:00
upmay 23531 1 1 19:03:07 ? 0:00

These processes continue to multiply until the process table is filled. This seems to happen only when the kill script is run at 7:00pm each night. I can kill a user session from the unix prompt in exactly the same manner and cannot seem to reproduce this problem. Any ideas, anyone ????
Sundar_7
Honored Contributor

Re: multiplying defunct processes

Evelyn,

It will help if you can post the kill script.

-Sundar.
Learn What to do ,How to do and more importantly When to do ?
Pete Randall
Outstanding Contributor

Re: multiplying defunct processes

Evelyn Daroga
Regular Advisor

Re: multiplying defunct processes

Here's what the script does:
----------------------------------
# Pass 1: kill the parent (PPID, field 3) of each ksh process,
# skipping lines that match isusers.list and users whose name starts with "c"
ps -ef | \
    tail +2 | \
    grep ksh | \
    grep -v "`cat /fh_scripts/isusers.list`" | \
    awk '$1 !~ /^c/ {print "kill -9 " $3}' | \
    sh

# Pass 2: kill PPID and PID of locally attached Oracle sessions
for db in `echo ${dbarray[*]}`
do
    echo "Now killing LOCAL users in: $db"
    ps -ef | \
        tail +2 | \
        grep oracle$db | \
        grep -v oracle"$db"0899 | \
        grep "LOCAL=YES" | \
        grep -v "^ oracle" | \
        awk '{print "kill -9 " $3," ", $2}' | \
        sh
done

# Pass 3: kill the PID of non-locally attached Oracle sessions
for db in `echo ${dbarray[*]}`
do
    echo "Now killing NON-LOCAL users in: $db"
    ps -ef | \
        tail +2 | \
        grep oracle$db | \
        grep -v oracle"$db"0899 | \
        grep "LOCAL=NO" | \
        awk '{print "kill -9 " $2}' | \
        sh
done
----------------------------

Thanks.....
Evelyn Daroga
Regular Advisor

Re: multiplying defunct processes

Thanks, Pete! Not sure how I did that.
I've closed that one.
Sundar_7
Honored Contributor

Re: multiplying defunct processes

Evelyn,

I am not too sure about killing the PPID. If the child ksh is killed, the parent (telnetd, rlogind or sshd) typically exits gracefully.

I am also not too comfortable with just grepping for ksh. That will kill processes that you don't want killed.

Try this

kill -9 $(ps -ef | egrep -i "^-ksh$|^ksh$" | grep -f /fh_scripts/isusers.list | egrep -v "^c" | awk '{print $2}')

Sundar.
Learn What to do ,How to do and more importantly When to do ?
D Block 2
Respected Contributor

Re: multiplying defunct processes

what is the parent behavior ?

upmay 4096 1 0 08:59:30 ? 0:05 -ksh


My guess is that it wakes up, sees that a child is not responding, and starts another child process, and so on, as a good daemon might.

Can you kill JUST the parent daemon first so that all its children die at the same time? Have you tried this?
Golf is a Good Walk Spoiled, Mark Twain.
Sundar_7
Honored Contributor

Re: multiplying defunct processes

Duh, sorry, my egrep won't work. I hope this one does:

kill -9 $(ps -ef | awk '$NF == "-ksh" || $NF == "ksh"' | grep -v -f /fh_scripts/isusers.list | egrep -v "^c" | awk '{print $2}')
Learn What to do ,How to do and more importantly When to do ?
Evelyn Daroga
Regular Advisor

Re: multiplying defunct processes

Tom,
exactly which "parent daemon" are you saying I should kill? telnetd? I don't think I want to do that. Telnetd is parent to ksh, which is parent to the rest of the session processes, so I've been killing ksh. In this example, the "family" is:

child                parent
955 (telnetd)        1 (inetd)
4095 (ksh)           955 (telnetd)
4096 (quick...)      4095 (ksh)
17366 (oracleUK...)  4096 (quick...)

At this point, the first pass is killing just 4095 (ksh). So, is telnetd spawning replacement ksh processes?

Sundar,
Looks like your cmd does the same thing as mine, although more efficiently. I noticed, however, that you are killing $2 rather than $3. So, you would kill 4096 rather than 4095?

BTW I have run this script on our development system with no problems. That system was patched a couple of months ago (same patchset). Obviously, I cannot run it on the production system except once a day at 7:00pm.
Evelyn Daroga
Regular Advisor

Re: multiplying defunct processes

Sorry everybody.. I have described the family incorrectly. Here is the correct version:

child             parent          command
955 (inetd)       1 (init)        inetd
4095 (telnetd)    955 (inetd)     telnetd
4096 (ksh)        4095 (telnetd)  ksh
17366 (quick)     4096 (ksh)      quick
17382 (oracleUK)  17366 (quick)   oracleUK

I've been killing 4095, the parent of ksh.
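
For anyone checking a family tree like the one above, here is a minimal sketch that walks from a PID up through its parents using plain `ps -ef` output. The function name `ancestry` and the 64-hop limit are my own assumptions, not part of the kill script; it relies only on PID being field 2 and PPID being field 3.

```shell
# ancestry PID: read `ps -ef` output on stdin and print the ps line for PID,
# then for its parent, grandparent, and so on (hypothetical helper).
ancestry() {
    awk -v start="$1" '
        # Keep only data rows (numeric PID); this skips the header line.
        $2 ~ /^[0-9]+$/ { ppid[$2] = $3; line[$2] = $0 }
        END {
            pid = start
            # Stop at an unknown PID, at init (PID 1), or after 64 hops.
            for (hops = 0; hops < 64 && (pid in ppid); hops++) {
                print line[pid]
                if (pid + 0 == 1) break
                pid = ppid[pid]
            }
        }'
}
```

Usage would be something like `ps -ef | ancestry 17382`, which shows which process each kill is really aimed at before anything is signalled.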

Sundar_7
Honored Contributor

Re: multiplying defunct processes

Well, what can I say? Lucky you, you got features along with the patches!!

If you ask me, I would kill the PID instead of the PPID.
Learn What to do ,How to do and more importantly When to do ?
A. Clay Stephenson
Acclaimed Contributor

Re: multiplying defunct processes

I would shoot someone who threw kill -9's around this indiscriminately. At the very least you are leaving temporary files and probably shared memory segments dangling. Start with the PIDs rather than the PPIDs and kill in this order: -15, -1, -2, -3, -11, and finally (and only if necessary) -9. You should also test whether the process is still alive by sending a kill -0 PID between each of the progressive kills; if ${?} is 0 the process is still alive, but if it is non-zero the process is gone.
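
A minimal sketch of that escalation, for illustration only. The function name `graceful_kill` and the default 2-second pause between signals are assumptions of mine; the signal order is the one listed above.

```shell
# graceful_kill PID [PAUSE]: try progressively harsher signals, checking with
# kill -0 before each one (hypothetical helper; PAUSE defaults to 2 seconds).
graceful_kill() {
    pid=$1
    pause=${2:-2}
    for sig in 15 1 2 3 11 9
    do
        # kill -0 delivers nothing; it only reports whether the PID still exists.
        kill -0 "$pid" 2>/dev/null || return 0
        kill -"$sig" "$pid" 2>/dev/null
        sleep "$pause"
    done
    # Non-zero return if the process is still there even after kill -9.
    if kill -0 "$pid" 2>/dev/null
    then
        return 1
    fi
    return 0
}
```

In the nightly script this would replace the generated `kill -9` lines, for example `graceful_kill 4096`. A process that exits on the first signal never sees the later ones.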
If it ain't broke, I can fix that.
Evelyn Daroga
Regular Advisor

Re: multiplying defunct processes

Ok, I'll change from killing ppid to killing pid tonight. I'll advise and assign points next week if it works. Then I'll clean up the kill -9's. And then I guess I should put a gun to my head!
Thanks for all the input -- I appreciate the support!
Sundar_7
Honored Contributor

Re: multiplying defunct processes

I believe shell processes (ksh/sh) ignore SIGTERM. My attempt to send -15 to shell processes never managed to kill the processes. I always had to use -9.
Learn What to do ,How to do and more importantly When to do ?
A. Clay Stephenson
Acclaimed Contributor

Re: multiplying defunct processes

Nevertheless, when killing processes, start with a peashooter and end with a cannon. That's why I listed the signals in that order. Note that a kill -11 is almost as deadly as a kill -9, but it does clean up.
If it ain't broke, I can fix that.
Evelyn Daroga
Regular Advisor

Re: multiplying defunct processes

Thank you Sundar for the supportive comment! I appreciate it.
Clay, your comments are well taken. The pea-shooter/cannon analogy is a good one.

I have changed the script to kill the PIDs rather than the PPIDs, and I am still getting the defunct processes. On Sunday I ran the script again to kill user sessions, and the defunct processes still multiplied. Once I got them cleaned up so that nobody was logged in except me, I logged in 4 times, ran the script, and got NO defunct processes. I then logged in 6 times, and then 20 times, ran the script, and got NO defunct processes. So I am unable to reproduce this at will. I am logging in as any other user (not a priv'd account), and the processes associated with each login are the same as in the scenario I described last week.

From what I can tell, it appears that when I kill the PPID, it is the "ksh" process that is spawning the defunct processes. But when I kill the PID, it is the "quick" process that is spawning the defunct processes.

Could I be exceeding some threshold?? Any other ideas?
Evelyn Daroga
Regular Advisor

Re: multiplying defunct processes

Well, after fighting for another week, it appears that my kills may have been clobbering each other. I put a sleep command between them and the defunct processes seem to be under control. Thanks for the input, everyone.
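
The final script is not posted here, so the following is only a sketch of the idea of pausing between kills. The helper name `kill_with_pause` and its arguments are assumptions for illustration: it reads one PID per line and sleeps after each signal instead of piping a batch of kill commands to sh all at once.

```shell
# kill_with_pause SIGNAL PAUSE: read PIDs on stdin, send SIGNAL to each one,
# and wait PAUSE seconds before moving to the next (hypothetical helper).
kill_with_pause() {
    sig=$1
    pause=$2
    while read pid
    do
        kill -"$sig" "$pid" 2>/dev/null
        sleep "$pause"
    done
}
```

In the loops of the kill script, the trailing `awk '{print "kill -9 " $2}' | sh` could then become something like `awk '{print $2}' | kill_with_pause 15 2`, which gives each process time to exit before the next one is signalled.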