Unable to kill a non-zombie process. Expert help needed

 
Juan González
Trusted Contributor

Hi everyone,

This is probably the strangest issue I've ever seen in HP-UX...

With top we've detected an apparently hung process using 100% CPU, and one of the server's CPUs shows 100% SYS utilization.

The offending process is a Java process, a Tomcat server. We tried to restart Tomcat, but that failed. We then tried a plain kill, which the process survived, and finally kill -9, but the process was still there. Its state is shown as running:

From top:
CPU TTY   PID USERNAME PRI NI  SIZE    RES STATE     TIME  %WCPU   %CPU COMMAND
  3   ? 21243 root     152 20  267M 72656K run   46359:00 100.17 100.00 java

From ps -el:

401 R 0 21243 1 0 152 20 501ba600 7199 - ? 46360:44 java
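
For anyone who wants to pull the relevant fields out of that ps -el line, here is a minimal Python sketch. The column names are an assumption based on the usual HP-UX ps -l header (F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME COMD), which is not shown in the post, and parse_ps_long is a hypothetical helper name.

```python
# Assumed column order for "ps -el" on HP-UX; the header line is not in the post.
PS_COLUMNS = ["F", "S", "UID", "PID", "PPID", "C", "PRI", "NI",
              "ADDR", "SZ", "WCHAN", "TTY", "TIME", "COMD"]


def parse_ps_long(line):
    """Split one ps -el data line into a dict keyed by the assumed column names."""
    # maxsplit keeps any spaces in the command name inside the last field.
    fields = line.split(None, len(PS_COLUMNS) - 1)
    if len(fields) != len(PS_COLUMNS):
        raise ValueError("expected %d fields, got %d" % (len(PS_COLUMNS), len(fields)))
    return dict(zip(PS_COLUMNS, fields))


# The data line copied from the post.
row = parse_ps_long("401 R 0 21243 1 0 152 20 501ba600 7199 - ? 46360:44 java")
for name in ("S", "PID", "PPID", "PRI", "WCHAN"):
    print(name, "=", row[name])
```

Under that column assumption, the line reads as state R, PPID 1, PRI 152 and no wait channel ("-").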

Then we tried to trace the process with tusc but we got an error:
# tusc 21243
tusc: ttrace(TT_PROC_ATTACH, 21243, 0, 0, dad0001, 0): Permission denied
tusc: no process to attach to

Glance doesn't report this high CPU usage. For example, during low-load hours top still reports 100% CPU usage on one of the processors, while glance reports just 3% total CPU usage.

We are aware that glance metrics differ from top's and are more accurate. But glance also gives us the following per-processor CPU information:

CPU State  Util LoadAvg(1/5/15 min) CSwitch Last Pid
  2 Enable  0.0   2.0/  2.0/  2.0         0    21243

CPU Util User Nice NNice RealTm  Sys Intrpt CSwitch Trap Vfault
  2  0.0 0.00 0.00  0.00   0.00 0.00   0.00    0.00 0.00   0.00


This state doesn't change over time, so process 21243 is using CPU 2 all the time to do nothing!

Going one step further, we've used kitrace, through the runki script, to trace what is happening in the kernel. We've observed that the offending process appears every 10 ms in the kernel's hardclock() routine on processor 2. The following is an extract of this trace:

0.001814 cpu=2 seqcnt=2763898760 pid=21243 ktid=-1 utid=-1 HARDCLOCK pc=0x17b1b0 sym=_sleep state=SYS

0.011816 cpu=2 seqcnt=2763899157 pid=21243 ktid=-1 utid=-1 HARDCLOCK pc=0x17b138 sym=_sleep state=SYS

0.021811 cpu=2 seqcnt=2763899315 pid=21243 ktid=-1 utid=-1 HARDCLOCK pc=0x17d4b4 sym=splsched state=SYS

0.031811 cpu=2 seqcnt=2763899452 pid=21243 ktid=-1 utid=-1 HARDCLOCK pc=0x15a904 sym=$PIC$124 state=SYS

0.041813 cpu=2 seqcnt=2763899572 pid=21243 ktid=-1 utid=-1 HARDCLOCK pc=0x17b138 sym=_sleep state=SYS

....
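
To check the "every 10 ms" observation against the extract, here is a small Python sketch that parses the lines above, computes the gaps between consecutive HARDCLOCK samples for the PID, and counts the kernel symbols seen. The field layout is inferred only from the five lines shown, and summarize_hardclock is a hypothetical helper, not part of kitrace.

```python
import re
from collections import Counter

# The five sample lines copied from the trace extract above.
TRACE = """\
0.001814 cpu=2 seqcnt=2763898760 pid=21243 ktid=-1 utid=-1 HARDCLOCK pc=0x17b1b0 sym=_sleep state=SYS
0.011816 cpu=2 seqcnt=2763899157 pid=21243 ktid=-1 utid=-1 HARDCLOCK pc=0x17b138 sym=_sleep state=SYS
0.021811 cpu=2 seqcnt=2763899315 pid=21243 ktid=-1 utid=-1 HARDCLOCK pc=0x17d4b4 sym=splsched state=SYS
0.031811 cpu=2 seqcnt=2763899452 pid=21243 ktid=-1 utid=-1 HARDCLOCK pc=0x15a904 sym=$PIC$124 state=SYS
0.041813 cpu=2 seqcnt=2763899572 pid=21243 ktid=-1 utid=-1 HARDCLOCK pc=0x17b138 sym=_sleep state=SYS
"""

# Layout assumed from the sample: timestamp first, then key=value fields.
LINE_RE = re.compile(
    r"^(?P<time>\d+\.\d+)\s+.*?\bpid=(?P<pid>\d+)\b.*?\bHARDCLOCK\b"
    r".*?\bsym=(?P<sym>\S+)\s+state=(?P<state>\S+)"
)


def summarize_hardclock(text, pid):
    """Return (gaps in ms between samples, Counter of symbols) for one PID."""
    times = []
    symbols = Counter()
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m or int(m.group("pid")) != pid:
            continue
        times.append(float(m.group("time")))
        symbols[m.group("sym")] += 1
    gaps_ms = [round((b - a) * 1000, 3) for a, b in zip(times, times[1:])]
    return gaps_ms, symbols


gaps_ms, symbols = summarize_hardclock(TRACE, 21243)
print("gaps between samples (ms):", gaps_ms)
print("kernel symbols seen:", dict(symbols))
```

On a longer capture, the same summary would show whether the samples stay on the clock tick and which kernel routines dominate.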

Any ideas?

Thanks in advance.

Best regards,
Juan


12 REPLIES
Mark Grant
Honored Contributor

Re: Unable to kill a non-zombie process. Expert help needed

It's not that unusual, Juan. A process that is blocked on I/O or similar will not receive the signal, not even a -9.

I can't tell you exactly what this process is waiting on, but I can tell you that most of the time the "solution" is a reboot.
Never precede any demonstration with anything more predictive than "watch this"
Jeff Schussele
Honored Contributor

Re: Unable to kill a non-zombie process. Expert help needed

Hi Juan,

In that output note the priority (PRI) value.
In your case it's 152.
Anything from 128 to 153 is not only in the kernel range (128 to 177), it's in the sleep part of the kernel range and is completely nonsignalable, i.e. unkillable.
It's probably waiting on a resource almost all the time and only comes out of that range for very short periods, when it could be killed.
You might try a kill -1 (hangup) or -24 (stop) in the hope that when it does "wake up" it will act on those. If those don't work, then your best bet is to determine its parent PID (PPID) and try killing that, in the hope that it will reap its child.
But it almost appears to me to be poor programming practice that it eats up so much CPU, but is rarely signallable. If it's always waiting on some I/O resource then that resource's data *ought* to be buffered at least. Or simply it's waiting on something it will never see & it's spending a lot of wasted time in the hopes it will.
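
The priority ranges described in this reply can be written down as a tiny Python check. This is only a sketch of the rule as stated here: the 128 to 153 and 128 to 177 bounds come from the reply, while the "signalable" label for 154 to 177 is an assumption about the remainder of the kernel range, and classify_priority is a hypothetical name.

```python
def classify_priority(pri):
    """Label a PRI value using the ranges quoted in the reply above."""
    if 128 <= pri <= 153:
        # Stated in the reply: sleeping in the kernel, signals are not acted on.
        return "kernel range, nonsignalable sleep"
    if 154 <= pri <= 177:
        # Assumption: the rest of the kernel range can be interrupted by signals.
        return "kernel range, signalable"
    return "outside the kernel range"


# PRI value taken from the top and ps output in the original post.
print(152, "->", classify_priority(152))
```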

My 2 cents,
Jeff
PERSEVERANCE -- Remember, whatever does not kill you only makes you stronger!
Bharat Katkar
Honored Contributor

Re: Unable to kill a non-zombie process. Expert help needed

Juan,
Have a look at the WCHAN field in the ps output for your Java process. Look for the associated processes and try killing them first, then maybe try killing the Java process.

Hope that helps.
Regards,
You need to know a lot to actually know how little you know
RAC_1
Honored Contributor

Re: Unable to kill a non-zombie process. Expert help needed

Through glance you can check what it is waiting for: open glance, select the process, and then look at its wait state.

But as the gurus have said, anything that is in kernel mode cannot be killed.
You may try killing the parent process.

Anil
There is no substitute to HARDWORK
Steven E. Protter
Exalted Contributor

Re: Unable to kill a non-zombie process. Expert help needed

If kill -9 won't kill it, then a system reboot will be required to kill it.

I'd run mstm on this system and look for trouble.

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
Jan Sladky
Trusted Contributor

Re: Unable to kill a non-zombie process. Expert help needed

Hi Juan,

If possible, try to kill the parent process (PPID) and all Java-related processes.

br Jan
GSM, Intelligent Networks, UNIX
Navin Bhat_2
Trusted Contributor

Re: Unable to kill a non-zombie process. Expert help needed

As you can see in kitrace, the hardclock traces appear every 10 ms or at multiples of it. This comes from the ISR and timer expiration, which do various things including writing a kirecord.

What you need here is not kitrace, tusc or gdb, since the process is not signalable, but a kernel stack trace. There are utilities that can do this. I suspect this process is in a kernel trap mode and will never be released.

Since you have kitrace I suspect you also have crashinfo. Could you please run that with the -t option and paste the stack trace for this pid? That would be a start.

The best and right thing to do is to take a TOC of the system and send it to HP for analysis. Then you can see why the process is stuck and what kind of resources it is waiting for. This could be a programming issue too, and the crash dump analysis will give the answer.
Juan González
Trusted Contributor

Re: Unable to kill a non-zombie process. Expert help needed


Thanks everyone for your responses.

Following Navin's advice I've attached the stack traces of the offending process's threads.

Now I'm convinced I will have to stop the server to fix this problem. But I'd like to obtain as much information as I can about this issue to find its origin. I've also opened a case with HP and sent them all this info; I am waiting for their answer.

Best regards,
Juan

Navin Bhat_2
Trusted Contributor

Re: Unable to kill a non-zombie process. Expert help needed

So there are several threads associated with this process and one of them IS in zombie state; this could be a reaping issue.
But I would involve the Tomcat developers, because it looks like they might not be handling signals correctly in the code or following proper thread exit procedures. I doubt you will find any signals pending on these threads, other than the kills you might have issued manually.


Could you make sure you have the following patches, assuming you are on 11.11?


PHKL_28695 - 11.11 Cumulative VM, Psets, Preemption, PRM, MRG
PHKL_28410 - 11.11 vm preemption point, pdc, vhand performance
PHKL_25212 - 11.11 vm preemption point, mlock/async_io
PHKL_28529 - 11.11 VxFS mmap(2) performance improvement; vhand

Geoff Wild
Honored Contributor

Re: Unable to kill a non-zombie process. Expert help needed

Unfortunately, with the PPID being 1, I think your only solution is to reboot....

After that, I would ensure you are running latest Tomcat and patches.

Rgds...Geoff
Proverbs 3:5,6 Trust in the Lord with all your heart and lean not on your own understanding; in all your ways acknowledge him, and he will make all your paths straight.
Fred Ruffet
Honored Contributor

Re: Unable to kill a non-zombie process. Expert help needed

If it is waiting for something, you may be able to act on that something. For example, if it is waiting on a pipe, close the pipe. Can't glance give you the wait reason (PIPE, IO...)?

Regards,

Fred
--

"Reality is just a point of view." (P. K. D.)
Juan González
Trusted Contributor

Re: Unable to kill a non-zombie process. Expert help needed

Hi,

We have received the response from HP Support. The problem is related to bug JAGae65088:

Multithreaded STREAMS UP emulated driver hangs on thread exit with the following stack trace.

_switch+0xc4
thread_exit+0x1e8
thread_process_suspend+0x188
issig+0x2a4
syscall+0x8f0
syscallinit+0x554

This stack trace is the same as in our zombie thread.
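
For anyone comparing their own thread stacks against this one, here is a minimal Python sketch of the comparison. It matches on function names only and ignores the +0x offsets, which is an assumption on my part since offsets may differ between kernel builds; frame_names and matches_signature are hypothetical helper names. The sample stack is simply the one quoted above.

```python
# Function names from the stack trace quoted for JAGae65088 above.
BUG_SIGNATURE = [
    "_switch",
    "thread_exit",
    "thread_process_suspend",
    "issig",
    "syscall",
    "syscallinit",
]


def frame_names(stack_text):
    """Extract function names from lines of the form 'name+0xoffset'."""
    names = []
    for line in stack_text.splitlines():
        line = line.strip()
        if not line:
            continue
        names.append(line.split("+", 1)[0])
    return names


def matches_signature(stack_text, signature=BUG_SIGNATURE):
    """True if the stack has exactly the same frames, in the same order."""
    return frame_names(stack_text) == list(signature)


# Sample input: the trace as quoted in this post.
sample_stack = """\
_switch+0xc4
thread_exit+0x1e8
thread_process_suspend+0x188
issig+0x2a4
syscall+0x8f0
syscallinit+0x554
"""

print("matches JAGae65088 signature:", matches_signature(sample_stack))
```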

This problem is solved by patch PHNE_29825.

Thank you very much for your help.

Best regards,
Juan Gonzalez