Re: "nulptr dereferences trap enabled"

Laurent Menase · ‎12-24-2008

I don't doubt your application, but one possible reason for zombies not being collected is that, when the parent process is exiting, the process must close all its children procs.
If that close() hangs for some kernel-related reason, zombie children will not be collected, and the application can't be killed.

Then only crashinfo -l -s -v -t can help to find what happened.

Dennis Handly · ‎12-24-2008

>Prior to server shutdown, ps output looks like this:
admin 20479 1 0 09:20:20 pts/0 00:04 /usr/local/../server64

This is not a zombie.

>After normal shutdown (no kill involved):
-3 20479 1 0 09:20:20 pts/0 00:07 /usr/local/../server64

Hmm, why the -3? If this stays like this, this isn't a zombie, it is wedged. Are there other zombies that are children of this process?

>The tusc output shows what I expect in a process shutdown:
lwp_detached_exit()
exit(0) <--- last entry in the log

After the exit system call, it can only be the kernel that has messed up.

>I don't understand the comments about buggy application code related to waiting for SIGCHLD. The parent pid is init.

A zombie is a child of a process that has sloppy code that doesn't handle the death (SIGCHLD) of the child. Unless 20479 has zombie children, there are no zombies here. I suppose a zombie master could be created if the parent is hung on an NFS mount before it can handle SIGCHLD.
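To make the definition concrete, here is a minimal sketch in Python. It is not from this thread and is not HP-UX specific; the function name `zombie_demo` is hypothetical, and it assumes a POSIX system where `kill(pid, 0)` still succeeds for a child that has exited but has not yet been waited for. It forks a child that exits at once, checks that the pid still exists, reaps the child, and checks again.

```python
import os


def zombie_demo():
    """Fork a child that exits at once and report whether its pid exists
    before and after the parent reaps it."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: exit immediately without running Python cleanup handlers.
        os._exit(0)

    # Parent: drop our copy of the write end, then block until the child's
    # copy is closed by its exit (read returns b"" at end of file).
    os.close(write_fd)
    os.read(read_fd, 1)
    os.close(read_fd)

    def pid_exists():
        try:
            os.kill(pid, 0)  # signal 0 only checks that the pid exists
            return True
        except ProcessLookupError:
            return False

    before = pid_exists()  # child has exited but has not been waited for
    os.waitpid(pid, 0)     # reaping is what removes the defunct entry
    after = pid_exists()   # the process table entry should now be gone
    return before, after
```

On a typical system `zombie_demo()` should return `(True, False)`: the exited child lingers only until its parent calls wait. A process whose parent is init does not stay in that state, because init reaps its children.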

>This is _very_ mature application code

This is meaningless if there is a kernel bug that causes a hang on the exit system call.

We need the output (before and after) of this ps command:
UNIX95=EXTENDED_PS ps -Hfu admin
(Make sure you select "Retain format" or attach a file.)

>Laurent: one of the possibilities for zombies to not be collected is when the parent process is exiting,

In this case the application is already broken and is a zombie master. I.e. it should handle the SIGCHLD as soon as possible and not wait until the end.
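As an illustration of handling SIGCHLD promptly, the following Python sketch reaps children from a signal handler instead of waiting until shutdown. The name `reap_with_sigchld` and its parameters are hypothetical, and it assumes it is called from the main thread, since Python only allows signal handlers to be installed there. It is a sketch of the general technique, not the application's code.

```python
import os
import signal
import time


def reap_with_sigchld(n_children=3, timeout=5.0):
    """Start n_children short-lived children and reap each one from a
    SIGCHLD handler. Returns (started_pids, reaped_pids)."""
    reaped = set()

    def on_sigchld(signum, frame):
        # Signals can coalesce, so collect every child that has exited.
        while True:
            try:
                pid, _status = os.waitpid(-1, os.WNOHANG)
            except ChildProcessError:
                return  # no children left at all
            if pid == 0:
                return  # children remain, but none has exited yet
            reaped.add(pid)

    previous = signal.signal(signal.SIGCHLD, on_sigchld)
    try:
        started = set()
        for _ in range(n_children):
            pid = os.fork()
            if pid == 0:
                os._exit(0)  # child: exit immediately
            started.add(pid)

        # Wait (with a deadline) until the handler has seen every child.
        deadline = time.monotonic() + timeout
        while not started <= reaped and time.monotonic() < deadline:
            time.sleep(0.01)
    finally:
        # Restore whatever handler was installed before the demo.
        signal.signal(
            signal.SIGCHLD,
            previous if previous is not None else signal.SIG_DFL,
        )
    return started, reaped
```

The loop around `waitpid(-1, WNOHANG)` matters: several children can exit while only one SIGCHLD is delivered, so a handler that reaps a single child per signal can still leave zombies behind.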

>the process must close all its children procs.

I'm not sure what you mean here. This is not Windows; only files are closed. Orphaned processes are reparented to init.

Laurent Menase · ‎12-24-2008

Indeed, I made a slip; I meant all its file descriptors:
the parent process closes all its file descriptors (sockets or files).

Dennis Handly · ‎12-24-2008

>Laurent: parent process close all its filedescriptors - socket or files

Ah, perhaps using gpm before and after to look at the open files may help?

Or write a program to call pstat(2) to get the open files?
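pstat(2) is HP-UX specific and can inspect another process, so it is not reproduced here. As a portable stand-in, this Python sketch lists the descriptors open in the calling process by probing each number with `fstat`. The function name `open_descriptors` and the `max_fd` cap are assumptions made for the example; it only shows the idea of taking a before/after snapshot of open files and sockets.

```python
import os
import stat


def open_descriptors(max_fd=1024):
    """Return {fd: kind} for descriptors open in the current process."""
    found = {}
    for fd in range(max_fd):
        try:
            mode = os.fstat(fd).st_mode
        except OSError:
            continue  # EBADF: this descriptor is not open
        if stat.S_ISSOCK(mode):
            kind = "socket"
        elif stat.S_ISFIFO(mode):
            kind = "pipe"
        elif stat.S_ISREG(mode):
            kind = "file"
        elif stat.S_ISDIR(mode):
            kind = "directory"
        elif stat.S_ISCHR(mode):
            kind = "char device"
        else:
            kind = "other"
        found[fd] = kind
    return found
```

Calling it once before shutdown begins and once just before exit, then comparing the two dictionaries, would show which descriptors were never closed.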

vmguy · ‎12-24-2008

> process must close all its children procs.

There are no child procs. This is multi-threaded, not multi-process.

>Prior to server shutdown, ps output looks like this:
> admin 20479 1 0 09:20:20 pts/0 00:04 /usr/local/../server64

> This is not a zombie.

Of course not ... "prior to shutdown" is the key phrase. You kept pushing on the concept of our application failing to issue wait().
There is no wait ... no child process ... init is the only process here with small children to care for.

> why -3 ... wedged

Don't know, because I'm not driving the tests, so I don't know if the process remains like this. This is a very repeatable experiment that only a reboot can clear up.

Is this something I should track?

> After the exit system call, it can only be the kernel that has messed up.

Agreed.

>This is _very_ mature application code

>This is meaningless if there is a kernel bug that causes a hang on the exit system call.

Now we're all on the same page. That's the meaningful part ... this is an HPUX kernel bug.

> We need the output (before and after) of this ps command:
> UNIX95=EXTENDED_PS ps -Hfu admin

Precisely the output I requested and have shown in my previous post. There is no complicated process structure here.

>the process must close all file descriptors

I didn't count them all up, but all the file descriptors involved with connect/fcntl/accept/send/recv activity in the tusc output received a zero return code from shutdown(##, SHUT_RDWR) and close(##) requests.

You have a toy program that demonstrates such a failure?

Oops, I just found a reference to:
accept(9,....) = 11

With no close(9) request at shutdown.

Is HPUX "fuser" robust enough, or should I be installing "lsof"? How about post-exit "netstat" output?

> perhaps using gpm before and after to look at the open files may help?

"gpm" / glance is not part of my normal debugging toolbag. I can suggest it to the onsite HP team.

You can write a test program to generate such a condition? I've never seen unix behave in such a fashion.

"wedged" and "zombie" seem interchangeable to me. If the process can't be killed with_no_mercy ... it walks and feels like a zombie.

Laurent Menase · ‎12-24-2008

As I already said,

crashinfo -l -s -v -t
is a good start (crashinfo is a WTEC/L3 support tool).
It will give the stack trace, and a TOC dump may well be necessary after that.

Ask HP support to elevate your call to L3, where it should have been elevated already.

vmguy · ‎12-24-2008

> crashinfo -l -s -v -t

Thanks Laurent, will report when we get that.

vmguy · ‎12-24-2008

crashinfo seems to be an internal HP tool, only supplied by Tech Support.

There is a version of it posted here:

http://forums13.itrc.hp.com/service/forums/questionanswer.do?threadId=1089090&admit=109447627+1230143352115+28353475

Could you explain what these flags do?

> crashinfo -l -s -v -t

Do I have to get the latest version from HP?

It's not clear to me whether this will diagnose zombie processes, or if it is just used for a panic crash or core file.

Note: we can't get a core file. kill -6 hangs the process without producing a core file. That's a huge stumbling block for remote debugging.

What do I tell the onsite team?

1. Escalate to L3
2. Get crashinfo
3. Run with these flags, and return output to L3.

Anything else?

Dennis Handly · ‎12-24-2008

>There are no child procs. This is multi-threaded, not multi-process.

Then there are no zombies.

>Of course not, "prior to shutdown" is the key phrase.
>There is no complicated process structure here.

That's not what I meant. Since its parent is init, it can't be a zombie.
I asked because you incorrectly mentioned zombies, which have to have a complicated process structure.

>With no close(9) request at shutdown.

This could be the problem. And having multiple threads could also be an issue.

>"wedged" and "zombie" seem interchangeable to me. If the process can't be killed it walks and feels like a zombie.

They are completely different. A zombie is a well-defined term: a defunct process, due to sloppy application programming. A hung process is due to a bad design of UNIX (I/O hung) or a bug in the kernel.

And as I said, a zombie can be killed by killing the zombie master.

>It's not clear to me whether this will diagnose zombie processes

It will help in diagnosing your hung process.

vmguy · ‎12-24-2008

Are you really following what I am saying?

The process runs correctly, and the parent pid is init. Of course it's not a zombie.
Who said it was?

The process is then told to shut down.
exit() never completes; the process is hung and holds onto critical resources, which prevents it from being restarted without rebooting the machine.

I really don't care what you call the second process state.

I call it "HP's problem".

> And as I said, a zombie can be killed by killing the zombie master.

Sure, stick to your story. Most people following this thread know that you cannot kill init.

>With no close(9) request at shutdown.
>This could be the problem.

I have 4 tusc outputs. Only one showed an accept(9) system call that wasn't paired with a close(9). All 4 shutdowns show a clean exit(0), and the process remains hung.

If that's a problem, prove it with a toy program that exhibits the failure.

> And having multiple threads could also be an issue

Oh my ... a problem for modern software, or just HPUX. That's just plain silly.

>It's not clear to me whether this will diagnose zombie processes
>It will help in diagnosing your hung process.

And I'll just have to trust you on this, because the crashinfo binary (it's not a shell script I could hack) is not documented in a public place, and is only available through HP support.

Conclusion: it's HP's problem to solve.
I don't see how I can contribute anything more.
