"nulptr dereferences trap enabled"

SOLVED
vmguy
Frequent Advisor

I have two nearly identical HPUX ia64 systems.

An identical library on each (confirmed with cksum) shows different output from chatr:

* "nulptr dereferences trap enabled"
* "nulptr references enabled"

The first system leaves a zombie process when the application is shut down normally.

The second does not.

If these systems are identical, what kernel configuration setting is causing null pointer dereferences to be trapped?

How do I change that?
Is it hardware differences, or software?

Google is virtually silent on this.

HPUX ia64 B.11.23d
HP Case ID: 1603027519

59 REPLIES
Patrick Wallek
Honored Contributor

Re: "nulptr dereferences trap enabled"

You should start by checking the differences in patches between the 2 systems.

# show_patches

is a good place to start.
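
One way to narrow down the differences, sketched in Python: collect the patch IDs from each machine into two lists and take the set difference in both directions. The patch IDs below are hypothetical placeholders, not taken from either machine, and the function name is made up for illustration.

```python
def patch_differences(machine1, machine2):
    """Return (only_on_machine1, only_on_machine2) as sorted lists."""
    first, second = set(machine1), set(machine2)
    return sorted(first - second), sorted(second - first)


# Hypothetical patch IDs, for illustration only.
machine1_patches = ["PHSS_00001", "PHSS_00002", "PHKL_00003"]
machine2_patches = ["PHSS_00001", "PHKL_00003", "PHSS_00004"]

only1, only2 = patch_differences(machine1_patches, machine2_patches)
print("only on machine1:", only1)
print("only on machine2:", only2)
```

The two resulting lists are the candidates to review; patches present on both machines drop out.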
Dennis Handly
Acclaimed Contributor

Re: "nulptr dereferences trap enabled"

>An identical library on each (confirmed with cksum) shows different output from chatr:
>* "nulptr dereferences trap enabled"
>* "nulptr references enabled"

(The -z/-Z option isn't valid for shlibs. So this makes no difference, especially if the same cksum.)

As Patrick says, you have different linker/dld patches on the two systems. I specifically asked that the chatr(1) message be changed to be more understandable. I'm assuming the first one has the newer linker.

>causing null pointer dereferences to be trapped

This is a good thing(tm).
vmguy
Frequent Advisor

Re: "nulptr dereferences trap enabled"

Patrick:
> check the differences in patches

I've checked, but I don't have enough experience to know what the _significant_ differences are.

Attached is my own format of differences derived from inventory.xml from each machine.

"machine1" fails (crashes)
"machine2" does not.

Dennis:
> (The -z/-Z option isn't valid for shlibs. So this makes no difference, especially if the same cksum.)

Perhaps you missed my point. This is a 3rd party library, downloaded to each machine from the vendor, and chatr gives different results.

The reason for using "cksum" is to discover, during install, if the library had been altered. It was not altered.

The development guru for the product seems pretty certain that -z provides the functionality needed.

It's just that each HPUX has a different understanding of the library behaviour.

> you have different linker/dld patches

Can you explain why that makes a difference?
I didn't build this library on each machine.
Only chatr (and ultimately the execution stack) treats this binary library differently.

> I'm assuming the first one has the newer linker

The message is from chatr, not the linker.

>causing null pointer dereferences to be trapped ... is a good thing(tm).

Great engineering theory. Terrible production problem if it cannot be controlled. I don't have control over this library, but it causes the software that uses it to create zombies on shutdown.

-----
Note: we have used tusc to observe the shutdown system calls. Everything looks normal, except the process doesn't actually come down.

Many methods were tried to get more information on the issue.

kill pid ... produces a zombie
kill -6 pid ... never produces a core, just a zombie
kill -9 pid ... does not remove the process, nor can we get rid of the zombie.

------
I'm suspicious of the suggestion that this is merely a difference in message format.

One ignores null pointer references.
One causes a "trap" (some HPUX mechanism) when a null pointer is dereferenced.

All of this is during shutdown of the process. The software works correctly in all other respects.

What is the mechanism for trapping null pointer dereferences?
vmguy
Frequent Advisor

Re: "nulptr dereferences trap enabled"

Oooo, I learned something new in my google wanderings.

Google
[ "null pointer" trap -java site:hp.com ]

http://forums11.itrc.hp.com/service/forums/questionanswer.do?threadId=1259127

"If I "chatr -z " (enable the "nulptr dereferences trap")on the exe"

I didn't know you could use chatr this way.

Should I be instructing the customer to use:

chatr -Z library.so
vmguy
Frequent Advisor

Re: "nulptr dereferences trap enabled"

There goes that idea. Dennis has already contributed to the thread.

> Are you sure you used chatr(1) on the executable and not some shared lib?

And now I understand the confusion about the -z ... one is used at compile time, the other at execution.

I don't know if the compiler builds in special code for handling null pointers, but the explanation provided (thanks Dennis) makes sense:

> In order to implement the -Z default, a special null pointer page is used by the kernel. All processes with -Z share it.

So perhaps I should be instructing the user to use:

chatr -z executable_name

so that this global page zero is not used?

Thanks for the valuable discussion so far.
Dennis Handly
Acclaimed Contributor

Re: "nulptr dereferences trap enabled"

>but I don't have enough experience to know what the _significant_ differences are.

I already told you, linker/dld patches.
You do have PHSS_37947. Machine2 seems to have this old patch PHSS_34353.

(But this is besides Patrick's point, the party line is if one fails, you make it like the other. :-)

>Perhaps you missed my point. This is a 3rd party library, downloaded to each machine from the vendor, and chatr gives different results.

I don't see how, I told you exactly what patches to check. And I mentioned that the different chatr results are not important for 2 reasons, only that there are differences.

>The reason for using "cksum" is to discover, during install, if the library had been altered.

Exactly and that goes to what I said. I was the proximal cause of the change to chatr, JAGag09149 in PHSS_34852.

>The development guru for the product seems pretty certain that -z provides the functionality needed.

I'm not sure how? -z detects sloppiness. It won't fix things.

>It's just that each HP-UX has a different understanding of the library behaviour.

Not really, just better wording.

>Can you explain why that makes a difference? Only chatr (and ultimately the execution stack) treats this binary library differently.

dld is the software that handles shlibs and process start/exit. So, dld has everything to do with the execution.

>The message is from chatr, not the linker.

The inference is that if chatr changes, so does dld.

>but it causes the software that uses it to create zombies on shutdown.

You haven't explained how it fails and there may be NO connection with -z and the abort. There are plenty of ways dld can hose you over.

>Everything looks normal, except the process doesn't actually come down.

If you see no signals, then it is likely unrelated.

>kill -9 pid does not remove the process, nor can we get rid of the zombie.

You get rid of zombies by killing the zombie master.

>I'm suspicious that of the suggestion that this is merely a difference in message format.

Why?? That's exactly what JAGag09149 did. Also that difference means that dld is different.

>All of this is during shutdown of the process.

dld has had problems in this area.

>What is the mechanism for trapping null pointer dereferences?

Hardware R/W protection on page 0.

>Should I be instructing the customer to use: chatr -Z library.so

I said that was useless, it is effective only on the executable.

>-z one is used a compile time, the other at execution.

No, one is used at link time, the other post link.

>I don't know if the compiler builds in special code for handling null pointers

Why? That's what the hardware is for.

>So perhaps I should be instructing the user to use: chatr -z executable_name

That may catch a problem earlier. But your major point is that everything is the same except the patches installed.
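
The page 0 protection mentioned above can be observed from user space. The sketch below is Python on a generic POSIX system (an assumption; it is not HP-UX specific and does not involve chatr): a child process reads from address 0 and the parent reports the signal that terminated the child. The function name is hypothetical.

```python
import ctypes
import os
import signal


def signal_from_null_read():
    """Fork a child that reads address 0; return the signal that killed it, or None."""
    pid = os.fork()
    if pid == 0:
        # Child: read a C string starting at address 0. Where page 0 is
        # unmapped or protected, the hardware faults on the first byte and
        # the kernel delivers SIGSEGV.
        ctypes.string_at(0)
        os._exit(0)  # Only reached if page 0 is readable.
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status):
        return signal.Signals(os.WTERMSIG(status))
    return None


result = signal_from_null_read()
print("child terminated by:", result)
```

Where page 0 is unmapped or protected, the child is typically terminated by SIGSEGV, which corresponds to the -z behaviour discussed here. Under the -Z default described earlier in the thread, the read would be expected to succeed instead.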
vmguy
Frequent Advisor

Re: "nulptr dereferences trap enabled"

Patrick: Thanks. I learned:

1. Somehow I seemed to have annoyed you with my ignorance of HPUX. I'm sorry.

2. Wasn't familiar with the "dld" acronym, now I am. You told me, I didn't understand.

3. Didn't know there was a zombie master on HPUX, now I do.

4. Understand now why chatr on the executable puts the process into that protected page 0 group.

"Null pointer reference" is slightly misleading; any pointer value less than the protected page size will be trapped.

5. Thanks for the PHSS_37947 patch reference. I understand now why I need that. I wasn't able to pick that out of the list of differences on my own.

-- Cheers. I'll report results when we have a resolution.
Patrick Wallek
Honored Contributor
Solution

Re: "nulptr dereferences trap enabled"

vmguy,

You didn't annoy me. My ONLY response to this thread was the first one. Dennis Handly responded the rest of the time.

Please try to keep the people that respond straight.

No offense Dennis, but I really don't want to be you! :)
vmguy
Frequent Advisor

Re: "nulptr dereferences trap enabled"

> You get rid of zombies by killing the zombie master

"zombie master" is a term of your own creation to describe the "parent pid" of a zombie process?

Please alert when you want to play games on a technical forum.

http://en.wikipedia.org/wiki/Zombie_Master

http://research.facetime.com/term_show.php?id=90
Dennis Handly
Acclaimed Contributor

Re: "nulptr dereferences trap enabled"

>"zombie master" is a term of your own creation to describe the parent PID of a zombie process?

That's correct, the term from folklore provides an identifiable name to the two modern computer science uses, spam (your second link) and bad UNIX programming.
vmguy
Frequent Advisor

Re: "nulptr dereferences trap enabled"

> bad unix programming.

I presume you're talking about dld.

Our code does not fail on any other unix platform. It is a standard unix technique to make mission critical software a daemon.

Patch PHSS_37947 is already in place, and the problem persists. We're stuck.
Dennis Handly
Acclaimed Contributor

Re: "nulptr dereferences trap enabled"

>I presume you're talking about dld.

No, we were talking about zombie masters and their lack of proper handling of SIGCHLD.

>It is a standard unix technique

I thought that caused the parent to exit so INIT is the parent? If so, there can't be zombies, just orphans.

>Patch PHSS_37947 is already in place, and the problem persists.

Are the checksums of the two /usr/lib/dld.sl the same?
If they are, you'll need to look at other patch differences.

At the top you mentioned 1603027519. If you are in contact with the Response Center, why are you asking questions here?
vmguy
Frequent Advisor

Re: "nulptr dereferences trap enabled"

> Why am I asking questions here

Good question. Why indeed is HP tech support blowing us off on an important issue?

I'm asking questions here because the HP channel is silent with resolutions.

Why are you asking? This is not fun.
Dennis Handly
Acclaimed Contributor

Re: "nulptr dereferences trap enabled"

>I'm asking questions here because the HP channel is silent with resolutions.

Did they mention the shutdown for two weeks may delay responses?

>Why are you asking?

I didn't want to waste time asking questions and making suggestions that they were already doing. And the first would be getting the patch level on the two machines the same.
Laurent Menase
Honored Contributor

Re: "nulptr dereferences trap enabled"

Apparently you have a contract number problem and your call was closed.
If you have a software support contract, escalate. (Support doesn't stop during the holiday closure.)

You can try to take a "tusc -p -E -o resfile pidofprocesses" from the processes (parent and child) started before the shutdown of the application.
It is possible that the parent process stays hung on some syscall on closure, so that SIGCLD is never handled, leaving zombies.

I like "zombie master"; I'll reuse it.
vmguy
Frequent Advisor

Re: "nulptr dereferences trap enabled"

Yes, we have tusc output. It all looks normal to me.

I'm involved with an escalation because no progress has been made for 2 months.

Sure, go ahead and use "Zombie Master", just make sure the listener understands what you're talking about. I wasted a bunch of time chasing a non-existent unix concept.

For 30 years, ppid has been sufficient to describe the condition. I thought it was rather innovative for HPUX to create a "zombie master" that orphans could cling to instead of init (ppid = 1).

Invent ... doesn't apply.
Dennis Handly
Acclaimed Contributor

Re: "nulptr dereferences trap enabled"

>PPID has been sufficient to describe the condition.

Where is the fun in that? :-)

>that orphans could cling to instead of init (ppid = 1)

That's still the case. When the zombie master is killed, init kills/reaps the orphaned zombies.

In any case, this is the secondary issue compared to finding out what patch is missing or broken. Have you compared the checksums for /usr/lib/dld.sl?
Laurent Menase
Honored Contributor

Re: "nulptr dereferences trap enabled"

Case ID 1603027519 looks like a short cut case, so you probably have another case ID.
Did they elevate to WTEC/L3 support?
Do you see the SIGCLD in the tusc output?
What syscall is the parent process hung on?
Is the parent process killable?
Did you get a crashinfo -l -s -v -t of the system while the zombie processes are present, to see the state of the parent process?
vmguy
Frequent Advisor

Re: "nulptr dereferences trap enabled"

Prior to server shutdown, ps output looks like this:

admin 20479 1 0 09:20:20 pts/0 00:04 /usr/local/../server64

After normal shutdown (no kill involved):

-3 20479 1 0 09:20:20 pts/0 00:07 /usr/local/../server64

The tusc output shows what I expect in a process shutdown:

munmap()
unlink()
lwp_detached_exit()
exit(0) <--- last entry in the log

I don't understand the comments about buggy application code related to waiting for SIGCHLD. The parent pid is init.

This is _very_ mature application code running on all flavors of unix and windows.
Laurent Menase
Honored Contributor

Re: "nulptr dereferences trap enabled"

I don't doubt your application, but one possibility for zombies not being collected arises when the parent process is exiting: the process must close all its children procs.
If that close() hangs for some kernel-related reason, zombie children will not be collected, and the application can't be killed.

Then only crashinfo -l -s -v -t can help to find what happened.

Dennis Handly
Acclaimed Contributor

Re: "nulptr dereferences trap enabled"

>Prior to server shutdown, ps output looks like this:
admin 20479 1 0 09:20:20 pts/0 00:04 /usr/local/../server64

This is not a zombie.

>After normal shutdown (no kill involved):
-3 20479 1 0 09:20:20 pts/0 00:07 /usr/local/../server64

Hmm, why the -3? If this stays like this, this isn't a zombie, it is wedged. Are there other zombies that are children of this process?

>The tusc output shows what I expect in a process shutdown:
lwp_detached_exit()
exit(0) <--- last entry in the log

After the exit system call, it can only be the kernel that has messed up.

>I don't understand the comments about buggy application code related to waiting for SIGCHLD. The parent pid is init.

A zombie is a child of a process that has sloppy code that doesn't handle the death (SIGCHLD) of the child. Unless 20479 has zombie children, there are no zombies here. I suppose a zombie master could be created if the parent is hung on an NFS mount before it can handle SIGCHLD.
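
A minimal sketch of that definition, in Python on a generic POSIX system (an assumption; nothing here is HP-UX specific). The parent forks a child that exits at once, leaves it unreaped, and checks whether the PID still exists before and after SIGKILL and after wait. The helper name is hypothetical.

```python
import os
import signal


def process_exists(pid):
    """Return True while pid still occupies a process-table slot."""
    try:
        os.kill(pid, 0)  # Signal 0 only checks for existence.
        return True
    except ProcessLookupError:
        return False


pid = os.fork()
if pid == 0:
    os._exit(0)  # Child exits immediately.

# Block until the child has exited, but leave it unreaped (WNOWAIT).
# This is the state left behind by a parent that never calls wait().
os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)
zombie_present = process_exists(pid)

# SIGKILL cannot remove it: the process is already dead.
os.kill(pid, signal.SIGKILL)
present_after_kill = process_exists(pid)

# Collecting the exit status is what releases the slot.
os.waitpid(pid, 0)
present_after_wait = process_exists(pid)

print(zombie_present, present_after_kill, present_after_wait)
```

The entry should survive the SIGKILL and disappear only once the parent collects the exit status. That is why a true zombie is cleared by fixing or killing its parent, not by signalling the zombie itself.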

>This is _very_ mature application code

This is meaningless if there is a kernel bug that causes a hang on the exit system call.

We need the output (before and after) of this ps command:
UNIX95=EXTENDED_PS ps -Hfu admin
(Make sure you select "Retain format" or attach a file.)

>Laurent: one of the possibility for zombies to not be collected is when the parent process is exiting,

In this case the application is already broken and is a zombie master. I.e. it should handle the SIGCHLD as soon as possible and not wait until the end.

>the process must close all its children procs.

I'm not sure what you mean here? This is not windows, only files are closed. Orphaned processes are reparented to init.
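
The reparenting can be sketched in Python on a generic POSIX system (an assumption; the function name is hypothetical). A middle process starts a grandchild and exits without waiting for it; the grandchild then reports which process has adopted it.

```python
import os
import time


def orphan_parent_pids():
    """Return (middle_pid, parent PID seen by the orphan after the middle exits)."""
    read_end, write_end = os.pipe()
    middle = os.fork()
    if middle == 0:
        middle_pid = os.getpid()
        if os.fork() == 0:
            # Grandchild: wait (up to about 5 s) until the kernel has
            # reparented it, then report the new parent PID via the pipe.
            os.close(read_end)
            for _ in range(500):
                if os.getppid() != middle_pid:
                    break
                time.sleep(0.01)
            os.write(write_end, str(os.getppid()).encode())
            os._exit(0)
        os._exit(0)  # Middle process exits, orphaning the grandchild.
    os.close(write_end)
    os.waitpid(middle, 0)          # Reap the middle process so it is no zombie.
    data = os.read(read_end, 64)   # Blocks until the orphan reports.
    os.close(read_end)
    return middle, int(data)


middle_pid, adopted_by = orphan_parent_pids()
print("middle process:", middle_pid, "orphan's new parent:", adopted_by)
```

On a classic UNIX system the new parent is init (PID 1); inside containers or under a process supervisor it may be a different reaper process. Either way the orphan has a parent that will collect its exit status.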
Laurent Menase
Honored Contributor

Re: "nulptr dereferences trap enabled"

Indeed, I made a slip; I meant all its file descriptors.
The parent process closes all its file descriptors (sockets or files).
Dennis Handly
Acclaimed Contributor

Re: "nulptr dereferences trap enabled"

>Laurent: parent process close all its filedescriptors - socket or files

Ah, perhaps using gpm before and after to look at the open files may help?

Or write a program to call pstat(2) to get the open files?
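
As a rough stand-in for such a program, the Python sketch below lists the open descriptors of the current process by probing each number with os.fstat. This is not pstat(2) and cannot inspect another process; it only illustrates the before/after comparison. The function name and the limit of 256 are assumptions.

```python
import os


def open_descriptors(limit=256):
    """Return the descriptor numbers below limit that are open in this process."""
    found = []
    for fd in range(limit):
        try:
            os.fstat(fd)
        except OSError:
            continue  # EBADF: this descriptor is not open.
        found.append(fd)
    return found


before = set(open_descriptors())
read_end, write_end = os.pipe()  # Stand-in for a socket that is never closed.
leaked = set(open_descriptors()) - before
os.close(read_end)
os.close(write_end)
after_close = set(open_descriptors()) - before
print("leaked:", sorted(leaked), "after close:", sorted(after_close))
```

Taking the same snapshot just before shutdown and comparing it with the descriptors that the trace shows being closed would point at any descriptor left open.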
vmguy
Frequent Advisor

Re: "nulptr dereferences trap enabled"

> process must close all its children procs.

There are no child procs. This is multi-threaded, not multi-process.

>Prior to server shutdown, ps output looks like this:
> admin 20479 1 0 09:20:20 pts/0 00:04 /usr/local/../server64

> This is not a zombie.

Of course not ... "prior to shutdown" is the key phrase. You kept pushing on the concept of our application failing to issue wait().
There is no wait ... no child process ... init is the only process here with small children to care for.

> why -3 ... wedged

Don't know, because I'm not driving the tests, so I don't know if the process remains like this. This is a very repeatable experiment that only a reboot can clear up.

Is this something I should track?

> After the exit system call, it can only be the kernel that has messed up.

Agreed.

>This is _very_ mature application code

>This is meaningless if there is a kernel bug that causes a hang on the exit system call.

Now we're all on the same page. That's the meaningful part ... this is an HPUX kernel bug.

> We need the output (before and after) of this ps command:
> UNIX95=EXTENDED_PS ps -Hfu admin

Precisely the output I requested and have shown in my previous post. There is no complicated process structure here.

>the process must close all file descriptors

I didn't count them all up, but all the file descriptors involved with connect/fcntl/accept/send/recv activity in the tusc output received a zero return code from shutdown(##, SHUT_RDWR) and close(##) requests.

You have a toy program that demonstrates such a failure?

Oops, I just found a reference to:
accept(9,....) = 11

With no close(9) request at shutdown.

Is HPUX "fuser" robust enough, or should I be installing "lsof" ? How about post-exit "netstat" output?

> perhaps using gpm before and after to look at the open files may help?

"gpm" / glance is not part of my normal debugging toolbag. I can suggest it to the onsite HP team.

You can write a test program to generate such a condition? I've never seen unix behave in such a fashion.

"wedged" and "zombie" seem interchangeable to me. If the process can't be killed with_no_mercy ... it walks and feels like a zombie.