Received unhandled signal: 15

Leon Allen
Regular Advisor

Hello

On Friday the 13th, for the first time ever, Oracle crashed! Impossible? But true!

HP-UX 11i on an rp5430 with reasonably up-to-date patches and NO sign of any OS or HW problems whatsoever.
-->swapinfo
             Kb      Kb       Kb   PCT  START/      Kb
TYPE      AVAIL    USED     FREE  USED   LIMIT RESERVE  PRI  NAME
dev     4194304       0  4194304    0%       0       -    1  /dev/vg00/lvol2
dev     4194304 1078712  3115592   26%       0       -    0  /dev/vg08/lvol01
localfs 1048576       0  1048576    0% 1048576       0    1  /u02/paging
reserve       - 4738696 -4738696
memory  3300728  926748  2373980   28%
root@cis1: in /home/root

Suddenly, without any warning, and with NOTHING logged anywhere (no alert.log, lsnr.log, anything), ALL the Oracle processes, both system and user, terminated! Very drastic and severe!

The only traces left behind were hundreds of small trace files in the udump and bdump directories, one for each of the terminated processes. These trace files all look like:
/u01/app/oracle/admin/csccis/bdump/csccis_s000_20366.trc
Oracle9i Enterprise Edition Release 9.2.0.6.0 - 64bit Production
With the Partitioning, OLAP and Oracle Data Mining options
JServer Release 9.2.0.6.0 - Production
ORACLE_HOME = /u01/app/oracle/product/920
System name: HP-UX
Node name: cis1
Release: B.11.11
Version: U
Machine: 9000/800
Instance name: csccis
Redo thread mounted by this instance: 1
Oracle process number: 11
Unix process pid: 20366, image: oracle@cis1 (S000)

*** 2006-01-17 10:11:46.631
Received unhandled signal: 15, code=800003ffbfff68e8
Terminating.

I logged a call on Friday, but Oracle are still scratching their heads.

I'm not the only one in the world who has experienced this either - see Metalink doc IDs 608324.999, 573563.994, and several others. There are no useful responses to these Metalink articles.

Is there anyone out there with any clues?

Many Thanks

Leon Allen
Caboolture, Australia
Time's fun when your having flys (ancient frog saying)
15 REPLIES
A. Clay Stephenson
Acclaimed Contributor

Re: Received unhandled signal: 15

The most likely explanation is that Oracle did indeed receive a signal. That it was the default signal (SIGTERM = 15) makes this even more likely. Only a privileged user could signal these processes, so an ordinary user doing a kill PID could not do this. What could be happening is a kill -15 -PGID, which would send a SIGTERM to all members of a process group. It's possible that you have an accidental kill, or you may have a cron'ed or at'ed script that looks for "dead" processes.
If it ain't broke, I can fix that.
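
The process-group scenario described above can be reproduced in isolation. The following is a minimal Python sketch, assuming a POSIX system with Python 3; it has nothing to do with how Oracle itself starts its processes. The function name `sigterm_whole_group` and the sleeping stand-in processes are made up for illustration. It puts a few throwaway processes into one fresh process group, sends a single SIGTERM to the group (the equivalent of kill -15 -PGID), and reports how each member ended.

```python
import os
import signal
import subprocess
import sys

# Stand-in for a long-running server process (an assumption for the demo).
SLEEPER = [sys.executable, "-c", "import time; time.sleep(60)"]


def sigterm_whole_group(n_members=3):
    """Start n_members sleepers in one new process group, SIGTERM the group,
    and return each member's exit status as reported by Popen.wait()."""
    # The first child calls setpgrp() and so becomes leader of a new group.
    leader = subprocess.Popen(SLEEPER, preexec_fn=os.setpgrp)
    pgid = leader.pid
    members = [leader]
    for _ in range(n_members - 1):
        # Every further child joins the leader's process group before exec.
        members.append(
            subprocess.Popen(SLEEPER, preexec_fn=lambda: os.setpgid(0, pgid))
        )
    # One call signals every member of the group, like: kill -15 -PGID
    os.killpg(pgid, signal.SIGTERM)
    # A negative return code -N means "terminated by signal N".
    return [proc.wait() for proc in members]
```

Calling `sigterm_whole_group(3)` should return `[-15, -15, -15]`: one kill call, and every process in the group dies from SIGTERM without any of them having been targeted by PID.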
Leon Allen
Regular Advisor

Re: Received unhandled signal: 15

Apologies for my poor spelling/typing above.

I forgot to mention, the silly thing crashed again yesterday, with exactly the same symptoms. That's twice now in as many (work) days. I'm starting to really worry now.

On Friday, it was a quiet afternoon - not too much activity at all. Yesterday was a normal morning.

To restart after the crash, I have to do

sqlplus /nolog
connect / as sysdba
shutdown abort
startup open
.
.
lsnrctl start
agntctl start

ie the listener process and agent process seem to crash as well.

Cheers!
Leon




Time's fun when your having flys (ancient frog saying)
Leon Allen
Regular Advisor

Re: Received unhandled signal: 15

Thanks for your prompt reply Clay.

Given the non-specific times, I haven't suspected a cron job (p.m. one day, a.m. two days later).

I've checked the .sh_history of root and oracle to see if there was any funny business going on - but did not detect anything.

I might do a cat or strings * | grep -l kill (or something like that) to see if anything has been scripted, but that is most unlikely (there is really only me here).

I think the crashes are genuine (as opposed to accidental or malicious).
Time's fun when your having flys (ancient frog saying)
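
The "grep -l kill" idea above could be sketched in Python roughly as follows. This assumes nothing about where scripts live on the box; the function name `files_mentioning_kill` and the set of words searched for are illustrative choices, not a complete list of ways to send a signal.

```python
import os
import re

# Whole-word match, so "skill" or "killer" are not reported.
KILL_WORD = re.compile(rb"\b(kill|killpg|pkill)\b")


def files_mentioning_kill(root, max_bytes=1 << 20):
    """Return a sorted list of regular files under root whose first
    max_bytes contain the word kill, killpg or pkill."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):
                continue  # skip FIFOs, sockets, dangling symlinks
            try:
                with open(path, "rb") as handle:
                    data = handle.read(max_bytes)
            except OSError:
                continue  # unreadable file: ignore and carry on
            if KILL_WORD.search(data):
                hits.append(path)
    return sorted(hits)
```

Pointing it at directories that hold cron scripts, application start/stop scripts and home directories would give a short list of candidates to read by hand. Binaries that call kill() directly will only show up if the symbol name survives in the file.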
Yogeeraj_1
Honored Contributor

Re: Received unhandled signal: 15

hi Leon,

did you install any patch or other software recently?

also, did you change any kernel parameters recently?

I would also check my syslog file and run a "analyze table ... validate structure" on the tables (if feasible) or do a full database export to make sure there is no form of data corruption somewhere....

also, make sure your backups are up to date and you can perform recovery successfully on your backup server.

take all your precautions

kind regards
yogeeraj
No person was ever honoured for what he received. Honour has been the reward for what he gave (Calvin Coolidge)
RAC_1
Honored Contributor

Re: Received unhandled signal: 15

I do not know how feasible this would be, but how about enabling auditing and monitoring only the kill call?
There is no substitute to HARDWORK
Leon Allen
Regular Advisor

Re: Received unhandled signal: 15

Thanks Yog. No recent changes. A nightly export of the database is done. This indicates no corruption (it would fail if there were). I will do the analyse.

And thanks RAC; yes, Clay's perspective on it was interesting, and did get me thinking. Thinking so much my brain started to hurt, and just yesterday I turned on auditing of processes, including kill, via SAM. I checked what accounts could potentially initiate a kill - we have a GIS account which is a member of the dba group, and which through the GIS application (ArcSDE) can execute a range of commands. I'm going to keep an eye on this, in case it happens again.
Time's fun when your having flys (ancient frog saying)
rick jones
Honored Contributor

Re: Received unhandled signal: 15

I don't know that tusc could show the origin of the SIGTERM but you could try hanging a tusc off of one of the processes - if all the processes were dying with SIGTERMs indeed perhaps the pgroup got it, or maybe a main parent process got upset and decided to take-out the entire family.

IIRC there are ways to register signal handlers such that they can be given a siginfo_t structure that includes information about the origin of the signal. Perhaps Oracle could create a "bugcatcher" that does this if there is no convenient auditing mechanism available.
there is no rest for the wicked yet the virtuous have no pillows
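
The "bugcatcher" idea above can be illustrated outside Oracle. The sketch below is an assumption-laden Python example for a Linux-like POSIX system with Python 3.3 or later, not something Oracle provides and not tested on HP-UX. It uses `signal.sigtimedwait`, which returns a siginfo structure carrying the sender's PID and UID. The helper names `block_watched` and `next_origin` are invented for the example.

```python
import os
import signal

WATCHED = {signal.SIGTERM}


def block_watched():
    """Block SIGTERM in the calling thread so that an incoming SIGTERM
    stays pending instead of terminating the process."""
    signal.pthread_sigmask(signal.SIG_BLOCK, WATCHED)


def next_origin(timeout=5.0):
    """Wait up to timeout seconds for a pending SIGTERM and describe its
    sender. Returns None if no signal arrived in time."""
    info = signal.sigtimedwait(WATCHED, timeout)
    if info is None:
        return None
    return {
        "signal": info.si_signo,      # 15 for SIGTERM
        "sender_pid": info.si_pid,    # PID of the process that called kill()
        "sender_uid": info.si_uid,    # real UID of that process
    }
```

A process that calls `block_watched()` at startup and then loops on `next_origin()` can log who sent each SIGTERM before deciding whether to exit. The sender fields are only meaningful for signals sent with kill() by another process; the same information is what a C handler installed with SA_SIGINFO would receive in its siginfo_t argument.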
Leon Allen
Regular Advisor

Re: Received unhandled signal: 15

Thanks Rick - I'll look into your suggestions.

The audit trail so far (oracle user)

-->audisp -c kill -u oracle /u03/.secure/etc/audfile4
users and aids:
oracle
12
Selected the following events:
37
All ttys are selected.
Selecting successful & failed events.
TIME PID E EVENT PPID AID RUID RGID EUID EGID TTY

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
060118 14:41:22 17288 S 37 17287 12 102 102 102 102 pts/tf
[ Event=kill; User=oracle; Real Grp=dba; Eff.Grp=dba; ]

RETURN_VALUE 1 = 0;
PARAM #1 (int) = 0
PARAM #2 (int) = 15
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
060118 14:41:22 17288 S 37 17287 12 102 102 102 102 pts/tf
[ Event=kill; User=oracle; Real Grp=dba; Eff.Grp=dba; ]

RETURN_VALUE 1 = 0;
PARAM #1 (int) = 0
PARAM #2 (int) = 26
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
root@cis1: in /home/root
-->


There are a lot more kills by root. I presume this is all normal (Oracle hasn't crashed again yet), and I might see an 'exception' in this audit log if / when it does crash.

What does the above data tell me? PARAM #1 looks like a pid, and PARAM #2 looks like a signum? (That is the opposite order to the parameters of the actual kill command.)
Time's fun when your having flys (ancient frog saying)
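
On the question of reading these records: in the POSIX kill() system call the first argument is the target and the second is the signal number. A target of 0 means "every process in the caller's own process group", -1 means "every process the caller is allowed to signal", and any other negative value means "every process in process group |pid|". The sketch below is a rough Python parser written only against the excerpt pasted above; the column positions are assumed from its header line (TIME PID E EVENT PPID ...), and the names `parse_kill_records` and `describe_target` are made up for illustration.

```python
import re

PARAM = re.compile(r"PARAM #(\d+) \(int\) = (-?\d+)")
EVENT = re.compile(r"Event=(\w+); User=(\w+);")


def parse_kill_records(text):
    """Turn audisp-style text (layout assumed from the excerpt) into dicts."""
    records = []
    current = None
    for raw in text.splitlines():
        fields = raw.split()
        # A record header starts with a 6-digit date and an hh:mm:ss time.
        if (len(fields) >= 6
                and re.fullmatch(r"\d{6}", fields[0])
                and re.fullmatch(r"\d\d:\d\d:\d\d", fields[1])):
            current = {"date": fields[0], "time": fields[1],
                       "pid": int(fields[2]), "ppid": int(fields[5]),
                       "params": {}}
            records.append(current)
            continue
        if current is None:
            continue
        event = EVENT.search(raw)
        if event:
            current["event"], current["user"] = event.groups()
        param = PARAM.search(raw)
        if param:
            current["params"][int(param.group(1))] = int(param.group(2))
    return records


def describe_target(pid_arg):
    """Explain the first argument of kill() as POSIX defines it."""
    if pid_arg > 0:
        return f"process {pid_arg}"
    if pid_arg == 0:
        return "every process in the caller's own process group"
    if pid_arg == -1:
        return "every process the caller may signal"
    return f"every process in process group {-pid_arg}"
```

Read this way, the first record above appears to say that process 17288 (parent 17287, user oracle, on pts/tf) called kill(0, 15), i.e. it sent signal 15 to its own process group. A record like that with a timestamp matching a crash, issued from a process sharing a group with the Oracle processes, would be the thing to look for.
<imports>
</imports>
<test>
sample = """060118 14:41:22 17288 S 37 17287 12 102 102 102 102 pts/tf
[ Event=kill; User=oracle; Real Grp=dba; Eff.Grp=dba; ]

RETURN_VALUE 1 = 0;
PARAM #1 (int) = 0
PARAM #2 (int) = 15
~~~~~~~~~~~~~~~~
060118 14:41:22 17288 S 37 17287 12 102 102 102 102 pts/tf
[ Event=kill; User=oracle; Real Grp=dba; Eff.Grp=dba; ]

RETURN_VALUE 1 = 0;
PARAM #1 (int) = 0
PARAM #2 (int) = 26
"""
recs = parse_kill_records(sample)
assert len(recs) == 2
assert recs[0]["pid"] == 17288 and recs[0]["ppid"] == 17287
assert recs[0]["event"] == "kill" and recs[0]["user"] == "oracle"
assert recs[0]["params"] == {1: 0, 2: 15}
assert recs[1]["params"] == {1: 0, 2: 26}
assert describe_target(0) == "every process in the caller's own process group"
assert describe_target(-20366) == "every process in process group 20366"
assert describe_target(20366) == "process 20366"
</test>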
rick jones
Honored Contributor

Re: Received unhandled signal: 15

I'm not overly familiar with the auditing stuff - is it auditing the "kill" command, or is it auditing the kill system call, or both?
there is no rest for the wicked yet the virtuous have no pillows
Leon Allen
Regular Advisor

Re: Received unhandled signal: 15

I suspect just the system call :-(
Other processes in this category with kill are exit, fork, mlock........
Time's fun when your having flys (ancient frog saying)
rick jones
Honored Contributor

Re: Received unhandled signal: 15

Given that the kill command uses the kill system call, auditing the system call is goodness. If it were only auditing the command, that would leave a big gaping hole.
there is no rest for the wicked yet the virtuous have no pillows
Simon Wickham_6
Regular Advisor

Re: Received unhandled signal: 15

Hi Leon,

To receive a default signal (SIGTERM = 15) would mean it was issued by a user with top level access. I would begin by checking there are no cron jobs running.

Was the second crash at about the same time as the previous one? Also check the alert log.

Regards,
Simon
Leon Allen
Regular Advisor

Re: Received unhandled signal: 15

Different times, 1st one 15:30; 2nd time 10:11.

Nothing what-so-ever in the alert log pertaining to this.

Specifically, what privileged account?

root, for sure.

Could the oracle account kill all the processes in a group?

Could another user, a member of the dba group, kill all the processes in a process group?

What is the best way to monitor / capture this info? (I've already turned auditing of kill process on through sam)

My gut feel is still that it's a genuine crash - there are other cases of this problem on Metalink - but no resolution for it is detailed.

Cheers!
Time's fun when your having flys (ancient frog saying)
Simon Wickham_6
Regular Advisor

Re: Received unhandled signal: 15

Hi Leon,

Do you have any core dumps, or are there any Oracle ORA- messages? It may be worth enabling tracing now and going from there.

Regards,
Simon
Leon Allen
Regular Advisor

Re: Received unhandled signal: 15

I have turned on some tracing, as per the recommendations above, and waited for the problem to occur again. Well, it has happened again (7/2/2006 14:11) - all Oracle processes killed with signal 15. I am closing this thread and opening a new one in the hope of understanding what the auditing is telling me, and how I might take the next step towards resolving this problem.
Time's fun when your having flys (ancient frog saying)