Re: Received unhandled signal: 15

Leon Allen · ‎01-17-2006

Hello

On Friday the 13th, for the first time ever, Oracle crashed! Impossible? But true!

HP-UX 11i on an rp5430 with reasonably up-to-date patches and NO sign of any OS or HW problems whatsoever.
-->swapinfo
             Kb      Kb       Kb  PCT  START/      Kb
TYPE      AVAIL    USED     FREE USED   LIMIT RESERVE  PRI  NAME
dev     4194304       0  4194304   0%       0       -    1  /dev/vg00/lvol2
dev     4194304 1078712  3115592  26%       0       -    0  /dev/vg08/lvol01
localfs 1048576       0  1048576   0% 1048576       0    1  /u02/paging
reserve       - 4738696 -4738696
memory  3300728  926748  2373980  28%
root@cis1: in /home/root

Suddenly, without any warning, and with NOTHING logged anywhere (no alert.log, lsnr.log, anything), ALL the Oracle processes, both system and user, terminated! Very drastic and severe!

The only traces left behind were hundreds of small trace files in the udump and bdump directories, one for each of the terminated processes. These trace files all look like:
/u01/app/oracle/admin/csccis/bdump/csccis_s000_20366.trc
Oracle9i Enterprise Edition Release 9.2.0.6.0 - 64bit Production
With the Partitioning, OLAP and Oracle Data Mining options
JServer Release 9.2.0.6.0 - Production
ORACLE_HOME = /u01/app/oracle/product/920
System name: HP-UX
Node name: cis1
Release: B.11.11
Version: U
Machine: 9000/800
Instance name: csccis
Redo thread mounted by this instance: 1
Oracle process number: 11
Unix process pid: 20366, image: oracle@cis1 (S000)

*** 2006-01-17 10:11:46.631
Received unhandled signal: 15, code=800003ffbfff68e8
Terminating.

I logged a call on Friday, but Oracle are still scratching their heads.

I'm not the only one in the world who has experienced this either - see MetaLink doc IDs 608324.999, 573563.994, and several others. There are no useful responses to these MetaLink articles.

Is there anyone out there with any clues?

Many Thanks

Leon Allen
Caboolture, Australia

Time's fun when your having flys (ancient frog saying)

A. Clay Stephenson · ‎01-17-2006

The most likely explanation is that Oracle did indeed receive a signal. The fact that it was kill's default signal (SIGTERM = 15) makes this even more likely. Only a privileged user could signal these processes, so an ordinary user doing a kill PID could not do this. What could be happening is a kill -15 -PGID, which would send a SIGTERM to all members of a process group. It's possible that you have an accidental kill, or you may have a cron'ed or at'ed script that looks for "dead" processes.
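
To illustrate the process-group case, here is a minimal Python sketch (nothing that Oracle or HP-UX ships). It starts a few idle children in one new process group, sends a single SIGTERM to the group, and collects each child's exit status. The names `spawn_group` and `terminate_group` are made up for this example.

```python
import os
import signal
import subprocess
import sys

# An idle child: a Python process that just sleeps.
IDLE = [sys.executable, "-c", "import time; time.sleep(60)"]

def spawn_group(count=3):
    """Start `count` idle children that all share one new process group."""
    # The first child creates a new group and becomes its leader.
    leader = subprocess.Popen(IDLE, preexec_fn=os.setpgrp)
    pgid = leader.pid
    # The remaining children join the leader's group before they exec.
    others = [
        subprocess.Popen(IDLE, preexec_fn=lambda: os.setpgid(0, pgid))
        for _ in range(count - 1)
    ]
    return pgid, [leader] + others

def terminate_group(pgid, procs):
    """Send one SIGTERM to the whole group; return each child's exit status."""
    os.killpg(pgid, signal.SIGTERM)  # comparable to: kill -15 -PGID
    return [p.wait(timeout=10) for p in procs]

if __name__ == "__main__":
    pgid, procs = spawn_group()
    # subprocess reports "killed by signal N" as the return code -N.
    print(terminate_group(pgid, procs))
```

One signal, aimed at the group rather than at a PID, is enough to take down every member at the same instant, which is the pattern the trace files suggest.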

If it ain't broke, I can fix that.

Leon Allen · ‎01-17-2006

Apologies for my poor spelling/typing above.

I forgot to mention, the silly thing crashed again yesterday, with exactly the same symptoms. That's twice now in as many (work) days. I'm starting to really worry now.

On Friday, it was a quiet afternoon - not too much activity at all. Yesterday was a normal morning.

To restart after the crash, I have to do

sqlplus /nolog
connect / as sysdba
shutdown abort
startup open
.
.
lsnrctl start
agentctl start

i.e. the listener process and agent process seem to crash as well.

Cheers!
Leon

Time's fun when your having flys (ancient frog saying)

Leon Allen · ‎01-17-2006

Thanks for your prompt reply Clay.

Given the non-specific times (p.m. one day, a.m. two days later), I haven't suspected a cron job.

I've checked the .sh_history of root and oracle to see if there was any funny business going on - but did not detect anything.

I might do a cat or strings * | grep kill (or something like that) to see if anything has been scripted, but that is most unlikely (there is really only me here).

I think the crashes are genuine (as opposed to accidental or malicious).
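
A rough sketch of that search in Python, assuming you simply want the names of files under some directory that contain the word "kill". The function name `files_mentioning` and the size cap are choices made for this example; which directories to scan (crontabs, admin scripts, application directories) is left to you.

```python
import os
import sys

def files_mentioning(root, needle=b"kill", max_bytes=1_000_000):
    """Return sorted paths of regular files under `root` whose first
    `max_bytes` bytes contain `needle`."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip FIFOs, device files and dangling links: opening them
            # could block or fail.
            if not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as fh:
                    if needle in fh.read(max_bytes):
                        hits.append(path)
            except OSError:
                continue  # unreadable file: skip it
    return sorted(hits)

if __name__ == "__main__":
    # Usage: python find_kill.py /some/script/directory
    for hit in files_mentioning(sys.argv[1]):
        print(hit)
```

Reading in binary mode means scripts and compiled programs are treated alike, roughly what piping strings into grep would give.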

Time's fun when your having flys (ancient frog saying)

Yogeeraj_1 · ‎01-17-2006

hi Leon,

did you install any patch or other software recently?

also, did you change any kernel parameters recently?

I would also check my syslog file and run an "analyze table ... validate structure" on the tables (if feasible) or do a full database export to make sure there is no form of data corruption somewhere.

also, make sure your backups are up to date and you can perform recovery successfully on your backup server.

take all your precautions

kind regards
yogeeraj

No person was ever honoured for what he received. Honour has been the reward for what he gave (Calvin Coolidge)

RAC_1 · ‎01-17-2006

I do not know how feasible this would be, but how about enabling auditing and monitoring only the kill call?

There is no substitute to HARDWORK

Leon Allen · ‎01-18-2006

Thanks Yog. No recent changes. A nightly export of the database is done, and this indicates no corruption (it would fail if there were any). I will do the analyze.

And thanks RAC; yes, Clay's perspective on it was interesting, and did get me thinking. Thinking so much my brain started to hurt, and just yesterday I turned on auditing of processes, including kill, via SAM. I checked which accounts could potentially initiate a kill - we have a GIS account which is a member of the dba group, and which through the GIS application (ArcSDE) can execute a range of commands. I'm going to keep an eye on this, in case it happens again.

Time's fun when your having flys (ancient frog saying)

rick jones · ‎01-18-2006

I don't know that tusc could show the origin of the SIGTERM, but you could try hanging a tusc off one of the processes. If all the processes were indeed dying with SIGTERMs, perhaps the pgroup got it, or maybe a main parent process got upset and decided to take out the entire family.

IIRC there are ways to register signal handlers such that they can be given a siginfo_t structure that includes information about the origin of the signal. Perhaps Oracle could create a "bugcatcher" that does this if there is no convenient auditing mechanism available.
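
As a sketch of the idea, Python exposes the same siginfo_t fields through `signal.sigtimedwait`, on platforms that provide that call. The example below blocks SIGTERM, has a forked child send it, and reads back the sender's PID and UID. The names `wait_for_sigterm` and `demo` are made up here, and whether HP-UX 11.11 fills in these fields is an assumption I have not checked. In C the equivalent would be a handler installed with sigaction and the SA_SIGINFO flag.

```python
import os
import signal

def wait_for_sigterm(timeout=5.0):
    """Wait up to `timeout` seconds for a SIGTERM and report who sent it.

    SIGTERM must already be blocked in the calling thread, otherwise the
    default action (termination) happens before we can look at it.
    Returns None on timeout.
    """
    info = signal.sigtimedwait({signal.SIGTERM}, timeout)
    if info is None:
        return None
    return {
        "signal": info.si_signo,
        "sender_pid": info.si_pid,
        "sender_uid": info.si_uid,
    }

def demo():
    """Have a forked child send SIGTERM to this process, then identify it."""
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGTERM})
    try:
        child = os.fork()
        if child == 0:
            # Child: signal the parent, then exit without any cleanup.
            os.kill(os.getppid(), signal.SIGTERM)
            os._exit(0)
        report = wait_for_sigterm()
        os.waitpid(child, 0)
        return child, report
    finally:
        # Restore the caller's original signal mask.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)

if __name__ == "__main__":
    child, report = demo()
    print("child pid:", child, "report:", report)
```

A "bugcatcher" built this way would log the sender's PID and UID before terminating, which is exactly the information missing from the trace files.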

there is no rest for the wicked yet the virtuous have no pillows

Leon Allen · ‎01-18-2006

Thanks Rick - I'll look into your suggestions.

The audit trail so far (oracle user)

-->audisp -c kill -u oracle /u03/.secure/etc/audfile4
users and aids:
oracle
12
Selected the following events:
37
All ttys are selected.
Selecting successful & failed events.
TIME PID E EVENT PPID AID RUID RGID EUID EGID TTY

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
060118 14:41:22 17288 S 37 17287 12 102 102 102 102 pts/tf
[ Event=kill; User=oracle; Real Grp=dba; Eff.Grp=dba; ]

RETURN_VALUE 1 = 0;
PARAM #1 (int) = 0
PARAM #2 (int) = 15
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
060118 14:41:22 17288 S 37 17287 12 102 102 102 102 pts/tf
[ Event=kill; User=oracle; Real Grp=dba; Eff.Grp=dba; ]

RETURN_VALUE 1 = 0;
PARAM #1 (int) = 0
PARAM #2 (int) = 26
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
root@cis1: in /home/root
-->

There are a lot more kills by root. I presume this is all normal (Oracle hasn't crashed again yet), and I might see an 'exception' in this audit log if / when it does crash.

What does the above data tell me? PARAM #1 looks like a pid, and PARAM #2 looks like a signum? (The opposite order to the actual kill command's parameters.)
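
A small Python sketch for reading records like these, assuming the two PARAM values are the arguments of the kill system call in order, i.e. kill(pid, sig). That mapping is my assumption, not something the audisp output states. The column layout is taken from the listing above, and `describe_target` follows the standard kill() rules for pid values; all names are made up for this example.

```python
import re

# Header line of a record: TIME PID E EVENT PPID ...
HEADER = re.compile(r"^(\d{6} \d{2}:\d{2}:\d{2})\s+(\d+)\s+\S\s+\d+\s+(\d+)")
PARAM = re.compile(r"PARAM #(\d+) \(int\) = (-?\d+)")

def describe_target(pid):
    """Explain what the pid argument of kill(pid, sig) addresses."""
    if pid > 0:
        return f"process {pid}"
    if pid == 0:
        return "every process in the caller's own process group"
    if pid == -1:
        return "every process the caller is permitted to signal"
    return f"every process in process group {-pid}"

def parse_kill_records(text):
    """Collect caller pid, parent pid and PARAM values for each record."""
    records = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        m = HEADER.match(line)
        if m:
            current = {"time": m.group(1), "caller_pid": int(m.group(2)),
                       "caller_ppid": int(m.group(3)), "params": {}}
            records.append(current)
            continue
        m = PARAM.search(line)
        if m and current is not None:
            current["params"][int(m.group(1))] = int(m.group(2))
    return records

# The two records from the listing above, trimmed to the relevant lines.
SAMPLE = """\
060118 14:41:22 17288 S 37 17287 12 102 102 102 102 pts/tf
PARAM #1 (int) = 0
PARAM #2 (int) = 15
~~~~~~~~
060118 14:41:22 17288 S 37 17287 12 102 102 102 102 pts/tf
PARAM #1 (int) = 0
PARAM #2 (int) = 26
"""

if __name__ == "__main__":
    for rec in parse_kill_records(SAMPLE):
        p = rec["params"]
        print(f"{rec['time']} pid {rec['caller_pid']} sent signal {p[2]} "
              f"to {describe_target(p[1])}")
```

If the assumption holds, a first parameter of 0 is not a pid at all: it would mean the caller signalled its own process group.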

Time's fun when your having flys (ancient frog saying)

rick jones · ‎01-18-2006

I'm not overly familiar with the auditing stuff - is it auditing the "kill" command, or is it auditing the kill system call, or both?

there is no rest for the wicked yet the virtuous have no pillows
