Operating System - HP-UX

Service guard Oracle dataguard hang monitoring.

 
Boerbokrib
Advisor

Hi

 

The database admin has asked me to turn off Oracle hang monitoring, as it reports Oracle in a hung state when we run our OS backups with Tivoli.

 

I have set hang monitoring to alert only, with a 60-second timeout. However, the listener monitor then caused a failover, as shown in the log below. Can I set timeouts and alert-only behaviour for the listener as well?

 

I have included the services from my package config file.

 

This seems to correlate with the backup reaching /orabin. What files should be excluded from the backups?

 

Currently the exclusions are:

 

exclude.fs     "/oradata*"
exclude.dir     "/orarecovery/*"
exclude.dir     "/orabin/diag/*"
exclude.dir     "/orabin/product/agent11g/*"

 

 

 

Aug  8 20:13:25 - Node "alsop1" The database hang check script is not responding.  Killing the process

/opt/cmcluster/toolkit/oracle/toolkit.sh[3]: 25563 Killed

Aug  8 20:13:25 - Node "alsop1" ERROR: Database hang detected.  There is a possiblity that the database could be hung.

08/08/12-20:13:25  SGAlert message sent to: unix_monitor@ialch.co.za

Aug  8 20:13:33 - Node "alsop1"  Oracle Listener listener_smsp failure detected.

Aug  8 20:13:33 - Node "alsop1" Oracle Listener listener_smsp failed

Aug  8 20:13:36 - Node "alsop1" All listeners have failed

Aug  8 20:13:36 root@alsop1 master_control_script.sh[25780]: ###### Halting package smspdg_DB ######

Aug  8 20:13:36 root@alsop1 service.sh[25791]: Halting service oracle_service_smsp

Aug  8 20:13:36 root@alsop1 service.sh[25791]: Halting service oracle_listener_service_smsp

 

 

For the sake of completeness, here are the services:

 

service_name                    oracle_service_ftpz
service_cmd                     "$SGCONF/scripts/ecmt/oracle/tkit_module.sh oracle_monitor"
service_restart                 none
service_fail_fast_enabled                       no
service_halt_timeout                    300
service_name                    oracle_listener_service_ftpz
service_cmd                     "$SGCONF/scripts/ecmt/oracle/tkit_module.sh oracle_monitor_listener"
service_restart                 none
service_fail_fast_enabled                       no
service_halt_timeout                    300
service_name                    oracle_hang_service_ftpz
service_cmd                     "$SGCONF/scripts/ecmt/oracle/tkit_module.sh oracle_hang_monitor 300 alert"
service_restart                 none
service_fail_fast_enabled                       no
service_halt_timeout                    300
service_name                    dataguard_service_ftpz
service_cmd                     "$SGCONF/scripts/tkit/dataguard/tkit_module.sh dataguard_monitor"
service_restart                 none
service_fail_fast_enabled                       no
service_halt_timeout                    300

5 REPLIES

Re: Service guard Oracle dataguard hang monitoring.

Looking at your errors, I'm going to assume you have posted the first relevant entry in the log file and there is nothing else interesting on or around this time in your log file:

 

Aug  8 20:13:25 - Node "alsop1" The database hang check script is not responding.  Killing the process

/opt/cmcluster/toolkit/oracle/toolkit.sh[3]: 25563 Killed

 

This tells us that a sqlplus call of "SELECT STATUS FROM V$INSTANCE" has been hung for 5 minutes - can your DBA explain why that might happen during an OS backup?

 

However, I don't think that caused your failure, as it appears you have it set to alert only rather than to initiate a failover...

 

Aug  8 20:13:33 - Node "alsop1"  Oracle Listener listener_smsp failure detected.

Aug  8 20:13:33 - Node "alsop1" Oracle Listener listener_smsp failed

Aug  8 20:13:36 - Node "alsop1" All listeners have failed

 

This is what is causing Serviceguard to halt the package. Again, it looks like a call to "lsnrctl status listener_smsp" has returned a non-zero value... why would that happen during an OS backup?
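
To illustrate what a more tolerant listener check could look like, here is a minimal sketch that retries a status command a few times before reporting failure. This is not how the toolkit's own listener monitor works; check_with_retries is a hypothetical helper, and the attempt count and delay are assumptions you would have to tune. It also does nothing about a check that hangs rather than fails.

```shell
#!/bin/sh
# Hypothetical helper, not part of the Serviceguard toolkit.
# Usage: check_with_retries ATTEMPTS DELAY_SECONDS CMD [ARGS...]
# Returns 0 as soon as one attempt succeeds, 1 if every attempt fails.
check_with_retries() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        # Run the check quietly; only its exit status matters here.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        # Sleep between attempts, but not after the last one.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# Example call, using the listener name from the log above:
# check_with_retries 3 10 lsnrctl status listener_smsp
```

A single slow response during the backup window would then not count as "listener failed" on its own, though it would still be worth finding out why the response was slow.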

 

So the big question is, what is Tivoli doing during an OS backup to cause Oracle to stop responding? All seems a bit odd to me.

 

Of course, this being a community support forum, it could be that you don't actually care about solving what is really going on here and just want to get rid of the error and move on to the next issue in your queue ;o) If that's the case, you could consider having the Tivoli backup create a maintenance flag before it starts and delete it after the backup is finished. This is pretty easy to do... I assume Tivoli has some sort of capability to run a pre- and post-backup script? If so, just have it touch a file called "oracle.debug" in the package's directory before the backup, and remove it after the backup. While the file <package dir>/oracle.debug exists, Serviceguard won't monitor the database.
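
A minimal sketch of that maintenance-flag approach, assuming the package directory is known and writable by whoever runs the backup. The PKG_DIR default below is a hypothetical path, and the function names are made up for illustration; only the file name oracle.debug comes from the description above.

```shell
#!/bin/sh
# Hypothetical package directory; replace with the real <package dir>.
PKG_DIR="${PKG_DIR:-/etc/cmcluster/smspdg_DB}"
FLAG="oracle.debug"

# Pre-backup step: create the flag so monitoring is paused.
pause_monitoring() {
    touch "$1/$FLAG"
}

# Post-backup step: remove the flag so monitoring resumes.
resume_monitoring() {
    rm -f "$1/$FLAG"
}

# Run a command with monitoring paused, and always resume afterwards,
# even if the command fails. Returns the command's exit status.
run_with_monitoring_paused() {
    dir=$1
    shift
    pause_monitoring "$dir" || return 1
    "$@"
    rc=$?
    resume_monitoring "$dir"
    return $rc
}

# Example call (the backup command shown is an assumption):
# run_with_monitoring_paused "$PKG_DIR" dsmc incremental
```

If the backup is started by the Tivoli scheduler rather than from a wrapper, the two helper steps could instead be hooked in through the client's pre- and post-schedule command options, if your client version has them. Keep in mind that the database is unmonitored for the whole backup window while the flag exists.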

 

But me, I'd want to understand what is going on...


I am an HPE Employee
Boerbokrib
Advisor

Re: Service guard Oracle dataguard hang monitoring.

Hi

 

I would definitely like to solve the problem.

 

However, the DBA cannot tell me why the database is hanging.

According to him, SELECT STATUS FROM V$INSTANCE shows OPEN. But at some stage it must fail to return that for the monitor to report a hang.

Tivoli is a real pain to work with and I cannot understand it at all. But that is all the customer has.

 

It definitely seems to happen while it is backing up /orabin, but exactly which file I cannot tell. The timeouts were 60 seconds; I have changed them to 300 seconds as of today.

 

 

Boerbokrib
Advisor

Re: Service guard Oracle dataguard hang monitoring.

What would be the SQL for the listener?
Boerbokrib
Advisor

Re: Service guard Oracle dataguard hang monitoring.

OK, the listener is monitored through halistener.mon, which is actually just a call to the command

lsnrctl status listener_id

which gives me this output:

 

LSNRCTL for HPUX: Version 11.2.0.2.0 - Production on 14-AUG-2012 08:08:33

Copyright (c) 1991, 2010, Oracle.  All rights reserved.

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=alsop1.ialch.co.za)(PORT=1531))(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=smsp.ialch.co.za)))
STATUS of the LISTENER
------------------------
Alias                     listener_smsp
Version                   TNSLSNR for HPUX: Version 11.2.0.2.0 - Production
Start Date                08-AUG-2012 20:31:56
Uptime                    5 days 11 hr. 36 min. 36 sec
Trace Level               off
Security                  ON: Local OS Authentication
SNMP                      OFF
Listener Parameter File   /orabin/product/11.2.0/dbhome_1/network/admin/listener.ora
Listener Log File         /orabin/diag/tnslsnr/alsop1/listener_smsp/alert/log.xml
Listening Endpoints Summary...
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=alsop1.ialch.co.za)(PORT=1531)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=EXTPROC1531)))
Services Summary...
Service "smsp.ialch.co.za" has 2 instance(s).
  Instance "smsp", status UNKNOWN, has 1 handler(s) for this service...
  Instance "smsp", status READY, has 1 handler(s) for this service...
Service "smspXDB.ialch.co.za" has 1 instance(s).
  Instance "smsp", status READY, has 1 handler(s) for this service...
The command completed successfully

 

What does the UNKNOWN status mean? Is it normal?

 

 

Boerbokrib
Advisor

Re: Service guard Oracle dataguard hang monitoring.

At the time of the failure I had the error below. I should say that I am not that familiar with Oracle yet. My syslog shows no network issues at that time, and the network admin says he did not see anything either.

 

What was running at the time was the Tivoli backup. But I cannot say which file it was backing up, as it does not record timestamps for each file backed up.

 

08-AUG-2012 20:13:30 * <unknown connect data> * (ADDRESS=(PROTOCOL=tcp)(HOST=::1)(PORT=33917)) * status * <unknown sid> * 12525.

 

The Oracle description of the error is below.

ORA-12525: TNS:listener has not received client's request in time allowed

Cause: The listener disconnected the client because the client failed to provide the necessary connect information within the allowed time interval. This may be a result of network or system delays; or this may indicate that a malicious client is trying to cause a Denial of Service attack on the listener.

Action: If the error occurred because of a slow network or system, reconfigure INBOUND_CONNECT_TIMEOUT to a larger value. If a malicious client is suspected, use the address in listener.log to identify the source and restrict access. Turn on tracing for more information.
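
To see whether the 12525 errors line up with the backup window, the text listener log can be filtered by return code. A minimal sketch, assuming the log uses the " * "-separated layout of the entry quoted above, with the return code as the last field. entries_with_code is a hypothetical helper name, and the log path in the example is a placeholder, not a known location.

```shell
#!/bin/sh
# Hypothetical helper: print the timestamp of every listener log entry
# whose last " * "-separated field equals the given return code.
# Usage: entries_with_code CODE LOGFILE
entries_with_code() {
    code=$1
    log=$2
    awk -F' [*] ' -v code="$code" '
        {
            rc = $NF
            sub(/[^0-9]+$/, "", rc)   # drop trailing punctuation or spaces
            if (rc == code) print $1  # field 1 is the timestamp
        }
    ' "$log"
}

# Example call (placeholder path, adjust to the real text listener log):
# entries_with_code 12525 /path/to/listener_smsp.log
```

Comparing the printed timestamps with the backup start and end times should show whether the timeouts only occur while the backup is running.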