Operating System - HP-UX
Kannandgl_1
Frequent Advisor

Serviceguard Cluster failur.,,,

Greetings All;

I have a two node cluster setup with the primary server named "data1" while the secondary server is named "data2".

Cluster failover had been working without any problems. Recently, the primary server had network problems. After the network problem was resolved, the cluster, which was running on the primary node, suddenly switched over to the secondary node. I am attaching the syslog as well as the orapkg log.

Please check the issue and advise.

Regards
Kanna
Suraj K Sankari
Honored Contributor
Solution

Re: Serviceguard Cluster failur.,,,

Hi,
It is possible there was a network issue; maybe the heartbeat was dropped.

Can you post the output of this command?
tail -n 150 syslog.log

Suraj
Kannandgl_1
Frequent Advisor

Re: Serviceguard Cluster failur.,,,

Greetings Sankari,

Yes, there was a network problem yesterday, but the failover problem appeared only after the network issue was resolved. I have now restarted both systems and the cluster is working fine.

Note: I am also seeing some other messages in the syslog.

May 4 08:18:10 hogisdata1 cmcld: Service orapkg terminated due to an exit(0).
May 4 08:18:10 data1 cmcld: Service orapkg in package orapkg has gone down.
May 4 08:18:10 data1 cmcld: Disabled node data1 from running package orapkg.
May 4 08:18:03 data1 su: + tty?? root-oracle
May 4 08:18:10 data1 above message repeats 4 times
May 4 08:18:10 data1 cmcld: Executing '/etc/cmcluster/orapkg/orapkg.cntl stop' for package orapkg, as service PKG*11265.
May 4 08:18:10 data1 CM-orapkg[25665]: cmhaltserv orapkg
May 4 08:18:10 data1 su: + tty?? root-oracle
May 4 08:18:11 data1 CM-orapkg[25677]: cmmodnet -r -i 10.38.1.46 10.38.1.0
May 4 08:18:15 data1 LVM[25728]: vgchange -a n vgdata
May 4 08:18:15 data1 LVM[25736]: vgchange -a n vglog
May 4 08:18:10 data1 su: + tty?? root-oracle
May 4 08:18:15 data1 cmcld: Service PKG*11265 terminated due to an exit(0).
May 4 08:18:15 data1 cmcld: Halted package orapkg on node data1.
May 4 08:18:15 data1 cmcld: Package orapkg cannot run on this node because switching has been disabled for this node
May 4 08:18:48 data1 cmcld: (data2) Started package orapkg on node data2.

Please advise.

Regards
kanna
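
When reading an excerpt like the one above, it can help to strip out the su and LVM noise and look only at what the cluster daemon says about one package. A minimal sketch follows; pkg_events is a hypothetical helper name (not a Serviceguard command), and the path in the example call assumes the default HP-UX syslog location.

```shell
#!/bin/sh
# Sketch: print only the cmcld (cluster daemon) lines that mention a package.
# pkg_events is a made-up helper for illustration, not part of Serviceguard.
pkg_events() {
    pkg=$1      # package name, for example orapkg
    file=$2     # syslog file to scan
    grep 'cmcld:' "$file" | grep "$pkg"
}

# Example call, assuming the default HP-UX syslog location:
# pkg_events orapkg /var/adm/syslog/syslog.log
```

Lines that refer to the package only through an internal service name (such as PKG*11265 above) are not matched by this filter, so check the full log around the timestamps it reports.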
S.N.S
Valued Contributor

Re: Serviceguard Cluster failur.,,,

Hi Kanna,

Please furnish the pkg logs...

Meanwhile, try enabling global switching:
cmmodpkg -e package_name

Awaiting Details
SNS
"Genius is 1% inspiration, 99% Perspiration" - Edison
Kannandgl_1
Frequent Advisor

Re: Serviceguard Cluster failur.,,,

Hi SNS,

Where (which path) can I find the package log?
All the details are available in my attachment.

Regards
Kanna
Michal Kapalka (mikap)
Honored Contributor

Re: Serviceguard Cluster failur.,,,

hi,

the package log is located in :

/etc/cmcluster/CLUSTER_NAME/PKG_NAME.cntl.log

mikap
S-M-S
Valued Contributor

Re: Serviceguard Cluster failur.,,,

/etc/cmcluster/orapkg/orapkg.cntl.log
S-M-S
Valued Contributor

Re: Serviceguard Cluster failur.,,,

Hello Kannan,
What was the network problem that you resolved?

Was the primary node rebooted while the package was switching to the secondary node?
Vishu
Trusted Contributor

Re: Serviceguard Cluster failur.,,,

Hi Kannan,

Please provide the syslog.log for May 4, as your attachment does not have all the information required.

What I understand from your post is that you had a network problem and, while it was being fixed, you might have lost the heartbeat communication between your cluster nodes, resulting in a switch from your primary node to the secondary node.

Did your primary node reboot after cluster failover?

Thanks
Rita C Workman
Honored Contributor

Re: Serviceguard Cluster failur.,,,

You have left out a lot of info, but ignoring that...

Failed over due to N/W issue. OK, it worked.
So now your cluster and pkgs likely are up on data2. Good.

When it fails over, since you only have a 2 node cluster, your package will be left up on data2 and data2 owns the lock disk, so it controls the cluster.

If data1 is back up and working properly, FIRST, it should have automatically re-joined the cluster. Check your /etc/rc.config.d/cmcluster to make sure that AUTOSTART_CMCLD=1. If it is set to "0" then on a reboot it will NOT join the cluster. Next, you can put it back into the cluster manually and then you can move your package back to data1 and enable it for failover.

1. cmviewcl
Check the status of your cluster, it probably shows data1 as 'down' or 'unknown'.
With your package running on data2
If data1 is working fine now, then:
2. cmrunnode -v data1
This will start the cluster daemon on data1 and it should rejoin the cluster.
3. cmviewcl -v
Check status of your cluster, it should show data1 as part of the cluster and 'running'.
4. When you can schedule a brief downtime for your application, to put the package back on data1:
Stop the pkg
cmhaltpkg -v package_name

Run package back on data1:
cmmodpkg -e package_name
This command will force the package to start on the first node listed in the package.ascii file (which is the primary node), and it will be set for ENABLED for failover.

Lastly,
You need to read the manuals on Serviceguard

http://h20000.www2.hp.com/bizsupport/TechSupport/DocumentIndex.jsp?lang=en&cc=us&taskId=101&prodClassId=10008&contentType=SupportManual&docIndexId=64255&prodTypeId=18964&prodSeriesId=4162060

Regards,
Rita
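
The steps above can be sketched as a small script. This is only an outline under assumptions: the node and package names (data1, orapkg) are taken from this thread, the helper names run, failback and autostart_enabled are made up for illustration, and it uses cmrunpkg -n to start the package on a named node instead of relying on cmmodpkg -e alone, because the syslog shows node switching was disabled for data1. It defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Sketch of a manual failback. Defaults to a dry run; set DRY_RUN=0 to
# actually execute the commands on a cluster node.
# NODE and PKG are assumptions taken from this thread; adjust as needed.
NODE=${NODE:-data1}
PKG=${PKG:-orapkg}
DRY_RUN=${DRY_RUN:-1}

# Succeeds if the given rc config file has AUTOSTART_CMCLD=1.
autostart_enabled() {
    grep -q '^AUTOSTART_CMCLD=1' "${1:-/etc/rc.config.d/cmcluster}"
}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

failback() {
    run cmviewcl -v                    # 1. check current cluster state
    run cmrunnode -v "$NODE"           # 2. have the node rejoin the cluster
    run cmviewcl -v                    # 3. confirm the node is up and running
    run cmhaltpkg -v "$PKG"            # 4. halt the package (brief outage)
    run cmmodpkg -e -n "$NODE" "$PKG"  #    allow the package on that node again
    run cmrunpkg -v -n "$NODE" "$PKG"  #    start the package on that node
    run cmmodpkg -e "$PKG"             #    re-enable package switching
}
```

Calling failback with DRY_RUN left at 1 just lists the seven commands in order, which is a cheap way to review the plan before a maintenance window.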

Suraj K Sankari
Honored Contributor

Re: Serviceguard Cluster failur.,,,

Hi,
>>Where i am got ( path )the pkg log.

The path will be under this directory:
/etc/cmcluster/package_dir/package.log or package.cntl.log

Suraj
Kannandgl_1
Frequent Advisor

Re: Serviceguard Cluster failur.,,,

Greetings all,

I am worried: the cluster failover happened again today.

Please find attached the cluster control log and the syslog file for your reference.

Could any expert kindly advise?

Regards
Rajamani
melvyn burnard
Honored Contributor

Re: Serviceguard Cluster failur.,,,

Your package has shut down due to a service failure:
Service orapkg in package orapkg has gone down

You need to check what this service is doing. If it is monitoring database processes, then it is telling you that a database process has died, so the service dies, which tells the package manager to cleanly halt the package.
I suggest you open a call with HP for this one.

Also, check your database logs, patching levels, etc.
If this has just started doing this, what has changed around the time it started happening?
Do you have a DBA going in and perhaps doing something to the database that stops certain Oracle processes?
My house is the bank's, my money the wife's, But my opinions belong to me, not HP!
Aneesh Mohan
Honored Contributor

Re: Serviceguard Cluster failur.,,,

Hi,

Please check your SERVICE_CMD monitoring script. It appears to me that the monitoring script exits, which initiates the package failover.

Please post your Package(orapkg) Monitoring Script.

Aneesh
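
To illustrate the mechanism, here is a minimal sketch of what a monitoring script of this kind typically does. It is NOT the poster's orapkg.sh; monitor_pids is a hypothetical name and the PID-polling approach is an assumption. The point is that the function returns as soon as one watched process is gone, and, as the syslog excerpt earlier in the thread suggests ("terminated due to an exit(0)"), the service command ending is what leads cmcld to halt the package, even with exit status 0.

```shell
#!/bin/sh
# Hypothetical service monitor, for illustration only.
# Usage: monitor_pids INTERVAL PID [PID...]
# Polls the given PIDs and returns once any of them no longer exists.
monitor_pids() {
    interval=$1
    shift
    while :; do
        for pid in "$@"; do
            # kill -0 sends no signal; it only tests whether the PID exists
            # (run as root, as Serviceguard package scripts are).
            if ! kill -0 "$pid" 2>/dev/null; then
                echo "Process failed: $pid"
                return 0
            fi
        done
        sleep "$interval"
    done
}
```

If a DBA stops or restarts the database by hand, the watched PIDs disappear and a monitor like this ends exactly as it would for a real crash, so the package is halted either way.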
atul2701
Frequent Advisor

Re: Serviceguard Cluster failur.,,,

Hi
Your cluster is monitoring the orapkg service. If this service terminates on your primary node, the package fails over to the adoptive node, and that is what happened in your case.
You need to check with your Oracle DBA team why the orapkg service was terminated.
atul
Atul Gupta
Kannandgl_1
Frequent Advisor

Re: Serviceguard Cluster failur.,,,

Dear Aneesh Mohan,

Please find attached the orapkg.sh script you requested.

Regards
Rajamani
Aneesh Mohan
Honored Contributor

Re: Serviceguard Cluster failur.,,,

Hi,

What I could see is...

Line 160 (from your script):-
=======
# Startup Oracle.
# These commands will not return any error codes; therefore, anything
# that fails after this point will cause a failover to another node.

......ORACLE DB START UP (Using PKG script).........

....... FAILED TO START (ALREADY DB IS RUNNING) ......

>>SQL> ORA-01081: cannot start already-running ORACLE - shut it down first

...... FAILOVER TO THE NEXT CONFIGURED NODE........

I believe your DBA had already started the database manually before the package script tried to start it.

This might be the cause for the failover to the next configured node.

Regards,
Aneesh
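
One way to avoid this kind of ORA-01081 failure is to check for a running instance before the startup step. The following is a rough sketch only, not the code from the attached script: instance_running and start_db_guarded are made-up names, and the check assumes the usual Oracle background process name ora_pmon_<SID> appears as the last column of ps -ef.

```shell
#!/bin/sh
# instance_running SID
# Reads a "ps -ef" style listing on stdin and succeeds if a line ends
# with the PMON background process of that SID (ora_pmon_<SID>).
instance_running() {
    awk -v p="ora_pmon_$1" '$NF == p { found = 1 } END { exit !found }'
}

# start_db_guarded SID
# Skips the startup step when the instance is already up.
start_db_guarded() {
    sid=$1
    if ps -ef | instance_running "$sid"; then
        echo "instance $sid already running, skipping startup"
        return 0
    fi
    echo "instance $sid not running, startup would go here"
    # The real startup commands from the package script belong here.
}
```

Whether skipping startup is the right reaction depends on the site; some teams prefer the package to fail loudly when the database was started outside cluster control.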
stephen peng
Valued Contributor

Re: Serviceguard Cluster failur.,,,

Just disable service monitoring and see what happens. If an Oracle shutdown leads to the package failing over, how does your team manage to do database maintenance? Of course, you should also check Oracle's alert.log.
Kannandgl_1
Frequent Advisor

Re: Serviceguard Cluster failur.,,,

Dear Friends,

The above issue has been resolved.

1. On the Oracle side (node 1 and node 2), the listener.ora entry was the problem. It was resolved after correcting the host information on both nodes.


Regards
rajamani
Kannandgl_1
Frequent Advisor

Re: Serviceguard Cluster failur.,,,

Dear Friends ,

I hope you can help me.
After 20 days, today I am again getting a failover from node 1 to node 2. The package is automatically disabled on node 1.
# cmviewcl -v

CLUSTER      STATUS
cluster1     up

  NODE         STATUS       STATE
  hogisdata1   up           running

    Network_Parameters:
    INTERFACE    STATUS       PATH         NAME
    PRIMARY      up           0/3/1/0      lan0
    PRIMARY      up           0/4/2/0      lan2

  NODE         STATUS       STATE
  hogisdata2   up           running

    Network_Parameters:
    INTERFACE    STATUS       PATH         NAME
    PRIMARY      up           0/3/1/0      lan0
    PRIMARY      up           0/4/2/0      lan2

    PACKAGE      STATUS       STATE        AUTO_RUN     NODE
    orapkg       up           running      enabled      hogisdata2

      Policy_Parameters:
      POLICY_NAME     CONFIGURED_VALUE
      Failover        configured_node
      Failback        manual

      Script_Parameters:
      ITEM       STATUS   MAX_RESTARTS  RESTARTS   NAME
      Service    up                  0         0   orapkg
      Subnet     up                                10.38.1.0
      Subnet     up                                10.10.10.0

      Node_Switching_Parameters:
      NODE_TYPE    STATUS       SWITCHING    NAME
      Primary      up           disabled     hogisdata1
      Alternate    up           enabled      hogisdata2   (current)



Error from oracle controller.log:
===================================
kill: 3457: no such process
Process failed: 3457 - Process-type: 0 }
Starting Oracle Server Listener...

LSNRCTL for HPUX: Version 10.2.0.3.0 - Production on 12-JUN-2010 06:33:26

Copyright (c) 1991, 2006, Oracle. All rights reserved.

TNS-01106: Listener using listener name LISTENER has already been started
Oracle Server Listener start failed.
Starting Oracle Server Listener...

LSNRCTL for HPUX: Version 10.2.0.3.0 - Production on 12-JUN-2010 06:33:26

Copyright (c) 1991, 2006, Oracle. All rights reserved.

TNS-01106: Listener using listener name LISTENER has already been started
Oracle Server Listener start failed.
Starting Oracle Server Listener...

LSNRCTL for HPUX: Version 10.2.0.3.0 - Production on 12-JUN-2010 06:33:26

Copyright (c) 1991, 2006, Oracle. All rights reserved.

TNS-01106: Listener using listener name LISTENER has already been started
Oracle Server Listener start failed.

=======================================
root 6804 1 12 Jun 10 ? 2:27 /opt/hpsmc/avc/bin/monitorsvcd
oracle 3351 1 0 06:32:20 ? 0:00 oracleewagdb (LOCAL=NO)
Stopping EM.....

SQL*Plus: Release 10.2.0.3.0 - Production on Sat Jun 12 06:33:27 2010

Copyright (c) 1982, 2006, Oracle. All Rights Reserved.


Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bit Production
With the Partitioning, OLAP and Data Mining options

SQL> ORACLE instance shut down.
SQL> Disconnected from Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bit Production
With the Partitioning, OLAP and Data Mining options
Oracle abort done.
Stopping Oracle Server Listener...

LSNRCTL for HPUX: Version 10.2.0.3.0 - Production on 12-JUN-2010 06:33:29

Copyright (c) 1991, 2006, Oracle. All rights reserved.

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=cluster1)(PORT=1521)))
The command completed successfully
Oracle Server Listener stop done.

*** tnslsnr process has stopped. ***
==============

+ date
+ print \n\t########### Node "hogisdata1": Halting package at Sat Jun 12 06:33:31 WAT 2010 ###########

########### Node "hogisdata1": Halting package at Sat Jun 12 06:33:31 WAT 2010 ###########
+ stop_resources
+ halt_services
Jun 12 06:33:31 - Node "hogisdata1": Halting service orapkg
cmhaltserv : Service name orapkg is not running.
+ customer_defined_halt_cmds
hogisdata1
Sat Jun 12 06:33:31 WAT 2010
.:/orabin/oracle/product/10g/db/bin:/sbin:/tools/bin:/usr/sbin:/usr/sbin:/usr/bin:/usr/sbin:/etc:/bin:/orabin/oracle/product/10g/db/audit/scripts:/opt/clic/bin://orabin/oracle/product/10g/db/jdk/bin:/orabin/oracle/product/10g/db/dcm/bin:/orabin/oracle/product/10g/db/opmn/bin:/orabin/oracle/product/10g/db/Apache/Apache/bin:/opt/java1.4/bin:/orabin/oracle/product/10g/db/bin:/usr/bin:/usr/sbin:/usr/bin:/usr/sbin:/etc:/bin

*** /etc/cmcluster/orapkg/orapkg.sh called with stop argument. ***


"hogisdata1": Shutting down Oracle SESSION ewagdb at Sat Jun 12 06:33:31 WAT 2010
Stopping OPMN managed processes...
Stopping EM.....
Stopping Oracle Server .....

SQL*Plus: Release 10.2.0.3.0 - Production on Sat Jun 12 06:33:31 2010

Copyright (c) 1982, 2006, Oracle. All Rights Reserved.

Connected to an idle instance.
=======================
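
The cmviewcl -v output above shows SWITCHING "disabled" for hogisdata1, which means the package cannot move back there until node switching is re-enabled. A small sketch for spotting this automatically is shown below; disabled_nodes is a hypothetical helper, and it assumes the Node_Switching_Parameters section has the four whitespace-separated columns seen above and ends at the next blank line.

```shell
#!/bin/sh
# disabled_nodes
# Reads "cmviewcl -v" output on stdin and prints the name of every node
# whose SWITCHING column is "disabled" in a Node_Switching_Parameters
# section. Helper name and parsing rules are assumptions for illustration.
disabled_nodes() {
    awk '
        /Node_Switching_Parameters:/ { in_sec = 1; next }
        in_sec && NF == 0            { in_sec = 0 }
        in_sec && $3 == "disabled"   { print $4 }
    '
}

# Example use on a cluster node (only prints the suggested command):
# cmviewcl -v | disabled_nodes | while read node; do
#     echo "cmmodpkg -e -n $node orapkg"
# done
```

Re-enabling switching only clears the symptom; the repeated "Listener ... has already been started" errors in the log still need to be explained, otherwise the next service failure will disable the node again.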


Deeos
Regular Advisor

Re: Serviceguard Cluster failur.,,,

hi kannandgl,

May I know what the network issue was?



Regards
Deeos
Deepak