MC ServiceGuard Cluster Failover Problem

Kenrick Sy · ‎12-26-2007

Hi,

I have a two-node cluster with the primary server named "datasvr" and the secondary server named "appl".

Failover used to work. Recently the primary server (datasvr) ran out of disk space, so I switched the package to the secondary server (appl) to act as a temporary primary while I fixed datasvr, and the failover ran into a problem.

These are the steps I took:
1. cmviewcl -v on datasvr

# cmviewcl -v
----------------------------------------------
CLUSTER STATUS
FEU_CLUSTER up

NODE STATUS STATE
appl up running

Network_Parameters:
INTERFACE STATUS PATH NAME
PRIMARY up 0/0/0/0 lan0
PRIMARY up 0/4/0/0/6/0 lan1
STANDBY up 0/7/0/0/6/0 lan3

NODE STATUS STATE
datasvr up running

Network_Parameters:
INTERFACE STATUS PATH NAME
PRIMARY up 0/0/0/0 lan0
STANDBY up 0/4/0/0/7/0 lan2
PRIMARY up 0/4/0/0/6/0 lan1

PACKAGE STATUS STATE PKG_SWITCH NODE
ORADB up running enabled datasvr

Policy_Parameters:
POLICY_NAME CONFIGURED_VALUE
Failover configured_node
Failback manual

Script_Parameters:
ITEM STATUS MAX_RESTARTS RESTARTS NAME
Service uninitia 0 0 Oracle_DB
Subnet up 192.168.0.0

Node_Switching_Parameters:
NODE_TYPE STATUS SWITCHING NAME
Primary up enabled datasvr (current)
Alternate up enabled appl
----------------------------------------------

2. cmhaltnode datasvr
----------------------------------------------
# cmhaltnode datasvr
cmhaltnode : Package ORADB is still running on datasvr.
Use the -f option to forcefully halt the node including halting packages.
# cmhaltnode -f datasvr
Disabling package switching to all nodes being halted.
Warning: Do not modify or enable packages until the halt operation is completed.

Halting Package ORADB
Halting cluster services on node datasvr
..
cmhaltnode : Successfully halted all nodes specified.
Halt operation completed.
----------------------------------------------

3. cmviewcl -v on datasvr

----------------------------------------------
# cmviewcl -v

CLUSTER STATUS
FEU_CLUSTER up

NODE STATUS STATE
appl up running

Network_Parameters:
INTERFACE STATUS PATH NAME
PRIMARY up 0/0/0/0 lan0
PRIMARY up 0/4/0/0/6/0 lan1
STANDBY up 0/7/0/0/6/0 lan3

PACKAGE STATUS STATE PKG_SWITCH NODE
ORADB up running enabled appl

Policy_Parameters:
POLICY_NAME CONFIGURED_VALUE
Failover configured_node
Failback manual

Script_Parameters:
ITEM STATUS MAX_RESTARTS RESTARTS NAME
Service up 0 0 Oracle_DB
Subnet up 192.168.0.0

Node_Switching_Parameters:
NODE_TYPE STATUS SWITCHING NAME
Primary down datasvr
Alternate up enabled appl (current)

NODE STATUS STATE
datasvr down halted

Network_Parameters:
INTERFACE STATUS PATH NAME
PRIMARY unknown 0/0/0/0 lan0
STANDBY unknown 0/4/0/0/7/0 lan2
PRIMARY unknown 0/4/0/0/6/0 lan1
----------------------------------------------

It now shows appl as the current primary server.

4. bdf on appl
----------------------------------------------
Filesystem kbytes used avail %used Mounted on
/dev/vg00/lvol3 143360 45092 92176 33% /
/dev/vg00/lvol1 83733 52870 22489 70% /stand
/dev/vg00/lvol8 1105920 512825 556484 48% /var
/dev/vg00/lvol7 1179648 521665 616880 46% /usr
/dev/vg00/u01 4096000 1288668 2631934 33% /u01
/dev/vg00/lvol4 65536 43907 20282 68% /tmp
/dev/vg00/lvol6 536576 401811 126392 76% /opt
/dev/vg00/lvol5 20480 2392 17018 12% /home
/dev/vgdb/u02 2048000 1554689 462484 77% /u02
/dev/vgdb/u03 1536000 375122 1088329 26% /u03
/dev/vgdb/u04 10240000 1643575 8060965 17% /u04
/dev/vgdb1/u05 10240000 2234660 7755178 22% /u05
/dev/vgdb2/u6 15360000 14032994 1285542 92% /u06
/dev/vgdb2/u7 15360000 5741138 9318688 38% /u07
----------------------------------------------
I can see that the cluster failover was successful since u02, u03, u04, u05, u06 and u07 are present.

5. After a few seconds, I ran bdf again on appl

----------------------------------------------
# bdf
Filesystem kbytes used avail %used Mounted on
/dev/vg00/lvol3 143360 43152 93994 31% /
/dev/vg00/lvol1 83733 52870 22489 70% /stand
/dev/vg00/lvol8 1105920 512824 556485 48% /var
/dev/vg00/lvol7 1179648 521665 616880 46% /usr
/dev/vg00/u01 4096000 1288663 2631938 33% /u01
/dev/vg00/lvol4 65536 43907 20282 68% /tmp
/dev/vg00/lvol6 536576 401811 126392 76% /opt
/dev/vg00/lvol5 20480 2392 17018 12% /home
#
----------------------------------------------

It seems that the cluster is breaking: u02, u03, u04, u05, u06 and u07 are no longer mounted.
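The comparison between the two bdf listings can be automated. The following is only a sketch: the function name missing_mounts is made up for illustration, and it assumes the mount point is the last column of each bdf line, as in the output above.

```shell
# Print every expected mount point that is absent from `bdf` output on stdin.
# Usage: bdf | missing_mounts /u02 /u03 /u04 /u05 /u06 /u07
# (missing_mounts is a hypothetical helper, not part of HP-UX or Serviceguard)
missing_mounts() {
    awk -v want="$*" '
        BEGIN { n = split(want, w, " ") }
        { seen[$NF] = 1 }               # last column of a bdf line is the mount point
        END {
            for (i = 1; i <= n; i++)
                if (!(w[i] in seen)) print w[i]
        }'
}
```

Run against the second listing above, it would print /u02 through /u07; run against the first, it would print nothing.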

I have also attached the syslog.log file.

I hope you can help me with this.

Thanks in advance.

Kenrick

Kenrick Sy · ‎12-26-2007

Attached is the syslog of datasvr.

Kenrick Sy · ‎12-26-2007

Attached is the syslog of appl.

Ludovic Derlyn · ‎12-27-2007

Hi,

To debug this, the cluster logs are needed.
The package log is at /etc/cmcluster/"name of package or cluster"/"name".cntl.log

If the problem is with a VG, I would first mount the filesystems by hand, look for errors, and then unmount them.

Have you exported the VG on the first node and imported it on the secondary?
Check your log (*.cntl.log) for more information.

Regards
L-DERLYN

Mridul Shrivastava · ‎12-27-2007

See the following messages in appl's syslog.log:
Dec 27 10:54:08 appl cmcld: Service PKG*18433 terminated due to an exit(0).
Dec 27 10:54:08 appl cmcld: Started package ORADB on node appl.
Dec 27 10:54:38 appl cmcld: Service Oracle_DB terminated due to an exit(1).
Dec 27 10:54:38 appl cmcld: Service Oracle_DB in package ORADB has gone down.
Dec 27 10:54:38 appl cmcld: Disabled node appl from running package ORADB.
Dec 27 10:54:38 appl cmcld: Executing '/etc/cmcluster/oradb/oradb.cntl stop' for package ORADB, as service PKG*18433.
Dec 27 10:54:38 appl CM-ORADB[8793]: cmhaltserv Oracle_DB
Dec 27 10:54:38 appl : su : + tty?? root-oracle
Dec 27 10:54:38 appl CM-ORADB[8830]: cmmodnet -r -i 192.168.0.93 192.168.0.0
Dec 27 10:54:42 appl LVM[8878]: vgchange -a n vgdb
Dec 27 10:54:42 appl LVM[8881]: vgchange -a n vgdb1
Dec 27 10:54:42 appl LVM[8884]: vgchange -a n vgdb2
Dec 27 10:54:48 appl cmcld: Service PKG*18433 terminated due to an exit(0).
Dec 27 10:54:48 appl cmcld: Halted package ORADB on node appl.
Dec 27 10:54:48 appl cmcld: Package ORADB cannot run on this node because switching has been disabled for this node.

You need to check the cntl.log for this package to find the root cause of why the package did not start on this node.
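To pull the same kind of lines out of a long syslog.log, a small filter can help. This is a sketch only: pkg_events is a made-up name, and the three patterns (cmcld:, CM-<package>[ and LVM[) are simply the tags visible in the excerpt above.

```shell
# Print cluster daemon, package control script and LVM messages from a syslog file.
# $1 = syslog file (e.g. /var/adm/syslog/syslog.log), $2 = package name
# (pkg_events is a hypothetical helper; the patterns come from the excerpt above)
pkg_events() {
    grep -E "cmcld:|CM-$2\[|LVM\[" "$1"
}
```

For example, `pkg_events /var/adm/syslog/syslog.log ORADB` would keep the cmcld, CM-ORADB and LVM lines and drop unrelated entries such as the su line.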

Time has a wonderful way of weeding out the trivial

Eric SAUBIGNAC · ‎12-27-2007

Bonjour Kenrick,

Have a look at these events in appl's syslog:

Dec 27 09:48:53 appl cmcld: Started package ORADB on node appl.
Dec 27 09:49:23 appl cmcld: Service Oracle_DB terminated due to an exit(1).
Dec 27 09:49:23 appl cmcld: Service Oracle_DB in package ORADB has gone down.
Dec 27 09:49:23 appl cmcld: Disabled node appl from running package ORADB.
Dec 27 09:49:23 appl cmcld: Executing '/etc/cmcluster/oradb/oradb.cntl stop' for package ORADB, as service PKG*18433.

30 seconds after the package has started successfully, the service associated with the package ends. When the service fails, the cluster is supposed to stop the package and then transfer it to another node. But no other node is suitable ...
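The 30-second gap can be read straight from the timestamps. As a rough sketch (secs_until_service_exit is a made-up name, and it assumes both events fall on the same day and that the time is the third field, as in the lines above):

```shell
# Print the seconds between "Started package <pkg>" and the first later
# service termination with a non-zero exit code. Same day is assumed.
# $1 = syslog file, $2 = package name
# (secs_until_service_exit is a hypothetical helper, not a Serviceguard tool)
secs_until_service_exit() {
    awk -v pkg="$2" '
        # convert HH:MM:SS to seconds since midnight
        function secs(t,   p) { split(t, p, ":"); return p[1]*3600 + p[2]*60 + p[3] }
        index($0, "Started package " pkg " ") { start = secs($3); have = 1; next }
        have && /terminated due to an exit\(/ && !/exit\(0\)/ {
            print secs($3) - start
            exit
        }' "$1"
}
```

For the 09:48:53 start and the 09:49:23 exit(1) shown above, this prints 30.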

You should look more closely at the application: what is this service, how is it configured in the package, why does it stop, and so on ...

I suggest that you look at the package's log file on node appl, probably /etc/cmcluster/oradb/oradb.sh.log, depending on how you built your cluster. Post this file.

If you can't find the explanation, I suggest that you first modify the package on node appl and deactivate the service, then start the package on node appl. If the package stays up, try starting by hand the script or process that was associated with the service, to see what happens and why it terminates after 30 s.
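Since syslog reports that node appl was disabled from running ORADB, the node has to be re-enabled before the package can be started there again. The sketch below only prints the usual Serviceguard command sequence instead of running it. The function name restart_pkg_cmds is made up, and the options are given as I remember them from the cmmodpkg and cmrunpkg man pages, so check them against your Serviceguard version before use.

```shell
# Print (do not run) the commands to restart a package on a node whose
# node switching was disabled after a service failure.
# $1 = node name, $2 = package name
# (restart_pkg_cmds is a hypothetical helper; verify the options with man)
restart_pkg_cmds() {
    echo "cmmodpkg -e -n $1 $2"   # re-enable this node for the package
    echo "cmrunpkg -n $1 $2"      # start the package on this node
    echo "cmmodpkg -e $2"         # re-enable package switching afterwards
    echo "cmviewcl -v"            # check the resulting state
}
```

Calling `restart_pkg_cmds appl ORADB` lists the four commands for review; run them by hand, one at a time, once the service has been deactivated in the package.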

I have questions about /u01. First, is it a component of your application; more precisely, does it have an impact on the package ORADB? If yes, since it is in vg00, it can't follow the package during a failover, so you should have the same /u01 on node datasvr. Right? My idea is that there may be significant differences between /u01 on node appl and /u01 on node datasvr that could explain why the service works on datasvr and not on appl.

Even if there is no difference in /u01, keep in mind that both nodes must offer the same environment for the package ORADB to work. Your problem is probably there. Along the same lines, you should also check that /etc/cmcluster/ is the same on both nodes.
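One way to check that two copies of a package directory match is to compare checksums. This is a minimal sketch with made-up function names (dir_sums, same_pkg_dir); it assumes a copy of the other node's directory is available locally and skips *.log files, which naturally differ per node.

```shell
# List checksum, size and name of every non-log file under a directory,
# sorted by file name so that two listings can be compared.
# (dir_sums and same_pkg_dir are hypothetical helpers)
dir_sums() {
    ( cd "$1" && find . -type f ! -name '*.log' -exec cksum {} \; | sort -k 3 )
}

# Succeed only if both directories hold the same files with the same contents.
# $1, $2 = directories, e.g. /etc/cmcluster/oradb and a copy from the other node
same_pkg_dir() {
    a=$(dir_sums "$1") && b=$(dir_sums "$2") && [ "$a" = "$b" ]
}
```

Alternatively, run dir_sums on each node, save the two listings and diff them to see exactly which files differ.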

Hope this will help

Eric

Kenrick Sy · ‎12-27-2007

Hi Eric,

Attached are the oradb.sh and oradb.sh.abort.log files. /u01 is where the Oracle application was installed.

It resides on the server's internal hard disk. /u02, /u03, /u04, /u05, /u06 and /u07 reside on the SC10. I have the same /u01 on both appl and datasvr. This cluster has been running for more than 3 years, and the timestamp of oradb.sh is Nov. 16, 2000.

Thanks.

Kenrick

Kenrick Sy · ‎12-27-2007

Oradb.sh file.

Kenrick Sy · ‎12-27-2007

oradb.cntl.log file for appl

Eric SAUBIGNAC · ‎12-27-2007

Kenrick, the .log file would be more useful.

Kenrick Sy · ‎12-27-2007

oradb.sh.abort log file

Mridul Shrivastava · ‎12-27-2007

*** ora_pmon_feu has failed at startup time, ABORTING Oracle! ***
ORADB_MONITOR: Exiting with failed status 1

########### Node "appl": Halting package at Thu Dec 27 10:54:38 EAT 2007 ###########
Dec 27 10:54:38 - Node "appl": Halting service Oracle_DB
cmhaltserv : Service name Oracle_DB is not running.

*** /etc/cmcluster/oradb/oradb.sh called with shutdown argument! ***

This happened because Oracle did not start properly, which caused the monitoring script to shut Oracle down.

Time has a wonderful way of weeding out the trivial

Eric SAUBIGNAC · ‎12-27-2007

Kenrick, oradb.cntl.log is OK. My previous post was written before I saw this log file :-(

Well, there is clearly a problem with the database itself: the ps output shows no Oracle process, only the listener! It seems that the database crashes a few seconds after starting. Since the service monitors Oracle processes like ora_pmon_XXX, the service goes down, then the package goes down.
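To illustrate the kind of check such a monitor performs, here is a minimal sketch. It is not the oradb.sh from this cluster; pmon_running is a made-up name, and the instance name feu is taken from the ora_pmon_feu message in the abort log.

```shell
# Succeed if a process named ora_pmon_<sid> appears in `ps -ef` output on stdin.
# Usage: ps -ef | pmon_running feu
# (pmon_running is a hypothetical helper, not the monitor used by oradb.sh)
pmon_running() {
    awk -v name="ora_pmon_$1" '
        $NF == name { found = 1 }       # last column is the command name
        END { exit !found }'
}
```

A monitor built this way would call `ps -ef | pmon_running feu` in a loop and exit with status 1 as soon as the check fails, which is what makes cmcld report the service as down.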

So, in order to investigate, you must modify the package so that it starts neither the database nor the service. The package oradb should only mount the filesystems and assign the floating IP, that's all.

If you need help doing that, post the file oradb.cntl. I guess it will be standard and not very difficult to modify.

Once that is done, you will be able to start the package and the file systems will stay mounted, so a DBA will be able to analyze what is happening with Oracle. But I am not a DBA ;-(

Eric

Eric SAUBIGNAC · ‎12-27-2007

Kenrick,

In the end, the database doesn't start at all!

Oracle8i Enterprise Edition Release 8.1.6.0.0 - Production
With the Partitioning option
JServer Release 8.1.6.0.0 - Production

SVRMGR> Connected.
SVRMGR> ORA-27146: post/wait initialization failed
SVRMGR>
Server Manager complete.

Eric

Kenrick Sy · ‎12-27-2007

Hi Eric,

Thanks for your help. I have already left the site. I will post the oradb.cntl file tomorrow.

Regards,

Kenrick

Alexandr Khristenko · ‎12-29-2007

Hi Kenrick..
Try
cmapplyconf -P /etc/cmcluster//.conf
and say "yes"

Best RGS
Alex

Alexandr Khristenko · ‎12-29-2007

Hi Kenrick..
Try
cmapplyconf -P /etc/cmcluster//.conf
answer "yes", then try running the cluster again

Best RGS
Alex
