Serviceguard

Error while switching the cluster

expertone
Advisor

Let me explain the scenario.

 

We have two nodes. Both run Red Hat Linux, and HP Serviceguard is installed for the cluster.

 

The problem we are facing is that when NODE1 goes down, the package does not fail over automatically to NODE2, the configured failover node. I don't know whether a manual switch works, because I was only just assigned this case and nobody is available to give me the details.

 

When I run cmviewcl -v on node1, I get the following output:

 

CLUSTER                  STATUS
JISP_DATABASE_CLUSTER    up

  NODE         STATUS       STATE
  hathdb1      up           running

    Cluster_Lock_LUN:
    DEVICE               STATUS
    /dev/cciss/c0d0p1    up

    Network_Parameters:
    INTERFACE    STATUS       NAME
    PRIMARY      up           eth0
    PRIMARY      up           eth1

    PACKAGE      STATUS       STATE        AUTO_RUN     NODE
    oracle       up           running      disabled     hathdb1

      Policy_Parameters:
      POLICY_NAME     CONFIGURED_VALUE
      Failover        configured_node
      Failback        manual

      Script_Parameters:
      ITEM       STATUS   MAX_RESTARTS  RESTARTS   NAME
      Service    up       0             0          oracle_db_mon
      Service    up       5             0          oracle_lsnr_mon
      Subnet     up                                202.88.149.0
      Subnet     up                                192.168.0.0

      Node_Switching_Parameters:
      NODE_TYPE    STATUS       SWITCHING    NAME
      Primary      up           enabled      hathdb1 (current)
      Alternate    up           enabled      hathdb2

  NODE         STATUS       STATE
  hathdb2      up           running

    Cluster_Lock_LUN:
    DEVICE               STATUS
    /dev/cciss/c0d0p1    up

    Network_Parameters:
    INTERFACE    STATUS       NAME
    PRIMARY      up           eth0
    PRIMARY      up           eth1

 

And when I go to /oracle and run more on the file clusterciew, I get the output below:

 

[root@hathdb1 oracle]#
[root@hathdb1 oracle]# ll
total 2352
-rw-r--r-- 1 root root 8106 Feb 14 2011 1
drwxr-xr-x 2 root root 4096 Apr 24 2010 backup
-rw-r--r-- 1 root root 1603 Apr 9 2010 clusterciew
-rwx------ 1 root root 8105 Feb 15 2011 oracle.conf
-rwx------ 1 root root 8106 Feb 14 2011 oracle.conf-FEB02
-rwx------ 1 root root 8105 Aug 18 2010 oracle.conf.old
-rwx------ 1 root root 39407 Aug 4 2011 oracle.ctrl
-rwx------ 1 root root 39407 Aug 4 2011 oracle.ctrl_04-08-2011
-rwx------ 1 root root 39407 Feb 7 2007 oracle.ctrl.back
-rwx------ 1 root root 39407 Aug 18 2010 oracle.ctrl.back.old
-rwx------ 1 root root 39457 Feb 14 2011 oracle.ctrl-FEB02
-rwx------ 1 root root 39407 Feb 14 2011 oracle.ctrl-FEB-13-11
-rw-r--r-- 1 root root 610796 Mar 11 15:25 oracle.ctrl.log
-rw-r--r-- 1 root root 1454460 Apr 9 2010 oracle.ctrl.log_primary
-rwx------ 1 root root 39407 Aug 18 2010 oracle.ctrl.old
[root@hathdb1 oracle]# more
usage: more [-dflpcsu] [+linenum | +/pattern] name1 name2 ...
[root@hathdb1 oracle]#
[root@hathdb1 oracle]#
[root@hathdb1 oracle]#
[root@hathdb1 oracle]#
[root@hathdb1 oracle]# more clusterciew

CLUSTER                  STATUS
JISP_DATABASE_CLUSTER    down

  NODE         STATUS       STATE
  hathdb1      down         unknown

    Cluster_Lock_LUN:
    DEVICE               STATUS
    /dev/cciss/c0d0p1    unknown

    Network_Parameters:
    INTERFACE    STATUS       NAME
    PRIMARY      unknown      eth0
    PRIMARY      unknown      eth1

  NODE         STATUS       STATE
  hathdb2      down         unknown

    Cluster_Lock_LUN:
    DEVICE               STATUS
    /dev/cciss/c0d0p1    unknown

    Network_Parameters:
    INTERFACE    STATUS       NAME
    PRIMARY      unknown      eth0
    PRIMARY      unknown      eth1

UNOWNED_PACKAGES

    PACKAGE      STATUS       STATE        AUTO_RUN     NODE
    oracle       down                                   unowned

      Policy_Parameters:
      POLICY_NAME     CONFIGURED_VALUE
      Failover        unknown
      Failback        unknown

      Script_Parameters:
      ITEM       STATUS     NODE_NAME    NAME
      Subnet     unknown    hathdb1      202.88.149.0
      Subnet     unknown    hathdb1      192.168.0.0
      Subnet     unknown    hathdb2      202.88.149.0
      Subnet     unknown    hathdb2      192.168.0.0

      Node_Switching_Parameters:
      NODE_TYPE    STATUS       SWITCHING    NAME
      Primary      down                      hathdb1
      Alternate    down                      hathdb2

[root@hathdb1 oracle]#
[root@hathdb1 oracle]#

 

I do not understand this output. Can you explain why it looks like this?

 

I am also new to clustering, so I would be thankful for a solution.

 

For details please find the attachment.

 

Thanks and regards,

Ashish

5 REPLIES
Matti_Kurkela
Honored Contributor

Re: Error while switching the cluster

> PACKAGE STATUS STATE AUTO_RUN NODE

> oracle up running disabled hathdb1

 

This is the problem.

 

When AUTO_RUN is disabled, the automatic failover will not happen.

 

When you use cmhaltpkg to halt a package, it will automatically disarm AUTO_RUN for that package as a side effect. If you then start the package using cmrunpkg, you must remember to re-arm the automatic failover using the "cmmodpkg -e" command. In the case of the package listed above, the command would be:

# cmmodpkg -e oracle
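
To spot this condition without reading the whole listing by eye, a small filter over the cmviewcl output can help. This is only a sketch: the function name disabled_packages is made up for illustration, and it assumes the package table has the five columns seen in the output quoted above (PACKAGE STATUS STATE AUTO_RUN NODE).

```shell
#!/bin/sh
# Hypothetical helper (not part of Serviceguard): reads "cmviewcl -v" output on
# stdin and prints the name of every package whose AUTO_RUN state is "disabled".
# Usage: cmviewcl -v | disabled_packages
disabled_packages() {
  awk '
    # Header of a package table: PACKAGE STATUS STATE AUTO_RUN NODE
    $1 == "PACKAGE" && $4 == "AUTO_RUN" { in_pkgs = 1; next }
    # Tolerate blank lines between rows
    in_pkgs && NF == 0 { next }
    # Anything that is not a 5-column row (or is a section label) ends the table
    in_pkgs && (NF < 5 || $1 ~ /:$/) { in_pkgs = 0; next }
    in_pkgs && $4 == "disabled" { print $1 }
  '
}
```

Each name it prints is a candidate for "cmmodpkg -e <package_name>", once you have confirmed that automatic failover really should be on for that package.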

 

Newer versions of Serviceguard will actually remind you of this requirement each time you use the cmhaltpkg/cmrunpkg commands.

 

If the package is started as part of a cluster startup (cmruncl), then the AUTO_RUN state of each package will automatically be set to the default value set in the package configuration.

 

Note: there is also another form of the cmmodpkg command: "cmmodpkg -n <node_name> -e <package_name>". It looks very similar to the command listed above, but has a different purpose. If a package is disabled from starting on a particular node (e.g. because it failed starting up last time Serviceguard tried it there), you can use this command to re-enable it. In effect, this command tells Serviceguard: "The problem that prevented this package from starting on this node is now fixed, the package can be allowed to run on this node again."
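
The per-node form can be sketched the same way. The helper below is hypothetical (the name suggest_node_enable is not a Serviceguard command) and rests on two assumptions: that each package row directly follows its PACKAGE header in "cmviewcl -v" output, and that a node blocked for a package shows "disabled" in the SWITCHING column of Node_Switching_Parameters. It only prints suggested commands; it does not run them.

```shell
#!/bin/sh
# Hypothetical helper: reads "cmviewcl -v" output on stdin and prints one
# suggested "cmmodpkg -n <node> -e <package>" line for every node whose
# SWITCHING column is "disabled". Nothing is executed.
# Usage: cmviewcl -v | suggest_node_enable
suggest_node_enable() {
  awk '
    NF == 0 { next }
    # Remember the package named on the row after the package header
    $1 == "PACKAGE" && $4 == "AUTO_RUN" { want_pkg = 1; next }
    want_pkg { pkg = $1; want_pkg = 0; next }
    # Start of the node switching table for that package
    $1 == "NODE_TYPE" && $3 == "SWITCHING" { in_sw = 1; next }
    in_sw && ($1 == "Primary" || $1 == "Alternate") {
      if ($3 == "disabled") print "cmmodpkg -n " $4 " -e " pkg
      next
    }
    # Any other line ends the node switching table
    { in_sw = 0 }
  '
}
```

Review each suggested line before running it: re-enabling a node only makes sense after the problem that stopped the package from starting there has been fixed.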

 

 

The timestamp of the /oracle/clusterciew file is April 9, 2010. So it's almost three years old now. It seems to be an old copy of cmviewcl output, and certainly not relevant any more. The only thing it can tell you is that the cluster was down at some time on that day. This file is not used by standard Serviceguard configuration: most likely you can just delete it.

MK
expertone
Advisor

Re: Error while switching the cluster

Thanks MK.

So there is no relation between the /oracle/clusterciew file from April 9, 2010 and the present state.

One more thing I want to ask: I have searched everywhere on the server, but I was not able to find the cmclustercl.ascii file. On HP-UX I have seen that there is an ASCII file. Is that not the case on Linux?

So if I am not wrong, the above is the only error in my cluster and I have to fix that.
Matti_Kurkela
Honored Contributor

Re: Error while switching the cluster

The ASCII file is actually used only when submitting configuration changes to Serviceguard using the cmapplyconf command. After the cmapplyconf command has completed successfully, the ASCII configuration file is not needed, since the configuration has been stored in the binary configuration file, which is automatically kept in sync between cluster members by Serviceguard.

 

It is the habit of many Serviceguard administrators (including myself) to leave the latest ASCII file around for documentation/reference purposes, but if you don't have it, you can easily get an ASCII copy of the current configuration  with the cmgetconf command if you need it.
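
As a rough sketch of that, the wrapper below prints (or, with DRY_RUN=0, runs) the two cmgetconf calls that dump the cluster and package configuration into dated files. The function names run and dump_config, the DRY_RUN switch, and the output file naming are all assumptions made for illustration; check the -c and -p options against the cmgetconf man page for your Serviceguard version before running it for real.

```shell
#!/bin/sh
# run: print the command when DRY_RUN=1 (the default here), execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# dump_config CLUSTER PACKAGE [OUTDIR]   (hypothetical helper name)
# Writes the current cluster and package configuration to dated ASCII files.
dump_config() {
  cluster=$1
  pkg=$2
  outdir=${3:-.}
  stamp=$(date +%Y%m%d)
  run cmgetconf -c "$cluster" "$outdir/$cluster.$stamp.ascii" &&
  run cmgetconf -p "$pkg" "$outdir/$pkg.$stamp.conf"
}
```

For the cluster in this thread, "dump_config JISP_DATABASE_CLUSTER oracle /tmp" would print the two commands. Putting the date in the file name makes it obvious later how old a copy is.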

 

(In some situations, it would be better to use cmgetconf instead of relying on old, possibly out-of-date or out-of-sync copies of the ASCII configuration files. Therefore, if the current configuration is sufficiently documented elsewhere, it is possible to argue that it might actually be a Good Thing to *not* have the old ASCII files around: it removes the temptation to look at the possibly-obsolete files and forces the sysadmin to always get the up-to-date configuration information with cmgetconf.

 

The counterargument is that in disaster-recovery situations, having the ASCII files around will speed up recovery. Well, yes; but *only* if they are up to date and if there is no need to significantly change them to adapt to post-disaster situation.)

MK
expertone
Advisor

Re: Error while switching the cluster

Hi MK

But if you look at the attachment, you can see that the setting is enabled, i.e. AUTO_RUN YES.

Please reply ASAP.
Regards Ashish.
Matti_Kurkela
Honored Contributor

Re: Error while switching the cluster

The AUTO_RUN configuration setting and the AUTO_RUN package state are not quite the same thing.

 

When you run "cmviewcl", the AUTO_RUN field in the output indicates the package state.

 

The configuration setting is listed in the package configuration file and is either YES or NO. It determines what happens to the package when the cluster is being started:

  • If it is set to YES, the package will be automatically started at cluster start-up, and its AUTO_RUN package state will be set to "enabled" when the cluster start-up is completed. The package will then be up & running and fully ready for failover.
  • If it is set to NO, the package will not be automatically started at cluster start-up, and its AUTO_RUN package state will be "disabled" at the end of the cluster start-up.

The AUTO_RUN package state is dynamic: it is maintained by Serviceguard as part of the overall cluster state information. It can be updated by commands like cmhaltpkg and cmmodpkg. Updating the package state will not change the configuration setting: it just means that the package state has been changed from the configured initial state. When the cluster is halted, all the package state information will be forgotten: when the cluster starts up, the package states will be initialized from scratch to the values set in the package configuration file.
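
To keep the two apart in practice, it can help to read the configured value separately from the cmviewcl state. The sketch below uses two hypothetical helpers, configured_auto_run and startup_state. It assumes the package configuration file contains a line of the form "AUTO_RUN YES" or "AUTO_RUN NO" (matched case-insensitively) and that comment lines start with "#".

```shell
#!/bin/sh
# Hypothetical helpers, not part of Serviceguard.

# configured_auto_run: reads a package configuration file on stdin and prints
# the configured AUTO_RUN value in upper case (YES or NO).
configured_auto_run() {
  awk 'toupper($1) == "AUTO_RUN" { print toupper($2); exit }'
}

# startup_state: maps the configured value to the AUTO_RUN package state
# expected right after cluster start-up, following the two rules above.
startup_state() {
  case "$1" in
    YES) echo enabled ;;
    NO)  echo disabled ;;
    *)   echo unknown ;;
  esac
}
```

For example, "startup_state "$(configured_auto_run < oracle.conf)"" gives the state the package starts with after cmruncl. If cmviewcl shows something else, the state has been changed since start-up, for instance by cmhaltpkg.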

 

If the AUTO_RUN package state is "enabled" and the package is down, Serviceguard will immediately attempt to restart the package on the most appropriate node (as determined by the package configuration). For this reason, the cmhaltpkg command must set the AUTO_RUN package state to "disabled" before halting the package, or else Serviceguard would just immediately restart the package. That would be rather silly.

 

Sometimes, you may want to disable failover temporarily but keep the package running on the current node, for example when you're performing maintenance on the alternate node and don't want the package to fail over there while you're doing the maintenance work. This is easy to do: just use "cmmodpkg -d <package name>" to set the AUTO_RUN package state to "disabled", and the package won't fail over automatically. When the maintenance is over, you can use "cmmodpkg -e <package name>" to restore the package state to "enabled", and the automatic failover will work again.
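
A maintenance window like that could be wrapped in a small script so that the re-enable step is not forgotten. This is a sketch under assumptions: the names run and with_failover_disabled and the DRY_RUN switch are invented for illustration, and by default the cmmodpkg commands are only printed, not executed.

```shell
#!/bin/sh
# run: print the command when DRY_RUN=1 (the default here), execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# with_failover_disabled PACKAGE COMMAND [ARGS...]   (hypothetical helper)
# Disables automatic failover, runs the maintenance command, then re-enables
# failover whether or not the maintenance command succeeded.
with_failover_disabled() {
  pkg=$1
  shift
  run cmmodpkg -d "$pkg" || return 1
  "$@"
  rc=$?
  run cmmodpkg -e "$pkg" || return 1
  return "$rc"
}
```

Re-enabling failover even after a failed maintenance command is a design choice made here. If the alternate node might be left in a broken state, you may prefer to leave AUTO_RUN disabled and re-enable it by hand after checking.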

MK