running cluster services on 2nd node causes first to fail

Marty Hoff · ‎03-01-2004

We recently applied the Dec. 2003 11i patch pack to our cluster. Since the installation, Serviceguard has not run correctly.

Using cmrunnode, I can bring one machine up into the cluster by itself, but when I try to bring the 2nd machine into the cluster, it causes the first machine to fail.

If I am actually running packages on the first machine, then the first machine hangs and reboots. As you can imagine, this is not a good situation.

We are running 11.11 on both machines, and the Serviceguard version is A.11.09. I am running the latest patch, PHSS_27158, although the problem persists when I remove that patch as well.
(I tried removing it and then running the cluster, which did not help, so I reinstalled the patch.)

Below is the sequence of messages in the syslog from the 1st node (called switch) when the 2nd node (called mouse) tried to join the cluster. Not long after this, switch failed completely and rebooted itself.

I was then able to bring the cluster up on the 2nd node, once the 1st node had failed. However, when the 1st node finished rebooting, it caused the 2nd node to panic in the same manner when the cluster services tried to start at boot time.

Anyone have any advice? Thanks,

Marty

Mar 2 04:42:41 switch cmcld: New node mouse is joining the cluster
Mar 2 04:42:41 switch cmcld: Error. Cannot accept mouse into the cluster
Mar 2 04:42:41 switch cmcld: Please use cmrunnode instead of cmruncl
Mar 2 04:42:41 switch cmcld: Attempting to kill node mouse
Mar 2 04:42:41 switch cmcld: Reason: Incorrect use of a cluster run command
Mar 2 04:42:53 switch cmcld: New node mouse is joining the cluster
Mar 2 04:42:53 switch cmcld: Error. Cannot accept mouse into the cluster
Mar 2 04:42:53 switch cmcld: Please use cmrunnode instead of cmruncl
Mar 2 04:42:53 switch cmcld: Attempting to kill node mouse
Mar 2 04:42:53 switch cmcld: Reason: Incorrect use of a cluster run command
Mar 2 04:43:43 switch cmcld: New node mouse is joining the cluster
Mar 2 04:43:43 switch cmcld: Attempting to adjust cluster membership
Mar 2 04:43:47 switch cmcld: Enabling safety time protection
Mar 2 04:43:47 switch cmcld: Clearing First Dual Cluster Lock
Mar 2 04:43:47 switch vmunix: Failed to set socket receive buffer, Invalid argument
Mar 2 04:43:47 switch vmunix: Service Guard Aborting!
Mar 2 04:43:47 switch vmunix: Cause: setsockopt failed
Mar 2 04:43:47 switch cmcld: Failed to set socket receive buffer, Invalid argument
Mar 2 04:43:48 switch vmunix: (File: comm_ip.c, Line: 5709)
Mar 2 04:43:48 switch vmunix: Aborting! setsockopt failed
Mar 2 04:43:48 switch vmunix: (file: comm_ip.c, line: 5709)
Mar 2 04:43:47 switch cmcld: Aborting! setsockopt failed
Mar 2 04:43:49 switch cmlvmd: Could not read messages from /usr/lbin/cmcld: Software caused connection abort
Mar 2 04:43:49 switch cmlvmd: CLVMD exiting
Mar 2 04:43:49 switch cmtaped[12368]: Lost connection to the cluster daemon.
Mar 2 04:43:49 switch cmtaped[12368]: cmtaped terminating. (ATS 1.14)
Mar 2 04:43:49 switch cmclconfd[12360]: The ServiceGuard daemon, /usr/lbin/cmcld[12361], died upon receiving the signal 6.
Mar 2 04:43:49 switch cmsrvassistd[12365]: Lost connection to the cluster daemon.
Mar 2 04:43:49 switch cmsrvassistd[12365]: Lost connection with ServiceGuard cluster daemon (cmcld): Software caused connection abort
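
When the syslog is long, it can help to pull out just the abort-related lines before reading the surrounding context. The following is a minimal sketch, not a Serviceguard tool: the function name `sg_abort_lines` is hypothetical, and the patterns come only from the messages quoted above.

```shell
# Hypothetical helper: print only the abort-related lines from a syslog file.
# The patterns are copied from the excerpt above; extend them for other failures.
sg_abort_lines() {
    grep -E 'Aborting|setsockopt failed|Failed to set socket|died upon receiving' "$1"
}
```

On HP-UX the system log is normally /var/adm/syslog/syslog.log, so a call would look like `sg_abort_lines /var/adm/syslog/syslog.log`.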

Jeff Schussele · ‎03-01-2004

Hi Marty,

Not positive, but I believe 11.09 may no longer be supported - 11.15 is current. I would *strongly* recommend that you come up to at least 11.13 or 11.14. You may be running into a conflict with that SG version & a new patch.

Also you need only run cmruncl on a *single* node to initially start the cluster. Then if you need to have another node join, you run cmrunnode on it. Try that first & if still trouble, then it's time to upgrade MC/SG. I'd be planning on that anyway.

Rgds,
Jeff

PERSEVERANCE -- Remember, whatever does not kill you only makes you stronger!

Karthik S S · ‎03-01-2004

Are the cluster lock and other shared disks accessible from both nodes?

-Karthik S S

For a list of all the ways technology has failed to improve the quality of life, please press three. - Alice Kahn

melvyn burnard · ‎03-01-2004

Well this has nothing to do with shared discs, nor cluster lock disc.
It appears that there may be some problem with the network that is causing sockets to become lost or unavailable.
As already stated, 11.09 is no longer supported; you should seriously consider updating to 11.14 or 11.15, with patches.

A few suggestions to start with (for both nodes).
1) confirm you do have the 11.09 SG patch, do:
what /usr/lbin/cmcld
and verify the version and patch version
2) confirm all those patches installed and configured ok, do:
swlist -l fileset -a state |grep -e corrupt -e transient -e install
and see if anything gets returned
3) again using swlist, check you have the correct fileset for the version of Serviceguard you are running.
4) confirm the correct command was used to get the second node to join the running cluster (as per the log file use cmrunnode).
If the cluster is actually running on one node, you could consider removing the failing node from the running cluster, and then re-adding it to see if this fixes the symptoms.

My house is the bank's, my money the wife's, But my opinions belong to me, not HP!
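
The state check in step 2 above matches any line containing "install", which is why filesets in the "installed" state show up in its output. A slightly stricter variant is sketched below: it reads `swlist -l fileset -a state` output on standard input and prints every fileset whose state is anything other than `configured`. The function name `not_configured` is hypothetical, and the sketch assumes each data line is a fileset name followed by its state, with header lines starting with `#`.

```shell
# Hypothetical helper: list filesets whose state is not "configured".
# Assumes input lines of the form "<fileset> <state>"; lines starting
# with "#" (swlist headers) and blank lines are skipped.
not_configured() {
    awk '$1 !~ /^#/ && NF == 2 && $2 != "configured" { print $1, $2 }'
}
```

Usage would be `swlist -l fileset -a state | not_configured`; no output means every fileset listed is in the configured state.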

Marty Hoff · ‎03-02-2004

Where is the upgrade to Serviceguard available? We do not have a service contract with HP at the moment.

melvyn burnard · ‎03-02-2004

The newer software is available as part of your software contract with HP.
If you do not have a contract, you will need to purchase it, I am afraid.

My house is the bank's, my money the wife's, But my opinions belong to me, not HP!

Jeff Schussele · ‎03-02-2004

Hi Marty,

Well, if you had one then you could load:

MC/SG 11.14 => PHSS_30028
MC/SG 11.15 => PHSS_30087

Rgds,
Jeff

PERSEVERANCE -- Remember, whatever does not kill you only makes you stronger!

Marty Hoff · ‎03-02-2004

The following filesets show "installed" instead of "configured". How do I fix this, so I can see whether it is causing the problem?

Thanks.

PHKL_25209.CORE2-KRN installed
PHKL_25212.C-INC installed
PHKL_25212.CORE-KRN installed
PHKL_25212.CORE2-KRN installed
PHKL_25238.CORE2-KRN installed
PHKL_25367.CORE2-KRN installed
PHKL_25368.CORE2-KRN installed
PHKL_25375.CORE2-KRN installed
PHKL_25428.KERN2-RUN installed
PHKL_25506.C-INC installed
PHKL_25506.CORE-KRN installed
PHKL_25506.CORE2-KRN installed
PHKL_25593.CORE2-KRN installed
PHKL_25602.CORE2-KRN installed
PHKL_25761.CORE2-KRN installed
PHKL_25773.CORE2-KRN installed
PHKL_25871.CORE2-KRN installed
PHKL_26002.CORE2-KRN installed
PHKL_26032.C-INC installed
PHKL_26032.CORE2-KRN installed
PHKL_26074.CORE2-KRN installed
PHKL_26087.CORE2-KRN installed
PHKL_26104.VXFS-BASE-KRN installed
PHKL_26269.C-INC installed
PHKL_26269.CORE-KRN installed
PHKL_26405.C-INC installed
PHKL_26405.CORE-KRN installed
PHKL_26405.CORE2-KRN installed
PHKL_26425.CORE-KRN installed
PHKL_26425.CORE2-KRN installed
PHKL_26464.CORE2-KRN installed
PHKL_26552.VXFS-BASE-KRN installed
PHKL_26705.CORE2-KRN installed
PHKL_26719.CORE2-KRN installed
PHKL_26755.CORE2-KRN installed
PHKL_26834.CORE2-KRN installed
PHKL_27025.CORE2-KRN installed
PHKL_27025.KERN2-RUN installed
PHKL_27054.CORE2-KRN installed
PHKL_27179.CORE2-KRN installed
PHKL_27200.CORE2-KRN installed
PHKL_27266.CORE2-KRN installed
PHKL_27304.C-INC installed
PHKL_27304.CORE-KRN installed
PHKL_27304.CORE2-KRN installed
PHKL_27304.KERN2-RUN installed
PHKL_27321.CORE2-KRN installed
PHKL_27431.KERN2-RUN installed
PHKL_27447.CORE2-KRN installed
PHKL_27498.CORE2-KRN installed
PHKL_27531.CORE2-KRN installed
PHKL_27532.CORE2-KRN installed
PHKL_27682.CORE2-KRN installed
PHKL_27686.CORE2-KRN installed
PHKL_27688.CORE2-KRN installed
PHKL_27715.CORE2-KRN installed
PHKL_27727.CORE2-KRN installed
PHKL_27734.VXFS-BASE-KRN installed
PHKL_27737.CORE2-KRN installed
PHKL_27751.CORE2-KRN installed
PHKL_27751.FCMS-ENG-A-MAN installed
PHKL_27757.CORE2-KRN installed
PHKL_27778.CORE2-KRN installed
PHKL_27839.CORE2-KRN installed
PHKL_27918.CORE2-KRN installed
PHKL_27949.CORE2-KRN installed
PHKL_28025.C-INC installed
PHKL_28025.CORE-KRN installed
PHKL_28025.CORE2-KRN installed
PHKL_28100.CORE2-KRN installed
PHKL_28113.CORE2-KRN installed
PHKL_28114.CORE2-KRN installed
PHKL_28185.VXFS-BASE-KRN installed
PHKL_28326.CORE2-KRN installed
PHKL_28410.CORE2-KRN installed
PHNE_24829.INET-ENG-A-MAN installed
PHNE_24829.INETSVCS-RUN installed
PHNE_24829.NET2-KRN installed
PHNE_25083.STRTIO-KRN installed
PHNE_25083.STRTIO2-KRN installed
PHNE_26939.GE-KRN installed
PHNE_26939.GE-RUN installed

Marty Hoff · ‎03-02-2004

4) confirm the correct command was used to get the second node to join the running cluster (as per the log file use cmrunnode).
If the cluster is actually running on one node, you could consider removing the failing node from the running cluster, and then re-adding it to see if this fixes the symptoms.

----------------
How do you remove a machine from the cluster and then re-add it? Sorry for the basic questions, but I am a relative novice at HP administration (our HP fellow was laid off some time ago).

Thanks.

Marty

Jeff Schussele · ‎03-02-2004

Hi Marty,

Run the following

swconfig \*

to configure installed filesets. If that fails you may have to force-reinstall them or even swremove & reinstall 'em.

HTH,
Jeff

PERSEVERANCE -- Remember, whatever does not kill you only makes you stronger!
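
If configuring everything with `swconfig \*` is more than you want to attempt in one go, one option is to work patch by patch. The sketch below only prints commands and runs nothing: it reads fileset lines like those in the list earlier in the thread, reduces them to unique patch names, and emits one `swconfig` line per patch. The function name `swconfig_plan` is hypothetical, and whether a given kernel patch will configure cleanly this way is an assumption the thread does not confirm.

```shell
# Hypothetical helper: turn "<patch>.<fileset> <state>" lines into a list of
# per-patch swconfig commands. It prints the commands; it does not run them.
swconfig_plan() {
    awk '$1 !~ /^#/ && NF >= 2 { split($1, part, "."); print part[1] }' |
        sort -u |
        while read -r patch; do
            echo "swconfig $patch"
        done
}
```

Feeding it the saved fileset list gives a plan that can be reviewed and then run one line at a time, re-checking the fileset states after each step.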
