topic Re: network problem starting cluster in Operating System - HP-UX

network problem starting cluster

Rob Payne — Thu, 14 Jul 2005 09:40:36 GMT

I got past my other problem. Now, when I attempt to start the cluster, it fails with the following messages in /var/adm/syslog/syslog.log:

Jul 14 12:27:06 jmar1 SAM cl adm[6569]: Start cluster jmar_cluster1 on all nodes
Jul 14 12:27:09 jmar1 cmclconfd[6576]: Executing "/usr/lbin/cmcld" for node jmar1
Jul 14 12:27:09 jmar1 cmcld: Logging level changed to level 0.
Jul 14 12:27:09 jmar1 cmcld: Daemon Initialization - Maximum number of packages supported for this incarnation is 10.
Jul 14 12:27:09 jmar1 cmcld: Global Cluster Information:
Jul 14 12:27:09 jmar1 cmcld: Heartbeat Interval is 1 seconds.
Jul 14 12:27:09 jmar1 cmcld: Logging level changed to level 0.
Jul 14 12:27:09 jmar1 cmcld: Node Timeout is 2 seconds.
Jul 14 12:27:09 jmar1 cmcld: Network Polling Interval is 2 seconds.
Jul 14 12:27:09 jmar1 cmcld: Auto Start Timeout is 600 seconds.
Jul 14 12:27:09 jmar1 cmcld: Information Specific to node jmar1:
Jul 14 12:27:09 jmar1 cmcld: Cluster lock disk: /dev/dsk/c9t0d0.
Jul 14 12:27:09 jmar1 cmcld: lan0 0x00306e0960b2 140.139.46.121 bridged net:1
Jul 14 12:27:09 jmar1 cmcld: lan1 0x00306e08171b 10.1.1.1 bridged net:2
Jul 14 12:27:09 jmar1 cmcld: Heartbeat Subnet: 10.0.0.0
Jul 14 12:27:09 jmar1 cmcld: The maximum # of concurrent local connections to the daemon that will be supported is 1014.
Jul 14 12:27:09 jmar1 cmcld: Lookup of link /nodes/jmar1/networks/lan/lan1/peers failed.
Jul 14 12:27:09 jmar1 cmcld: Unable to send DLPI info request, Bad file number
Jul 14 12:27:09 jmar1 cmcld: cl_abort: abort cl_kepd_printf failed: Invalid argument
Jul 14 12:27:09 jmar1 cmcld: cl_kepd_printf, fstat: kepd_fd=8, st_dev=1073741827, st_ino=446, st_rdev=-486539264
Jul 14 12:27:09 jmar1 cmcld: Aborting! Failed to communicate with DLPI
Jul 14 12:27:09 jmar1 cmlvmd: init_cdb_callback: starting
Jul 14 12:27:09 jmar1 cmcld: Waiting for connection request from CMGMSD
Jul 14 12:27:09 jmar1 cmcld: CMGMSD successfully started
Jul 14 12:27:12 jmar1 cmsrvassistd[6580]: The cluster daemon aborted our connection.
Jul 14 12:27:12 jmar1 cmsrvassistd[6580]: Lost connection with ServiceGuard cluster daemon (cmcld): Software caused connection abort
Jul 14 12:27:12 jmar1 cmlvmd: callback_thread: Calling process callback
Jul 14 12:27:12 jmar1 cmlvmd: Could not read messages from /usr/lbin/cmcld: Software caused connection abort
Jul 14 12:27:12 jmar1 cmlvmd: CLVMD exiting
Jul 14 12:27:12 jmar1 cmgmsd[6587]: The cluster daemon aborted our connection.
Jul 14 12:27:12 jmar1 cmgmsd[6587]: Unable to send 92 bytes (Software caused connection abort).
Jul 14 12:27:12 jmar1 cmclconfd[6578]: The cluster daemon aborted our connection.
Jul 14 12:27:12 jmar1 cmclconfd[6576]: The ServiceGuard daemon, /usr/lbin/cmcld[6577], died upon receiving signal number 6.
Jul 14 12:27:12 jmar1 cmclconfd[6589]: Failed to open connection to cmcld: No such file or directory
Jul 14 12:27:12 jmar1 cmtaped[6588]: cmtaped - failed to set up sdb callback. (ATS 1.8)
Jul 14 12:27:12 jmar1 cmtaped[6588]: Failed to set callback: 6004
Jul 14 12:28:04 jmar1 SAM cl adm[6569]: Fail to form and start cluster jmar_cluster1

It appears to be a network problem of some sort.
I have attached the cluster config file.

cheers .. rob

Re: network problem starting cluster

melvyn burnard — Thu, 14 Jul 2005 09:56:42 GMT

OK, a few questions to ask here.
What version of SG, and what version of SGeRAC?
What SG and SGeRAC patches are installed?
Please run cmscancl and post the contents of the /tmp/scancl.out file.

Re: network problem starting cluster

Rob Payne — Thu, 14 Jul 2005 10:26:56 GMT

MC Serviceguard A.11.15.00
Serviceguard Extension for RAC A.11.15.00

scancl.out is attached

thanks .. rob

Re: network problem starting cluster

melvyn burnard — Thu, 14 Jul 2005 11:44:42 GMT

Well from the scancl.out, it seems you have two lan cards configured on the same subnet, which is not allowed for SG:
lan2* 1500 10.0.0.0 10.1.1.2
lan1 1500 10.0.0.0 10.1.1.1
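
To see why both cards count as being on the same subnet, the network address can be computed by ANDing each IP address with its netmask. The sketch below assumes a netmask of 255.0.0.0 for lan1 and lan2 (inferred from the 10.0.0.0 network shown above, not stated in the thread) and 255.255.255.0 for lan0 (purely a guess for illustration); network_of is a hypothetical helper name, not a Serviceguard or HP-UX command.

```shell
# Print the network address for a dotted-quad IPv4 address and netmask.
# The body runs in a subshell so the IFS change does not leak to the caller.
network_of() (
    IFS=.
    # Split "a.b.c.d" and "m1.m2.m3.m4" into eight positional parameters.
    set -- $1 $2
    echo "$(( $1 & $5 )).$(( $2 & $6 )).$(( $3 & $7 )).$(( $4 & $8 ))"
)

# Netmasks here are assumptions; take the real ones from your own interface config.
network_of 10.1.1.1 255.0.0.0
network_of 10.1.1.2 255.0.0.0
network_of 140.139.46.121 255.255.255.0
```

If two interfaces that are both configured with an IP address print the same network, that matches the condition described above.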

I would also suggest that you change the heartbeat and node timeout intervals before this goes into production, as the default settings are normally insufficient:
heartbeat interval: 1.00 (seconds)
node timeout: 2.00 (seconds)

I would suggest changing these to 2 and 4 seconds respectively
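
As a rough sketch of what that change looks like in the cluster ASCII configuration file: to my recollection, Serviceguard releases of this generation express HEARTBEAT_INTERVAL and NODE_TIMEOUT in microseconds, but treat that unit as an assumption and check the comments in your own config file. sg_timing_lines is a hypothetical helper that only prints the two lines; it does not edit or apply anything.

```shell
# Print cluster ASCII file lines for a heartbeat interval and node timeout
# given in whole seconds, assuming the file expects microseconds.
sg_timing_lines() {
    echo "HEARTBEAT_INTERVAL $(( $1 * 1000000 ))"
    echo "NODE_TIMEOUT $(( $2 * 1000000 ))"
}

# Suggested values from this post: 2 seconds and 4 seconds.
sg_timing_lines 2 4
```

The edited file would still need to be checked and applied with the usual Serviceguard configuration commands before the new values take effect.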

One possibility I have seen before is that the CDB has got corrupted.
If sorting out the above network config does not fix it, you may be forced to delete the config using cmdeleteconf and then recreate it.

As a final comment, you appear not to have either SG or SGeRAC patched.

To check, run:
what /usr/lbin/cmcld |grep PHSS

and

what /usr/lbin/cmgmsd | grep PHSS

If no patch is listed, obtain the patches from the ITRC.
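
When comparing several nodes, it can help to reduce the what output to just the patch ID. This is a minimal sketch that assumes lines of the form "A.11.15.00 Date: 09/16/03 Patch: PHSS_29053", which is the format reported later in this thread; patch_id is a hypothetical helper name.

```shell
# Read `what` output on stdin and print only the PHSS patch IDs it contains.
patch_id() {
    sed -n 's/.*Patch: \(PHSS_[0-9][0-9]*\).*/\1/p'
}

# Sample line copied from this thread. On a real node you would run:
#   what /usr/lbin/cmcld | patch_id
echo 'A.11.15.00 Date: 09/16/03 Patch: PHSS_29053' | patch_id
```

Lines without a "Patch: PHSS_..." field produce no output, so an empty result suggests the binary is unpatched.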

Re: network problem starting cluster

Rob Payne — Fri, 15 Jul 2005 07:42:02 GMT

# what /usr/lbin/cmcld |grep PHSS
A.11.15.00 Date: 09/16/03 Patch: PHSS_29053
# what /usr/lbin/cmgmsd |grep PHSS
A.11.15.00 Date: 03/09/05 Patch: PHSS_32859

This is what I got. Are these patches that are already installed, or patches that I need to install?

cheers ... rob

Re: network problem starting cluster

Rob Payne — Fri, 15 Jul 2005 08:04:29 GMT

OK, I added another patch to Serviceguard (PHSS_32660); here is the what output:

# what /usr/lbin/cmcld |grep PHSS
A.11.15.00 Date: 03/09/05 Patch: PHSS_32660
# what /usr/lbin/cmgmsd |grep PHSS
A.11.15.00 Date: 03/09/05 Patch: PHSS_32859
#

Re: network problem starting cluster

vinod_25 — Sun, 24 Jul 2005 23:31:14 GMT

hi rob

Since the cmcld process aborted there should be a core file in /var/adm/cmcluster.
A. Verify that the core file creation time matches the time of the dump.
B. Use adb to obtain the stack from the core file:

# adb cmcld core

Please attach the output.

regards

vinod