
cannot start service guard

 
J.D._3
Frequent Advisor

Running MCSG ver A.11.14.02-99 on 2 Red Hat Linux 7.3 machines, kernel 2.4.18-19.7.xsmp, configured as a 2-node cluster with a quorum server. A package hung in a starting status, so I could not stop the cluster. I rebooted the 1st node, and it would not join the cluster with cmrunnode. I then halted the cluster on the second node, but now the cluster will not start. It was stopping and starting fine before, but only without any packages; I had problems starting the one package. This is, by the way, a newly configured cluster. Now the cluster is down. When I try to run cmruncl on the 1st node (the rebooted node), it complains with this error:

[root@seacliff cmcluster]# cmruncl
Unable to open communications to configuration daemon: Connection refused
Unable to connect to configuration database.
Unable to open communications to configuration daemon: No such file or directory
cmruncl : Unable to determine the nodes on the current cluster
cmruncl : Either no cluster configuration file exists, or the file is corrupted, or cmclconfd is unable to run
[root@seacliff cmcluster]# ps -ef |grep cmclconfd
root 1696 1015 0 03:51 ? 00:00:00 cmclconfd -p
root 1698 1518 0 03:51 pts/0 00:00:00 grep cmclconfd

When trying on the other node it complained with this error:

[root@augusta cmcluster]# cmruncl
Cannot connect to configuration daemon (cmclconfd) on node seacliff
Unable to execute command remotely

I tried copying the whole /usr/local/cmcluster directory to the rebooted node so that I could reapply the ascii config file, but no luck. cmviewcl on the rebooted node gave this error:

cmviewcl : Unable to query the package information: cluster may be reforming, try again: Communication error on send.

cmviewcl on the other node shows the cluster, the nodes, and the packages all down. The /var/log/messages on the rebooted server shows the following:

Jul 18 03:17:15 seacliff cmcluster: Created /dev/deadman c 10 63
Jul 18 03:17:15 seacliff cmcluster: /usr/local/cmcluster/bin/cmresmond failed to start. See /usr/local/cmcluster/ResMonServer.log for details.
Jul 18 03:17:16 seacliff CM-CMD[1361]: /usr/local/cmcluster/bin/cmrunnode -v
Jul 18 03:17:16 seacliff cmcluster: ERROR: Unable to join cluster.

Any help on this is greatly appreciated. Thank you.

21 REPLIES 21
Steven E. Protter
Exalted Contributor

Re: cannot start service guard

Two thoughts:

1) piranha
2) Software Service Contract with HP. Service guard is powerful, but hard.

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
J.D._3
Frequent Advisor

Re: cannot start service guard

I was hoping no one would point me in that direction. Has anybody had the same or a similar problem? I'm trying not to have to rebuild from scratch. Thanks.
Stuart Browne
Honored Contributor

Re: cannot start service guard

I've not used these things (specifically), but I have used Piranha before.

Doesn't it require full remote-host equivalency for the user running Service Guard (i.e. root can rsh etc.)?

Also check to make sure the firewall isn't interfering.
One long-haired git at your service...
J.D._3
Frequent Advisor

Re: cannot start service guard

Stuart,
I can ping and rsh from either machine, and each machine can rsh to itself as well. Telnet works for both machines also. There are no firewalls between them, and both are on the same subnet.
Balaji N
Honored Contributor

Re: cannot start service guard

hi,
when you say rsh is working, I assume you mean rsh as root.

are you sure there is no problem with the subnet configuration in the cluster configuration file? I have had problems with incorrect entries. can you check that as well?

-balaji (based on my experience with MCSG on HP-UX)
Its Always Important To Know, What People Think Of You. Then, Of Course, You Surprise Them By Giving More.
J.D._3
Frequent Advisor

Re: cannot start service guard

Balaji,

Yes, I'm using root all the way. Can you point me in the right direction on how to troubleshoot the subnet configuration or connectivity?

Thanks
Balaji N
Honored Contributor

Re: cannot start service guard

hi,

can you post the following

1. ifconfig -a on both the machines.

2. the relevant configuration of the HB on both the machines.

let me or the forumers out here see if anything is fishy.

-b-
J.D._3
Frequent Advisor

Re: cannot start service guard

For node seacliff:

eth0 Link encap:Ethernet HWaddr 00:30:48:24:ED:E0
inet addr:172.31.50.233 Bcast:172.31.50.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:389151 errors:0 dropped:0 overruns:0 frame:0
TX packets:213381 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:86488658 (82.4 Mb) TX bytes:43372291 (41.3 Mb)
Interrupt:17 Base address:0x4400 Memory:fc321000-fc321038

eth0:2 Link encap:Ethernet HWaddr 00:30:48:24:ED:E0
inet addr:172.31.50.124 Bcast:172.31.50.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
Interrupt:17 Base address:0x4400 Memory:fc321000-fc321038

eth0:3 Link encap:Ethernet HWaddr 00:30:48:24:ED:E0
inet addr:172.31.50.125 Bcast:172.31.50.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
Interrupt:17 Base address:0x4400 Memory:fc321000-fc321038

eth0:4 Link encap:Ethernet HWaddr 00:30:48:24:ED:E0
inet addr:172.31.50.220 Bcast:172.31.50.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
Interrupt:17 Base address:0x4400 Memory:fc321000-fc321038

eth0:5 Link encap:Ethernet HWaddr 00:30:48:24:ED:E0
inet addr:172.31.50.126 Bcast:172.31.50.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
Interrupt:17 Base address:0x4400 Memory:fc321000-fc321038

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:23387 errors:0 dropped:0 overruns:0 frame:0
TX packets:23387 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:2323018 (2.2 Mb) TX bytes:2323018 (2.2 Mb)

For node augusta:

eth0 Link encap:Ethernet HWaddr 00:30:48:23:86:0E
inet addr:172.31.50.208 Bcast:172.31.50.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:17813240 errors:0 dropped:0 overruns:0 frame:0
TX packets:25421266 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:2779927657 (2651.1 Mb) TX bytes:3875068562 (3695.5 Mb)
Interrupt:17 Base address:0x4400 Memory:fc321000-fc321038

eth0:1 Link encap:Ethernet HWaddr 00:30:48:23:86:0E
inet addr:172.31.50.127 Bcast:172.31.50.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
Interrupt:17 Base address:0x4400 Memory:fc321000-fc321038

eth0:2 Link encap:Ethernet HWaddr 00:30:48:23:86:0E
inet addr:172.31.50.124 Bcast:172.31.50.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
Interrupt:17 Base address:0x4400 Memory:fc321000-fc321038

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:431850 errors:0 dropped:0 overruns:0 frame:0
TX packets:431850 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:51640582 (49.2 Mb) TX bytes:51640582 (49.2 Mb)
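
One thing worth checking in the two listings above is whether any address is up on both nodes at once. A minimal sketch, assuming each node's `ifconfig -a` output has been saved to a file (the file names in the usage note are hypothetical) and that it uses the old `inet addr:` format shown above:

```shell
# Print every IPv4 address found in old-style `ifconfig -a` output read from stdin.
list_inet_addrs() {
    sed -n 's/.*inet addr:\([0-9.]*\).*/\1/p'
}

# Print the non-loopback addresses that appear in both saved ifconfig outputs.
# $1 and $2 are files holding `ifconfig -a` output, one per node.
shared_addrs() {
    {
        list_inet_addrs < "$1" | sort -u
        list_inet_addrs < "$2" | sort -u
    } | grep -v '^127\.' | sort | uniq -d
}
```

Usage would be something like `shared_addrs /tmp/seacliff.ifconfig /tmp/augusta.ifconfig`. For the two listings posted above, 172.31.50.124 is configured as eth0:2 on both seacliff and augusta, so that is the address this should report.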
Balaji N
Honored Contributor

Re: cannot start service guard

hi

you have too many ip addresses.

are you running that many packages? btw, even if they are the ip addresses of packages, they get activated only when the cluster and package come up, right?

and can you show the relevant snippets from your cluster configuration file?
-balaji
J.D._3
Frequent Advisor

Re: cannot start service guard

Balaji,
Here are the HB configurations. At the moment I just have one physical card active. I ran out of ports on the switch for another card.

augusta:

DEVICE=eth0
BOOTPROTO=static
ONBOOT=yes
BROADCAST=172.31.50.255
NETWORK=172.31.50.0
NETMASK=255.255.255.0
IPADDR=172.31.50.208
USERCTL=false

seacliff:

DEVICE=eth0
ONBOOT=yes
BOOTPROTO=static
BROADCAST=172.31.50.255
NETWORK=172.31.50.0
NETMASK=255.255.255.0
IPADDR=172.31.50.233

In each of the 5 package config files, I have:
SUBNET 172.31.50.0

On the cluster config file:

NODE_NAME augusta
NETWORK_INTERFACE eth0
HEARTBEAT_IP 172.31.50.208

NODE_NAME seacliff
NETWORK_INTERFACE eth0
HEARTBEAT_IP 172.31.50.233
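
To compare these entries against what is actually configured on each interface, the node/heartbeat pairs can be pulled out of the cluster ascii file. A minimal sketch, assuming the file uses the `NODE_NAME` / `HEARTBEAT_IP` keyword layout shown above; the function name is made up for illustration:

```shell
# Read a cluster ascii file on stdin and print "node heartbeat_ip" pairs.
# Each HEARTBEAT_IP is attributed to the most recent NODE_NAME seen.
heartbeat_ips() {
    awk '
        $1 == "NODE_NAME"    { node = $2 }
        $1 == "HEARTBEAT_IP" { print node, $2 }
    '
}
```

Run as `heartbeat_ips < cluster.ascii` (file name hypothetical); each printed address can then be checked against the eth0 address in that node's ifconfig output.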
J.D._3
Frequent Advisor

Re: cannot start service guard

Yes, they are 5 virtual IPs and are supposed to be brought up when the packages come up. 2 packages run on augusta and the others on the other node. I just brought them back up via ifconfig to see if that has anything to do with the problem. I should have mentioned that before, or taken them down before posting the reply.
Balaji N
Honored Contributor

Re: cannot start service guard

is the quorum server up?
-balaji
J.D._3
Frequent Advisor

Re: cannot start service guard

Yes the quorum server is up and I did a ps -ef |grep qs and saw 2 processes (qsc -authfile) running. It is also entered in the /etc/inittab file to respawn.

I found this in the /usr/local/qs/log.

Jul 17 23:34:37:0:Request for lock /sg/cms_cluster succeeded. New lock owners: seacliff,augusta
Jul 18 00:20:26:0:Request for lock /sg/cms_cluster succeeded. New lock owners: seacliff,augusta
Jul 18 00:26:07:0:Request for lock /sg/cms_cluster succeeded. New lock owners: seacliff,augusta
Jul 18 01:51:08:0:Request for lock /sg/cms_cluster succeeded. New lock owners: augusta
[root@qsla1 log]# pwd
/usr/local/qs/log
[root@qsla1 log]#

The last entry for a lock request only has augusta as an owner. Any ideas?
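
To keep an eye on which nodes the quorum server last granted the lock to, the owner list can be extracted from the log. A minimal sketch, assuming each entry sits on a single line in the real log file (the wrapping above looks like a terminal paste artifact) and contains the text `New lock owners:` as shown; the function name is made up for illustration:

```shell
# Read quorum server log lines on stdin and print the owner list
# from the most recent "New lock owners:" entry.
last_lock_owners() {
    sed -n 's/.*New lock owners: *//p' | tail -n 1
}
```

Run on the quorum server against whichever file under /usr/local/qs/log holds these entries, for example `last_lock_owners < the_log_file` (the file name is not given in this thread).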
Balaji N
Honored Contributor

Re: cannot start service guard

is cmcld running on seacliff?

btw, which is your quorum server? augusta?

-balaji
J.D._3
Frequent Advisor

Re: cannot start service guard

Balaji,

The quorum server is qs1. No, there are no cluster daemons up on that node. I had the AUTOSTART_CMCLD=1 parameter set, but none came up after the reboot.

Balaji N
Honored Contributor

Re: cannot start service guard

can you try starting it manually and see if that helps?

-balaji
J.D._3
Frequent Advisor

Re: cannot start service guard

I tried starting it manually with /etc/init.d/cmcluster.init start and got the message below.

File /usr/local/cmcluster/conf/cmresmond_config.xml does not exist
udp 0 0 0.0.0.0:5302 0.0.0.0:*
/etc/init.d/cmcluster.init: rm: command not found
cmrunnode : Unable to determine the nodes on the current cluster
cmrunnode : Either no cluster configuration file exists, or the file is corrupted, or cmclconfd is unable to run
Unable to open communications to configuration daemon: Connection refused
Unable to connect to configuration database.
Unable to open communications to configuration daemon: No such file or directory
ERROR: Unable to join cluster
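
The output above shows only a UDP socket on port 5302, while the commands report "Connection refused", which is the error a TCP connect returns when nothing is listening. A minimal sketch for checking this, assuming cmclconfd is also expected to accept TCP connections on 5302 (an assumption, not confirmed anywhere in this thread) and that `netstat -an` prints the Linux net-tools column layout; the function name is made up for illustration:

```shell
# Read `netstat -an`-style lines on stdin; succeed if a tcp socket is in
# LISTEN state on the port given as $1, fail otherwise.
has_tcp_listener() {
    awk -v port="$1" '
        $1 ~ /^tcp/ && $NF == "LISTEN" {
            # Local address is column 4; the port is the part after the last colon.
            n = split($4, parts, ":")
            if (parts[n] == port) found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}
```

For example, `netstat -an | has_tcp_listener 5302 && echo "tcp listener present" || echo "no tcp listener"`. If no listener shows up, whatever is supposed to launch cmclconfd on that port would be the next thing to look at.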
Balaji N
Honored Contributor

Re: cannot start service guard

rm not found?

any chance your files are not in order, or the path settings are incorrect?
-balaji
J.D._3
Frequent Advisor

Re: cannot start service guard

Balaji,

The files seem to be in order. I'm not sure why it can't find rm. The cmcluster.init script is untouched. It uses the absolute path for rm, which is /bin/rm; it removes the /dev/deadman file with /bin/rm -rf. I verified rm from the command line, both with the absolute path and with just rm, and it's in root's search path. I tried to remove seacliff from the cluster configuration, but I couldn't run cmgetconf or cmquerycl to create a new ascii file. The cluster would not run from either node because it gets hung up trying to probe seacliff, which the other node couldn't do. Are there other options for me to try? Thanks again.
Balaji N
Honored Contributor

Re: cannot start service guard

sorry, I have tried everything my limited knowledge allows. the best solution now would be to get in touch with HP support.
-balaji
J.D._3
Frequent Advisor

Re: cannot start service guard

Balaji,

Thanks for all your help. We didn't find the root cause of the problem, but I did find that there was a function being called that was also already being called from the SERVICE_CMD parameter. Here's what I did to get it started again.

I rebooted both servers and found that neither could join the cluster anymore. Both were now getting the same message as seacliff.

- Copied the /usr/local/cmcluster directory to /usr/local/cmcluster.old on one node and deleted it on the other node.
- Uninstalled Service Guard on both nodes.
- Reinstalled Service Guard on both nodes.
- Kept Quorum server as configured.
- Verified that cmviewcl on both nodes reported that SG was not configured.
- Recreated the cluster ascii file and plugged in the entries from the old file.
- Recreated the pkg config file and plugged in the entries from the old file.
- cmcheckconf & cmapplyconf
- It's now back and running.

Thanks again for everyone's input.
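
For anyone repeating the last few steps above, the check/apply/start sequence can be written down as a dry run first. A minimal sketch that only prints the commands rather than running them; the function name and file names are hypothetical, and passing one `-P` per package file is an assumption to confirm against the man pages for your Serviceguard version:

```shell
# Print (do not run) the commands to verify and apply a cluster ascii file
# plus any number of package config files, then start and view the cluster.
# $1 is the cluster ascii file; remaining arguments are package config files.
sg_reapply_plan() {
    cluster_file=$1
    shift
    pkg_args=""
    for pkg in "$@"; do
        pkg_args="$pkg_args -P $pkg"
    done
    echo "cmcheckconf -v -C $cluster_file$pkg_args"
    echo "cmapplyconf -v -C $cluster_file$pkg_args"
    echo "cmruncl -v"
    echo "cmviewcl -v"
}
```

Calling `sg_reapply_plan cluster.ascii pkg1.conf` prints the four commands in order, so they can be reviewed and then run by hand as root from one node, stopping at the first one that reports an error.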