SecondaryServerMonitoringDaemon cannot start on the compute node

rkrassow · 10-22-2015

Hi all,

I cannot enable CMU monitoring in my cluster. After I start the monitoring engine on the CMU server, the SecondaryServerMonitoringDaemon process does not stay up on the compute node: it crashes after 5 seconds. As far as I understand, CMU tries to restart the process every 30 seconds.

-> ssh keys are distributed for user root

-> firewall is currently disabled

The installed versions and the logs are provided below. Can somebody give me a hint on how to proceed here?

Thank you in advance, Rostislaw

PS: where can I open a support ticket for CMU?

Environment - CMU Server:
[root@asv11slt1 log]# uname -a
Linux asv11slt1 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
[root@asv11slt1 log]# java -version
java version "1.7.0_75"
OpenJDK Runtime Environment (rhel-2.5.4.2.el7_0-x86_64 u75-b13)
OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode)
[root@asv11slt1 log]# rpm -qa | grep cmu
cmu-7.3.2-1.x86_64

Environment - compute node:
[root@asv12slt1 log]# uname -a
Linux asv12slt1 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
[root@asv12slt1 log]# java -version
java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
[root@asv12slt1 log]# rpm -qa | grep cmu
cmu_cn-7.3.2-1.x86_64

# CMU Server log
[root@asv11slt1 log]# tail MainMonitoringDaemon_asv11slt1.log
[22-Oct-2015_16:09:29] [CMUSecReelector   ] Electing new SEC (asv12slt1 is no longer SEC for NE 0) MonitSrMainRoutine
[22-Oct-2015_16:09:30] [CMUSecReelector   ] spawning asv12slt1 as new SEC for NE 0 MonitSrMainRoutine
[22-Oct-2015_16:09:59] [CMUSecReelector   ] Electing new SEC (asv12slt1 is no longer SEC for NE 0) MonitSrMainRoutine
[22-Oct-2015_16:10:00] [CMUSecReelector   ] spawning asv12slt1 as new SEC for NE 0 MonitSrMainRoutine
[22-Oct-2015_16:10:29] [CMUSecReelector   ] Electing new SEC (asv12slt1 is no longer SEC for NE 0) MonitSrMainRoutine
[22-Oct-2015_16:10:30] [CMUSecReelector   ] spawning asv12slt1 as new SEC for NE 0 MonitSrMainRoutine
[22-Oct-2015_16:10:59] [CMUSecReelector   ] Electing new SEC (asv12slt1 is no longer SEC for NE 0) MonitSrMainRoutine
[22-Oct-2015_16:11:00] [CMUSecReelector   ] spawning asv12slt1 as new SEC for NE 0 MonitSrMainRoutine
[22-Oct-2015_16:11:29] [CMUSecReelector   ] Electing new SEC (asv12slt1 is no longer SEC for NE 0) MonitSrMainRoutine
[22-Oct-2015_16:11:30] [CMUSecReelector   ] spawning asv12slt1 as new SEC for NE 0 MonitSrMainRoutine

# compute node logs
[root@asv12slt1 log]# cat SecondaryServerMonitoring_asv12slt1.log
[22-Oct-2015_16:10:04] [CMUstartup        ] thread test [START]...
[22-Oct-2015_16:10:04] [CMUstartup        ] thread test... [STOP]
[22-Oct-2015_16:10:04] [CMUFileLockTools ] mypid is 19306 CMUGetMonitoringDaemonLockFile
[22-Oct-2015_16:10:04] [CMUstartup        ] monitoring synchro is on
[22-Oct-2015_16:10:04] [CMUstartup        ] monitoring memlock is off
[22-Oct-2015_16:10:04] [CMUstartup        ] monitoring realtime priority parameter is 0
[22-Oct-2015_16:10:04] [CMUFileLockTools ] mypid is 19310 CMUGetMonitoringDaemonLockFile
[22-Oct-2015_16:10:04] [CMUFileLockTools ] killing process 19306 : CMUKillDaemon
[22-Oct-2015_16:10:09] [CMUslaveListener ] Halt single daemon msg received, exiting program MonitSlActOnMessageReceived
[22-Oct-2015_16:10:09] [CMUPthreadTools   ] Fatal, thread_cancel failed could not find thread CMUThreadCancel
[22-Oct-2015_16:10:09] [CMUPthreadTools   ] [Fatal] Error while trying to kill thread MonitRsKillThread
[22-Oct-2015_16:10:09] [Shutdown Module   ] Could not kill CS thread HaltMyThreadsAndDie
[22-Oct-2015_16:10:09] [Shutdown Module   ] Stopping now HaltMyThreadsAndDie
[22-Oct-2015_16:10:09] [Shutdown Module   ] ------------ HaltMyThreadsAndDie

[root@asv12slt1 log]# cat SmallMonitoringDaemon_asv12slt1.log
[22-Oct-2015_14:40:11] [CMUstartup        ] Entering checkBigRegexpBuggy
[22-Oct-2015_14:40:11] [CMUstartup        ] size tested is 21000 checkBigRegexpBuggy
[22-Oct-2015_14:40:11] [CMUstartup        ] Entering checkRedhat8Bug
[22-Oct-2015_14:40:11] [CMUstartup        ] Entering checkThreadLibrary
[22-Oct-2015_14:40:11] [CMUstartup        ] thread test [START]...
[22-Oct-2015_14:40:11] [CMUstartup        ] thread test... [STOP]
[22-Oct-2015_14:40:11] cmuconf         =(null)
,cmuconf_compl   =(null)
,cmu_cluster_conf=(null)
,AAFile          =/opt/cmu/etc/ActionAndAlertsFile.txt
,MAFile          =(null)
,debugLevelMMD   =(null)
,debugLevelSEC   =(null)
,debugLevelSMD   =1
,timestep        =5000000
,master_host_ip =172.23.99.34
,sec_ip          =(null)
,incomingSLPort =48557
,outgoingMSRRPort=48560
,outgoingMSHellop=49074
,do_synchro      =1
,do_memlock      =0
,do_realtime     =0
,cmu_mgt_node_ip =172.23.99.16
,host_ip         =172.23.99.34
,nodes_file_path =(null)

[22-Oct-2015_14:40:11] [CMUFileLockTools ] mypid is 1293 CMUGetMonitoringDaemonLockFile
[22-Oct-2015_14:40:11] [CMUstartup        ] monitoring synchro is on
[22-Oct-2015_14:40:11] [CMUstartup        ] monitoring memlock is off
[22-Oct-2015_14:40:11] [CMUstartup        ] monitoring realtime priority parameter is 0
[22-Oct-2015_14:40:11] [CMUFileLockTools ] mypid is 1293 CMUGetMonitoringDaemonLockFile
[22-Oct-2015_14:40:11] [SmallMonitorDaemon] not starting collectl client Main
...
[22-Oct-2015_14:40:11] [(4)CMUResultSender] Could not extract numerical part of <NOK   14
> MonitRsConvertToDouble
[22-Oct-2015_14:40:11] [(4)CMUResultSender] [Warning] Data conversion failed for Action <eth1_MB/s_tx>value was -->NOK 14
<--, fix /opt/cmu/etc/ActionAndAlertsFile.txt
[22-Oct-2015_14:40:11] [CMUslaveListener ] Halt single daemon msg received, exiting program MonitSlActOnMessageReceived
[22-Oct-2015_14:40:11] [CMUPthreadTools   ] [MonitRsKillThread] Thread_join failed <Invalid argument> CMUThreadJoin
[22-Oct-2015_14:40:11] [Shutdown Module   ] Stopping now HaltMyThreadsAndDie
[22-Oct-2015_14:40:11] [Shutdown Module   ] ------------ HaltMyThreadsAndDie

Armugam_Pradeep · 10-26-2015

Hi Rostislaw,

Please increase the monitoring debug levels CMU_MAIN_MONITORING_DEBUG_LEVEL, CMU_SEC_MONITORING_DEBUG_LEVEL, and CMU_SMD_MONITORING_DEBUG_LEVEL to 3 in the /opt/cmu/etc/cmuserver.conf file, then restart monitoring on the head node using the steps below.

# /opt/cmu/tools/cmu_stop_monitoring
(Wait a few minutes.)
# /opt/cmu/tools/cmu_start_monitoring

Please wait a few minutes and then capture the logs below.

From the head node:
                /opt/cmu/log/MainMonitoringDaemon_<server_name>.log
From the compute nodes where SecMD ran during this test:
                /opt/cmu/log/SecondaryServerMonitoring_<server_name>.log
                /opt/cmu/log/SmallMonitoringDaemon_<server_name>.log

Also, please send us the "/opt/cmu/etc/ActionAndAlertsFile.txt" file from the management node, along with the output of "rpm --verify cmu" run on the management node.

Please also run the following command on the management node, where <compute-node> is the secondary server node or any other node:

# time ssh <compute-node> hostname

We have seen cases where monitoring is unable to start on compute nodes because of ssh login delays.
If ssh logins to any of the compute nodes take a long time (i.e., 5 seconds), monitoring fails to start on those nodes.
Incorrect DNS or gateway settings on the nodes are the reason for such ssh login delays.
For more details, please refer to "Section 5.25 Monitoring fails to start on compute nodes due to ssh login delays" in the Insight CMU v7.3.2 Release Notes.
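
To repeat the timing check over several nodes, a small wrapper can help. The following is only a minimal sketch and not part of CMU: the function name check_login_time, the node names in the example, and the use of 5 seconds as the threshold are assumptions for illustration. It measures whole seconds only and does not tell a failed login apart from a fast one.

```shell
# Hypothetical helper, not part of CMU: run a command and report how long it
# took, flagging it as SLOW when it needs <threshold> seconds or more.
# Usage: check_login_time <threshold-seconds> <label> <command...>
check_login_time() {
    local threshold="$1" label="$2"
    shift 2
    local start end elapsed
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    elapsed=$((end - start))
    if [ "$elapsed" -ge "$threshold" ]; then
        echo "SLOW $label ${elapsed}s"
    else
        echo "OK $label ${elapsed}s"
    fi
}

# Example (node names are placeholders; BatchMode keeps ssh from waiting on a
# password prompt if key-based login is not working):
# for node in asv12slt1 asv13slt1; do
#     check_login_time 5 "$node" ssh -o BatchMode=yes "$node" hostname
# done
```

Any node reported as SLOW would be a candidate for checking the DNS and gateway settings mentioned above.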

>PS: where can I open a support ticket for CMU?
Please raise a case by calling your local HP Support Center.

What is the name of the customer? Please also let us know the customer's details.

Regards,
Pradeep Kumar A.

rkrassow · 10-26-2015

Hi Pradeep,

Thank you for the hint to raise the debug level.

I found messages like the following in the MainMonitoringDaemon_asv11slt1.log:

[26-Oct-2015_11:03:25] [CMUResultReceiver ] Received data from <10.250.128.153> instead of <172.23.99.34>
Will send single_halt order to SEC monitoring daemon of node <10.250.128.153> immediately CheckStopCondition

172.23.99.34 is the IP address configured in the management network. This is the IP configured in CMU for the compute node asv12slt1.

10.250.128.153 is the traffic IP address of the same node. This is the IP in the DNS entry for the node asv12slt1.

As a workaround, I changed the node name in CMU to asv12slt1-mgm. Now the CMU IP address and the DNS entry match, and the issue is resolved.
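
For anyone who wants to check other nodes for the same mismatch, here is a rough sketch. It is not a CMU tool: the function names resolve_first_ip and compare_ips are made up for illustration, and the IP configured in CMU has to be supplied by hand. It only compares what the system resolver returns with that value.

```shell
# Hypothetical helper, not part of CMU: print the first address that the
# system resolver (DNS, /etc/hosts, as set in nsswitch.conf) returns for a name.
resolve_first_ip() {
    getent hosts "$1" | awk 'NR == 1 { print $1 }'
}

# Compare the IP configured in CMU for a node with a resolved address.
# Usage: compare_ips <node> <ip-configured-in-CMU> <resolved-ip>
compare_ips() {
    if [ "$2" = "$3" ]; then
        echo "MATCH $1 $2"
    else
        echo "MISMATCH $1 cmu=$2 resolved=$3"
    fi
}

# Example with the node name and management IP from this thread:
# compare_ips asv12slt1 172.23.99.34 "$(resolve_first_ip asv12slt1)"
```

With the original node name this reported a mismatch in my case, because the name resolved to the traffic address rather than the management address.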

However, I still see the data collection errors. I assume I have to adapt the ActionAndAlertsFile to RHEL 7 (e.g. because the network interfaces are no longer named eth0, eth1). I attached the logs from the compute node with those errors.

Or can you provide me with an updated version of this configuration file for RHEL 7?

The project is a proof of concept for an automotive customer in Germany. Please contact me internally at HP if you would like more details.

Many thanks for your support!

Armugam_Pradeep · 10-26-2015

Hi,

This warning, "[Warning] Data conversion failed for Action <eth1_MB/s_tx>value was -->NOK", is expected.

It looks like your compute node NICs do not use legacy NIC names such as eth0 and eth1. I expect they follow the RHEL 7 persistent NIC naming schemes, such as eno1, eno2, ens1, ens3f0, ens3f1, em1, em2, etc.

This RHEL 7 naming scheme affects some of the NIC-related monitoring metrics, such as
"eth0_MB/s_rx", "eth1_MB/s_rx", etc. Please add the appropriate network interface ACTION to the
/opt/cmu/etc/ActionAndAlertsFile.txt file and restart monitoring.

For more details, please refer to Section 5.21 in the Insight CMU v7.3.2 Release Notes.

There is no ActionAndAlertsFile that uses the RHEL 7 naming scheme. Please change the ACTION entries in ActionAndAlertsFile.txt and restart monitoring.
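
To see which legacy names need replacing, something like the following may help. It is only a sketch under assumptions: the function names are made up, and it makes no assumption about the ActionAndAlertsFile.txt syntax beyond searching the text for ethN strings. The edits to the ACTION entries themselves still have to be made by hand.

```shell
# List the network interface names the kernel currently exposes on this node,
# excluding the loopback device.
list_interfaces() {
    ls /sys/class/net | grep -v '^lo$'
}

# Print the distinct legacy ethN names mentioned in the text on stdin, e.g.:
#   extract_eth_names < /opt/cmu/etc/ActionAndAlertsFile.txt
extract_eth_names() {
    grep -o 'eth[0-9][0-9]*' | sort -u
}

# Example, run on a compute node: report each legacy name found in the file
# that is not a present interface.
# for nic in $(extract_eth_names < /opt/cmu/etc/ActionAndAlertsFile.txt); do
#     [ -e "/sys/class/net/$nic" ] || echo "$nic is referenced but not present"
# done
```

The output of list_interfaces shows the names actually in use on the node, which are the ones to put in place of the missing ethN names.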

Hope it helps.

Regards,

Pradeep A.
