10-22-2015 07:41 AM
Hi all,
I cannot enable CMU monitoring in my cluster. After I start the monitoring engine on the CMU server, the SecondaryServerMonitoringDaemon process does not stay up on the compute node: it crashes after 5 seconds. As far as I understand, CMU tries to restart the process every 30 seconds.
-> ssh keys are distributed for user root
-> the firewall is currently disabled
The installed versions and the logs are provided below. Can somebody give me a hint on how to proceed?
Thank you in advance, Rostislaw
PS: Where can I open a support ticket for CMU?
Environment - CMU Server:
[root@asv11slt1 log]# uname -a
Linux asv11slt1 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
[root@asv11slt1 log]# java -version
java version "1.7.0_75"
OpenJDK Runtime Environment (rhel-2.5.4.2.el7_0-x86_64 u75-b13)
OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode)
[root@asv11slt1 log]# rpm -qa | grep cmu
cmu-7.3.2-1.x86_64
Environment - compute node:
[root@asv12slt1 log]# uname -a
Linux asv12slt1 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
[root@asv12slt1 log]# java -version
java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
[root@asv12slt1 log]# rpm -qa | grep cmu
cmu_cn-7.3.2-1.x86_64
# CMU Server log
[root@asv11slt1 log]# tail MainMonitoringDaemon_asv11slt1.log
[22-Oct-2015_16:09:29] [CMUSecReelector ] Electing new SEC (asv12slt1 is no longer SEC for NE 0) MonitSrMainRoutine
[22-Oct-2015_16:09:30] [CMUSecReelector ] spawning asv12slt1 as new SEC for NE 0 MonitSrMainRoutine
[22-Oct-2015_16:09:59] [CMUSecReelector ] Electing new SEC (asv12slt1 is no longer SEC for NE 0) MonitSrMainRoutine
[22-Oct-2015_16:10:00] [CMUSecReelector ] spawning asv12slt1 as new SEC for NE 0 MonitSrMainRoutine
[22-Oct-2015_16:10:29] [CMUSecReelector ] Electing new SEC (asv12slt1 is no longer SEC for NE 0) MonitSrMainRoutine
[22-Oct-2015_16:10:30] [CMUSecReelector ] spawning asv12slt1 as new SEC for NE 0 MonitSrMainRoutine
[22-Oct-2015_16:10:59] [CMUSecReelector ] Electing new SEC (asv12slt1 is no longer SEC for NE 0) MonitSrMainRoutine
[22-Oct-2015_16:11:00] [CMUSecReelector ] spawning asv12slt1 as new SEC for NE 0 MonitSrMainRoutine
[22-Oct-2015_16:11:29] [CMUSecReelector ] Electing new SEC (asv12slt1 is no longer SEC for NE 0) MonitSrMainRoutine
[22-Oct-2015_16:11:30] [CMUSecReelector ] spawning asv12slt1 as new SEC for NE 0 MonitSrMainRoutine
# compute node logs
[root@asv12slt1 log]# cat SecondaryServerMonitoring_asv12slt1.log
[22-Oct-2015_16:10:04] [CMUstartup ] thread test [START]...
[22-Oct-2015_16:10:04] [CMUstartup ] thread test... [STOP]
[22-Oct-2015_16:10:04] [CMUFileLockTools ] mypid is 19306 CMUGetMonitoringDaemonLockFile
[22-Oct-2015_16:10:04] [CMUstartup ] monitoring synchro is on
[22-Oct-2015_16:10:04] [CMUstartup ] monitoring memlock is off
[22-Oct-2015_16:10:04] [CMUstartup ] monitoring realtime priority parameter is 0
[22-Oct-2015_16:10:04] [CMUFileLockTools ] mypid is 19310 CMUGetMonitoringDaemonLockFile
[22-Oct-2015_16:10:04] [CMUFileLockTools ] killing process 19306 : CMUKillDaemon
[22-Oct-2015_16:10:09] [CMUslaveListener ] Halt single daemon msg received, exiting program MonitSlActOnMessageReceived
[22-Oct-2015_16:10:09] [CMUPthreadTools ] Fatal, thread_cancel failed could not find thread CMUThreadCancel
[22-Oct-2015_16:10:09] [CMUPthreadTools ] [Fatal] Error while trying to kill thread MonitRsKillThread
[22-Oct-2015_16:10:09] [Shutdown Module ] Could not kill CS thread HaltMyThreadsAndDie
[22-Oct-2015_16:10:09] [Shutdown Module ] Stopping now HaltMyThreadsAndDie
[22-Oct-2015_16:10:09] [Shutdown Module ] ------------ HaltMyThreadsAndDie
[root@asv12slt1 log]# cat SmallMonitoringDaemon_asv12slt1.log
[22-Oct-2015_14:40:11] [CMUstartup ] Entering checkBigRegexpBuggy
[22-Oct-2015_14:40:11] [CMUstartup ] size tested is 21000 checkBigRegexpBuggy
[22-Oct-2015_14:40:11] [CMUstartup ] Entering checkRedhat8Bug
[22-Oct-2015_14:40:11] [CMUstartup ] Entering checkThreadLibrary
[22-Oct-2015_14:40:11] [CMUstartup ] thread test [START]...
[22-Oct-2015_14:40:11] [CMUstartup ] thread test... [STOP]
[22-Oct-2015_14:40:11] cmuconf =(null)
,cmuconf_compl =(null)
,cmu_cluster_conf=(null)
,AAFile =/opt/cmu/etc/ActionAndAlertsFile.txt
,MAFile =(null)
,debugLevelMMD =(null)
,debugLevelSEC =(null)
,debugLevelSMD =1
,timestep =5000000
,master_host_ip =172.23.99.34
,sec_ip =(null)
,incomingSLPort =48557
,outgoingMSRRPort=48560
,outgoingMSHellop=49074
,do_synchro =1
,do_memlock =0
,do_realtime =0
,cmu_mgt_node_ip =172.23.99.16
,host_ip =172.23.99.34
,nodes_file_path =(null)
[22-Oct-2015_14:40:11] [CMUFileLockTools ] mypid is 1293 CMUGetMonitoringDaemonLockFile
[22-Oct-2015_14:40:11] [CMUstartup ] monitoring synchro is on
[22-Oct-2015_14:40:11] [CMUstartup ] monitoring memlock is off
[22-Oct-2015_14:40:11] [CMUstartup ] monitoring realtime priority parameter is 0
[22-Oct-2015_14:40:11] [CMUFileLockTools ] mypid is 1293 CMUGetMonitoringDaemonLockFile
[22-Oct-2015_14:40:11] [SmallMonitorDaemon] not starting collectl client Main
...
[22-Oct-2015_14:40:11] [(4)CMUResultSender] Could not extract numerical part of <NOK 14
> MonitRsConvertToDouble
[22-Oct-2015_14:40:11] [(4)CMUResultSender] [Warning] Data conversion failed for Action <eth1_MB/s_tx>value was -->NOK 14
<--, fix /opt/cmu/etc/ActionAndAlertsFile.txt
[22-Oct-2015_14:40:11] [CMUslaveListener ] Halt single daemon msg received, exiting program MonitSlActOnMessageReceived
[22-Oct-2015_14:40:11] [CMUPthreadTools ] [MonitRsKillThread] Thread_join failed <Invalid argument> CMUThreadJoin
[22-Oct-2015_14:40:11] [Shutdown Module ] Stopping now HaltMyThreadsAndDie
[22-Oct-2015_14:40:11] [Shutdown Module ] ------------ HaltMyThreadsAndDie
10-26-2015 12:46 AM
Solution
Hi Rostislaw,
Please increase the monitoring debug levels CMU_MAIN_MONITORING_DEBUG_LEVEL, CMU_SEC_MONITORING_DEBUG_LEVEL, and CMU_SMD_MONITORING_DEBUG_LEVEL to 3 in the /opt/cmu/etc/cmuserver.conf file, then restart monitoring on the head node using the steps below.
# /opt/cmu/tools/cmu_stop_monitoring
(wait a few minutes)
# /opt/cmu/tools/cmu_start_monitoring
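To confirm the three debug-level settings before running the stop/start steps above, something like the following could be used. This is only a sketch: the function name `show_debug_levels` is made up for this example, and it assumes nothing about the syntax of cmuserver.conf beyond the setting names appearing in the file.

```shell
#!/bin/bash
# Hypothetical helper: print the lines of a config file that mention the
# three monitoring debug-level settings named in this reply.
# It only greps for the setting names and assumes nothing about file syntax.
show_debug_levels() {
    local conf=${1:-/opt/cmu/etc/cmuserver.conf}
    grep -E 'CMU_(MAIN|SEC|SMD)_MONITORING_DEBUG_LEVEL' "$conf"
}
```

Running `show_debug_levels` on the head node should list three lines; each should show the value 3 once the file has been edited.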
Wait a few minutes, then capture the logs below.
From the head node:
/opt/cmu/log/MainMonitoringDaemon_<server_name>.log
From the compute nodes where SecMD ran during this test:
/opt/cmu/log/SecondaryServerMonitoring_<server_name>.log
/opt/cmu/log/SmallMonitoringDaemon_<server_name>.log
Also, please send us the /opt/cmu/etc/ActionAndAlertsFile.txt file from the management node, together with the output of "rpm --verify cmu" run on the management node.
Please also run the following command on the management node, where <compute-node> is the secondary server node or any other node:
# time ssh <compute-node> hostname
We have seen cases where monitoring is unable to start on compute nodes because of ssh login delays.
If ssh logins to a compute node take a long time (for example, 5 seconds), monitoring fails to start on that node.
Incorrect DNS or gateway settings on the nodes are the reason for such ssh login delays.
For more details, please refer to "Section 5.25 Monitoring fails to start on compute nodes due to ssh login delays" in the Insight CMU v7.3.2 Release Notes.
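The ssh timing check can be repeated across several nodes with a small script. The following is a rough sketch, not a CMU tool: the function names and the `THRESHOLD` variable are invented for this example, and the 5-second default simply mirrors the figure mentioned above.

```shell
#!/bin/bash
# Threshold in whole seconds above which a login is reported as slow.
# The default of 5 is an assumption taken from the figure quoted above.
THRESHOLD=${THRESHOLD:-5}

# Run a command, discard its output, and print the elapsed whole seconds.
elapsed_seconds() {
    local start end
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# Print SLOW if the given duration reaches the threshold, otherwise OK.
classify_delay() {
    if [ "$1" -ge "$THRESHOLD" ]; then
        echo SLOW
    else
        echo OK
    fi
}

# Time "ssh <node> hostname" for each node name given as an argument.
# A failed login is still timed, so check connectivity separately.
check_nodes() {
    local node secs
    for node in "$@"; do
        secs=$(elapsed_seconds ssh -o BatchMode=yes -o ConnectTimeout=10 "$node" hostname)
        echo "$node ${secs}s $(classify_delay "$secs")"
    done
}
```

For example, `check_nodes asv12slt1` run from the management node prints one line per node with the measured time and an OK or SLOW marker.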
>PS: where can I open a support ticket for CMU?
Please raise a case by calling your local HP Support Center.
What is the name of the customer? Please also share the customer's details.
Regards,
Pradeep Kumar A.
10-26-2015 09:32 AM
Re: SecondaryServerMonitoringDaemon cannot start on the compute node
Hi Pradeep,
thank you for the hint to raise the debug level.
I found messages like the following in MainMonitoringDaemon_asv11slt1.log:
[26-Oct-2015_11:03:25] [CMUResultReceiver ] Received data from <10.250.128.153> instead of <172.23.99.34>
Will send single_halt order to SEC monitoring daemon of node <10.250.128.153> immediately CheckStopCondition
172.23.99.34 is the IP address configured in the management network. This is the IP configured in CMU for the compute node asv12slt1.
10.250.128.153 is the traffic IP address of the same node. This is the IP in the DNS entry for the node asv12slt1.
As a workaround, I changed the node name in CMU to asv12slt1-mgm. Now the CMU IP address and the DNS entry match, and the issue is resolved.
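To spot this kind of mismatch on other nodes, the log message can be parsed and the node name checked against the resolver. This is a minimal sketch: the function names are invented for this example, the pattern is based only on the message quoted above, and `getent hosts` is assumed to reflect the same name resolution that CMU uses.

```shell
#!/bin/bash
# Read MainMonitoringDaemon log text on stdin and print one
# "<sender_ip> <expected_ip>" pair for each mismatch message.
extract_ip_mismatches() {
    sed -n 's/.*Received data from <\([^>]*\)> instead of <\([^>]*\)>.*/\1 \2/p'
}

# Print the first address the system resolver returns for a node name.
resolved_ip() {
    getent hosts "$1" | awk '{print $1; exit}'
}

# Hypothetical check: compare the resolver's answer for a node name
# with the IP address that is configured for that node in CMU.
check_node_ip() {
    local name=$1 cmu_ip=$2 dns_ip
    dns_ip=$(resolved_ip "$name")
    if [ "$dns_ip" = "$cmu_ip" ]; then
        echo "$name: OK ($cmu_ip)"
    else
        echo "$name: MISMATCH (CMU has $cmu_ip, resolver returns ${dns_ip:-nothing})"
    fi
}
```

For example, `extract_ip_mismatches < /opt/cmu/log/MainMonitoringDaemon_asv11slt1.log | sort -u` lists the distinct address pairs, and `check_node_ip asv12slt1 172.23.99.34` compares the resolver's answer with the CMU setting.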
However, I still see the data collection errors. I assume I have to adapt the ActionAndAlertsFile to RHEL 7 (e.g., because the network interfaces are no longer named eth0, eth1). I attached the logs from the compute node with those errors.
Or can you provide me with an updated version of this configuration file for RHEL 7?
The project is a proof of concept for an automotive customer in Germany. Please contact me internally at HP if you would like more details.
Many thanks for your support!
10-26-2015 11:49 PM - edited 10-27-2015 01:58 AM
Re: SecondaryServerMonitoringDaemon cannot start on the compute node
Hi,
The warning "[Warning] Data conversion failed for Action <eth1_MB/s_tx>value was -->NOK" is expected.
It looks like the NICs on your compute node do not use the legacy naming scheme (eth0, eth1, etc.). I assume they follow the RHEL 7 persistent naming scheme, with names such as eno1, eno2, ens1, ens3f0, ens3f1, em1, em2, etc.
This RHEL 7 naming scheme affects some of the NIC-related monitoring metrics, such as "eth0_MB/s_rx", "eth1_MB/s_rx", etc. Please add the appropriate network interface ACTION to the /opt/cmu/etc/ActionAndAlertsFile.txt file and restart monitoring.
For more details, please refer to Section 5.21 in the Insight CMU v7.3.2 Release Notes.
There is no ActionAndAlertsFile that uses the RHEL 7 naming scheme. Please change the ACTION entries in ActionAndAlertsFile.txt and restart monitoring.
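Before editing the file, it may help to see which interface names the compute node really has and which lines of the file still refer to ethN. The following is a sketch only: the function names are invented for this example, and it makes no assumption about the ACTION syntax itself, it just searches for the text "eth" followed by a digit.

```shell
#!/bin/bash
# Succeed if the interface name follows the legacy ethN pattern.
is_legacy_nic_name() {
    [[ $1 =~ ^eth[0-9]+$ ]]
}

# List the node's network interfaces (except loopback) from /sys/class/net
# and mark each one as legacy or non-legacy.
list_nics() {
    local path name
    for path in /sys/class/net/*; do
        [ -e "$path" ] || continue
        name=${path##*/}
        [ "$name" = "lo" ] && continue
        if is_legacy_nic_name "$name"; then
            echo "$name legacy"
        else
            echo "$name non-legacy"
        fi
    done
}

# Print, with line numbers, the lines of the actions file that still
# mention an ethN name. The default path is the one given in this thread.
find_eth_actions() {
    local file=${1:-/opt/cmu/etc/ActionAndAlertsFile.txt}
    grep -n 'eth[0-9]' "$file"
}
```

Run `list_nics` on the compute node and `find_eth_actions` on the management node; the lines reported by the second command are the candidates to adjust to the names reported by the first.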
Hope it helps.
Regards,
Pradeep A.