Server Management - Systems Insight Manager

To many "Cluster Monitor Status Change" events !!!

 
Ross Humphryes
Frequent Advisor

Hi Folks

We have recently updated our HP SIM server from 4.1 to 4.2 SP2 and updated all our cluster nodes to HP Insight Management Agents for Windows 2000/Server 2003 7.30.0.0.

We now get numerous events from the cluster monitor agent, such as the following:

Event Name: Cluster Monitor Status Change - Node
Event originator: server01
Event Severity: Critical
Event received: 17-Oct-2005, 14:54:59
Event description: Node

From the following sources:

CMX TEXT: Disk Resource:
CMX TEXT: CPU Resource:
CMX TEXT: System Resource:

Now, I understand that the DISK and CPU sources are thresholds which we can adjust, but what are the System events?

The thresholds we have are the defaults, and 95% of the time when I drill down into SIM to see what is causing the event, the system shows normal. This leads me to think that either the event was generated in error or I am not quick enough to catch the condition.

Some of the events are genuine but most seem to be malfunctioning and I would like to fix these or disable them.

1. Any ideas?
2. Any ideas on how to disable these without
making the cluster agent inactive?
3. Any ideas why we didn't see these events
prior to the upgrade?


regards
Ross
Get in my belly
cindy schlener
New Member

Re: To many "Cluster Monitor Status Change" events !!!

Hi Ross. I'm the Cluster Monitor engineer and unfortunately I'm not sure why you're getting all those events since upgrading. However, it may have to do with the HP Insight Management Agents. If you never had these events before but now you do, it's possible that the SNMP agents that allow Cluster Monitor to get at the information were not installed on the cluster nodes before the upgrade.
We get the CPU and Disk resources from one SNMP agent and we get the System health from another SNMP agent. The system event comes from the system health value which "should" be the same value as if you went to that cluster node's home page at http://ipaddress:2301.

If you want to see the actual disk and CPU values that we have for that particular cluster node, in HPSIM, click on the cluster link in either the system or cluster list. You can run the All Clusters query and then click on one of the MSCS clusters and that will bring up Cluster Monitor. From that point, you can expand the cluster tree to show the cluster nodes and the Resources (such as MSCS, CPU, DISK and SYSTEM). You'll see the actual CPU and DISK values along with the threshold values (that are then compared with the actual values and an event is created if the particular threshold is exceeded).
You can change those threshold values (and hence stop getting some of the events) by going under Options -> Cluster Monitor -> Node Resource Setting and changing the threshold values for CPU and disk.
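
The comparison described above can be sketched roughly as follows. This is only an illustration and not Cluster Monitor's actual code: the function names, the two-level minor/major thresholds, the starting state, and the sample percentages are all assumptions made for the example. It also shows how a short spike can raise a status change event and then clear again before anyone drills down to look.

```python
def threshold_status(actual_pct, minor_pct, major_pct):
    """Classify one polled value against two (hypothetical) thresholds."""
    if actual_pct > major_pct:
        return "major"
    if actual_pct > minor_pct:
        return "minor"
    return "normal"


def status_change_events(samples, minor_pct, major_pct):
    """Return (old, new) status pairs, one for each change between polls."""
    events = []
    previous = "normal"  # assumed starting state
    for actual_pct in samples:
        current = threshold_status(actual_pct, minor_pct, major_pct)
        if current != previous:
            events.append((previous, current))
        previous = current
    return events


# Four consecutive polls of a made-up CPU utilisation value.
print(status_change_events([50, 85, 96, 40], minor_pct=80, major_pct=95))
```

In this made-up run the last poll is back to normal, so the GUI would show green even though three status change events were logged.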

Unfortunately, a critical status usually means a cluster node was down when Cluster Monitor did its polling. You can decrease how many events you get by increasing the polling cycle for all the resources. Go to Options -> Cluster Monitor -> Cluster Resource Settings (deals with the MSCS resource) and Node Resource Settings (deals with the CPU, disk and system resources) to change the polling rates (it's measured in minutes and the default is 5 minutes). For Node Resource Settings, make sure you use the value of ALL for the cluster choice. Then for each resource, change the polling value. That should help you out somewhat.
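
As a rough back-of-the-envelope illustration of why a longer polling cycle cuts down the event count: each poll is one opportunity for a status change to be noticed, so the number of polls per day is an upper bound on status checks per resource. The helper name below is hypothetical, and the 15 and 60 minute values are just example settings next to the 5 minute default.

```python
MINUTES_PER_DAY = 24 * 60


def polls_per_day(interval_minutes):
    """Number of polling cycles that fit in one day for a given interval."""
    if interval_minutes <= 0:
        raise ValueError("polling interval must be a positive number of minutes")
    return MINUTES_PER_DAY // interval_minutes


for interval in (5, 15, 60):
    print(f"{interval:>2} min interval -> {polls_per_day(interval)} polls per day")
```

Note that a longer interval does not remove the underlying condition; it only means fewer samples are taken, so brief spikes are less likely to be caught.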

Unfortunately, you cannot disable these events. Longer polling intervals should help though.
Cindy (cindy.schlener@hp.com)

Ross Humphryes
Frequent Advisor

Re: To many "Cluster Monitor Status Change" events !!!

Hi Cindy

First of all, thanks very much for taking the time to respond to my issue. Your information is appreciated and very useful, but it does not yet solve the problem.

Perhaps we can discuss the problem a little further.

I have already adjusted the polling interval to 60 minutes and this has reduced the number of alerts, as expected.

After receiving an alert I go to investigate. The cluster links all appear to work fine and everything is normally green.
However, when I try to attach to the system management homepage of either of the nodes things are not quite right, no hardware data is returned at all.

The actual URL to the homepage is HTTPS://ipaddress:2381/

When we jump to the System Management Homepage (SMH) of other servers we normally get the expected page of hardware data, but on some systems no data is returned. By this I mean we appear to be connected to the SMH but no hardware information is listed at all.

I am wondering if there is a link between the Cluster Monitor issue and the inability to get data from the SMH.

Kind regards
Ross
Get in my belly
Graham Land
Regular Advisor

Re: To many "Cluster Monitor Status Change" events !!!

Hi Ross,
Just to let you know that you're not alone. We're currently deploying PSP7.30 to our servers and using HPSIM 4.2 SP2 and I'm seeing these same events.
I logged a call with HP but didn't really get anywhere. They suggested that when everything is upgraded it may resolve itself??
I'll keep a watch on this post.
Graham.
cindy schlener
New Member

Re: To many "Cluster Monitor Status Change" events !!!

During cluster identification, we determine whether or not CPU, disk, system and MSCS resources are available by actually doing an SNMP get to the appropriate SNMP attribute. If we can't get the information, then for that cluster or cluster node, you don't get the resource (disk, CPU, system). So in cluster identification it was determined that the agents were up and we were able to get at the information. Then at some point in time, that information became unavailable. It's possible that the SNMP service is down so that's why the state of the event is critical. If you want to you can always delete the cluster and nodes and then rerun discovery and then see what resources are available to your cluster (if SNMP is down and has to stay down then this will prevent the events from showing up).

If the events are warning or major level then you're dealing with a threshold issue with disk or CPU.
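
To make the logic above a bit more concrete, here is a minimal sketch. It is not the real Cluster Monitor code and does not use a real SNMP library: `snmp_get` is a hypothetical stand-in for whatever performs the SNMP get, the resource names are just labels, and treating "minor" the same as warning/major is an assumption.

```python
from typing import Callable, List, Optional

# Labels for the four resources mentioned above (not real SNMP attribute names).
RESOURCES = ("cpu", "disk", "system", "mscs")


def discover_resources(snmp_get: Callable[[str], Optional[int]]) -> List[str]:
    """Keep only the resources whose (hypothetical) SNMP get returned a value."""
    return [name for name in RESOURCES if snmp_get(name) is not None]


def likely_cause(severity: str) -> str:
    """Map an event severity to the probable cause described in this post."""
    severity = severity.lower()
    if severity == "critical":
        return "node or SNMP service unreachable at poll time"
    if severity in ("warning", "minor", "major"):  # "minor" is an assumption
        return "disk or CPU threshold exceeded"
    return "unknown"


# A fake agent that only answers for CPU and disk.
fake_agent = {"cpu": 12, "disk": 40}
print(discover_resources(fake_agent.get))
print(likely_cause("Critical"))
```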
Cindy
Ross Humphryes
Frequent Advisor

Re: To many "Cluster Monitor Status Change" events !!!

Hi Cindy

Thanks very much for your reply.

I'll try to address each point you raised.
The setup of our SNMP services is as follows:
* The "SNMP Service" service on all our
servers is set to automatic and is up
and running. Here we have the public
community set and trap destination of
the SIM server.
* The "SNMP Trap Service" service is set
to manual and is stopped.

* Where we have clusters we have no problem
seeing the cluster links.
* When we get a critical alert we bring
up the MSCS resource GUI but we see
green on CPU, SYSTEM and DISK, which
is odd as this does not correspond
to the critical alerts from these
systems.
* However, we also get valid cluster
status alerts and these are normally
minor or major level and these match up
to thresholds that have been breached.

* As for the System Management Homepage data
we get: on most systems we see a full
list of hardware inventory, which we
expect, but on some systems we see no
hardware inventory and I don't seem to
be able to figure out the difference
between the systems that show good info and
the systems that show no info.

Cindy, would it be possible for you to give me a list of prerequisites for our Insight agents? Any SNMP requirements/recommendations would be useful too. This will help me establish whether my SNMP agents and Insight agents are set up correctly.

It seems that a number of customers are experiencing this problem and don't have any clue as to why, so if we get any resolution I shall post it here.

Any further help would be appreciated.

Regards
Ross
Get in my belly
Ross Humphryes
Frequent Advisor

Re: To many "Cluster Monitor Status Change" events !!!

Hello again Cindy

Just to confirm... it appears that my two problems are associated with each other.

Each of the cluster nodes that I get my "bogus" critical Cluster Monitor Status alerts from also has the problem where I don't see any hardware inventory if I connect to its System Management Homepage. The MSCS GUI, however, looks just fine...

regards
Ross
Get in my belly
David Branca
New Member

Re: To many "Cluster Monitor Status Change" events !!!

Same problem here but we are running SIM 5.0. The system management pages work fine for the servers experiencing the problems, but we can't get any detail on the clusters.

The cluster status never actually changes on the server, but SIM sees it differently.

Any ideas?
cindy schlener
New Member

Re: To many "Cluster Monitor Status Change" events !!!

Hi folks. This is a two-part answer that, I hope, will answer your questions. Also, thanks to Ross for testing out a solution to the agent issue.

--------- Agent Issue ---------
Here is a message from a manager in the agent group on how to fix the agent issue on the cluster nodes.
"There is an issue with the System Management Homepage and clusters in that it does not show up on the shared cluster-IP after a failover but in that case you will not see the homepage at all. If what they see is the System Management Homepage but it is basically blank with little or no data items populated then it is probably Agent/SNMP related. A default install of the PSP will install all the necessary agents and drivers - however if they don't want to do a default install, they probably want to install at least the server agent, the foundation agent, the NIC agent, and the storage agent. There is also what is commonly called the "health" driver, it normally is a requirement for the server agent to work properly, there may be several of these - if they know which one for the system they are running they will need to install it also. Again, a default install of the PSP usually insures that all the right software gets landed.

On top of that if they are running Windows 2003, the default install of the OS usually does not land SNMP unless they are using an assisted install from the Proliant Smart Start CD. They probably want to make sure that the SNMP service is installed and running. The agent documentation should list the proper SNMP configurations necessary, but if they are using HPSIM they can try and run "Repair or Configure Agent Settings" or if they have systems that appear to be working they can use the Replicate Agent Settings in HPSIM against the systems that are not working. These tasks are usually helpful in fixing broken agent settings - including SNMP settings."

Ross did the following, which solved the agent problem:
1) Deleted the devices from SIM
2) Reinstalled the Proliant Support Pack using ver 7.40 and then rebooted.
3) Repaired agent

--------- Cluster Status Issue ---------

Cluster Status (as shown in the cluster lists) is made up of the following information: the cluster status shown in the Cluster Monitor page, disk and CPU threshold events, and system events. Prior to v5.0, the Cluster Monitor showed the cluster along with the cluster and node resources for that cluster. The cluster status was the worst-case status of the cluster (MSCS) and node (disk, CPU, system) resources. So it was pretty easy to see how the cluster status was computed.

In v5.0 HPSIM, the cluster status is still computed that way, but it's not as easy to see where it comes from. Now the Cluster Monitor tool is actually the old MSCS cluster resource. To determine what the values are for the disk, CPU and system resources, you have to go to the event lists for that cluster and look at the Cluster Monitor events. If the status in the Cluster Monitor page is green but there is a disk threshold event that is major, then the cluster status shown in the cluster lists is orange (major).
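
The worst-case rule described above can be sketched in a few lines. This is an illustration only, not HPSIM's actual code: the function names are made up, and the severity ordering normal < minor < major < critical is an assumption.

```python
# Assumed ordering, from least to most severe.
SEVERITY_ORDER = ("normal", "minor", "major", "critical")


def worst_status(statuses):
    """Return the most severe status from a non-empty list."""
    return max(statuses, key=SEVERITY_ORDER.index)


def cluster_list_status(monitor_status, event_statuses):
    """Combine the Cluster Monitor page status with Cluster Monitor event severities."""
    return worst_status([monitor_status, *event_statuses])


# Cluster Monitor page is green, but there is a major disk threshold event.
print(cluster_list_status("normal", ["major"]))  # prints "major"
```

This is why the cluster list can show orange while the Cluster Monitor page itself is green.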

I hope this answers the questions asked here.
Cindy
Max Maklin_2
New Member

Re: To many "Cluster Monitor Status Change" events !!!

To Cindy....

We are running 7.30A agents on our MSCS cluster, with a SIM 5.0 management station. The cluster status shows the cluster in the minor/yellow state, yet all underlying statuses are normal/green. This includes the status in Cluster Monitor (and all resources in the tabs in Cluster Monitor). Under system events for the cluster itself there were events about crossing thresholds, but they have been cleared. We have reinstalled the agents and re-discovered the nodes, but the state is always minor. Any more ideas?