Server Management - Systems Insight Manager

Incorrect Hardware Status

 
Gene Nordahl
Valued Contributor

Incorrect Hardware Status

I've just started testing HP SIM in our lab in preparation for migrating off IM7 and have run into an issue.

In our environment occasionally the HP agents and/or SNMP stop. In IM7 the hardware status becomes Unknown to alert us to an issue with the system. HP SIM on the other hand continues to report the hardware as Normal (green). As another step in my testing, to make sure it wasn't getting a "Normal" status from a source other than SNMP, I pulled a drive on the system with the agents and SNMP stopped. HP SIM still reported it as "Normal".
Has anybody seen this?? This is an issue that would stop us from going to HP SIM. These are W2K SP4 systems running 7.0 agents.

Any help would be appreciated.
Rob Buxton
Honored Contributor

Re: Incorrect Hardware Status

I would have thought the behaviour would be the same.
IM 7 would go from Green to a blue dot and Unknown; SIM would change to UnManaged but the Green Tick would remain. The difference is that there are now two states, Unknown and UnManaged. Unknown indicates SNMP is active but it doesn't recognise the device. UnManaged indicates there is no SNMP communication. In both cases, the stopping of SNMP doesn't generate a Server Out alert; rather, the change is identified in different places.
In both cases it would, I think, be the running of the Device Identification process that picked up the change in state. It may be the Auto-Discovery you'd need to test.
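
As a rough sketch of the two states described above (the names and the function are illustrative only, not anything HP SIM exposes), the distinction at device identification time might be modelled like this:

```python
from enum import Enum


class SystemType(Enum):
    SERVER = "Server"
    UNKNOWN = "Unknown"
    UNMANAGED = "UnManaged"


def identify(snmp_responds: bool, device_recognised: bool) -> SystemType:
    """Model of the system type after device identification runs.

    Assumption: this only restates the description in this post.
    - no SNMP communication at all      -> UnManaged
    - SNMP answers, device unrecognised -> Unknown
    - SNMP answers, device recognised   -> Server
    """
    if not snmp_responds:
        return SystemType.UNMANAGED
    if not device_recognised:
        return SystemType.UNKNOWN
    return SystemType.SERVER
```

Note that this only describes the system type; it says nothing about the hardware status icon, which is the part that stays green.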
Gene Nordahl
Valued Contributor

Re: Incorrect Hardware Status

That is correct. When device identification is run, the System Type does move into an unknown state when the agents are stopped, or an unmanaged state when SNMP is stopped. IM7 would move it to an unknown status when doing an SNMP status poll.

Unfortunately we only run device identification once a day, during off hours, whereas we run hardware status polling every minute. And as you probably know, a lot can go wrong when a system basically goes unmonitored (other than ping) for a day.

This raises a couple of questions from me. Why would HP SIM, when not receiving SNMP from a system it has identified as a server, still consider that server to be in a normal state? Why did this change from IM7?

Thanks for your help on this.
Rob Buxton
Honored Contributor

Re: Incorrect Hardware Status

You can change the Hardware Polling Task to better represent what happened in IM 7.

The IM 7 Hardware Polling Task used SNMP, whereas HPSIM will try a number of protocols: SNMP, Ping, DMI.
You could remove Ping from the List of methods used to contact the Server to try and see if that registered the change.

Gene Nordahl
Valued Contributor

Re: Incorrect Hardware Status

Rob,
We think too much alike. :-)

I actually have tried that. I removed everything but SNMP and the status is still normal. I also removed everything but DMI, HTTP, and ping just to see, and they all had the same result: the server is normal.

Thanks for the ideas.
David Claypool
Honored Contributor

Re: Incorrect Hardware Status

One thing to look at is when you click the system name in the table it goes to a details page. On the right hand side it shows the result for ping, snmp and Insight agents...
Gene Nordahl
Valued Contributor

Re: Incorrect Hardware Status

When you click on the server the details page shows the Agents, SNMP and Ping are all Normal. Of course if you click on the agents you get the 404 error.
Rob Buxton
Honored Contributor

Re: Incorrect Hardware Status

Interesting. I'm not too sure what I've changed over the years with IM but, if I stop SNMP and its dependent services, I get no notifications on either IM 7 or SIM.


Rob Buxton
Honored Contributor

Re: Incorrect Hardware Status

Oops, yes it has. HW has gone from green to blue on IM 7.
David Claypool
Honored Contributor

Re: Incorrect Hardware Status

Error 404 is an HTTP error. However, none of the status polls that are being talked about are HTTP status polls (although the one from the agent seems like it might be because the interface we humans are used to using is the web interface).

Ping is self-explanatory. The SNMP poll is the same poll you would make of SNMP regardless of whether the Insight agents are running or not (basically doing an SNMP get of the system name OID and seeing if you get a response).
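
To make the SNMP poll described above concrete, here is a minimal sketch of an SNMPv1 get of the system name OID (sysName.0, 1.3.6.1.2.1.1.5.0) using only the Python standard library. It is not how the management server implements its poll; the function names, the "public" community string and the 2-second timeout are assumptions for illustration, and any reply at all is treated as "SNMP is alive".

```python
import socket

SYS_NAME_OID = "1.3.6.1.2.1.1.5.0"  # sysName.0 from MIB-II


def tlv(tag: int, payload: bytes) -> bytes:
    """BER tag-length-value using the short length form only."""
    assert len(payload) < 128, "sketch handles short-form lengths only"
    return bytes([tag, len(payload)]) + payload


def encode_oid(oid: str) -> bytes:
    """BER-encode a dotted OID string."""
    parts = [int(p) for p in oid.split(".")]
    body = bytes([parts[0] * 40 + parts[1]])  # first two arcs share a byte
    for n in parts[2:]:
        chunk = [n & 0x7F]                    # base-128, low group first
        n >>= 7
        while n:
            chunk.append((n & 0x7F) | 0x80)   # high bit marks continuation
            n >>= 7
        body += bytes(reversed(chunk))
    return tlv(0x06, body)


def build_get_request(community: str, oid: str, request_id: int = 1) -> bytes:
    """Build an SNMPv1 GetRequest for a single OID."""
    assert 0 <= request_id < 128, "sketch encodes a one-byte request id"
    varbind = tlv(0x30, encode_oid(oid) + tlv(0x05, b""))  # OID + NULL value
    pdu = tlv(
        0xA0,                                 # GetRequest-PDU
        tlv(0x02, bytes([request_id]))        # request-id
        + tlv(0x02, b"\x00")                  # error-status
        + tlv(0x02, b"\x00")                  # error-index
        + tlv(0x30, varbind),                 # variable bindings
    )
    version = tlv(0x02, b"\x00")              # 0 means SNMPv1
    return tlv(0x30, version + tlv(0x04, community.encode("ascii")) + pdu)


def snmp_responds(host: str, community: str = "public", timeout: float = 2.0) -> bool:
    """Return True if the host sends any reply to a sysName get."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_get_request(community, SYS_NAME_OID), (host, 161))
            sock.recvfrom(4096)
            return True
        except OSError:                       # timeout, port unreachable, etc.
            return False
```

A call such as snmp_responds("myserver") should come back False once the SNMP service is stopped, even though the box still answers ping, which is exactly the condition being discussed in this thread.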

The poll identified as the agents is a neat little thing created 10 years ago which is an SNMP get of the OID of the "ProLiant Status Array." You may have noticed that it's really only for ProLiants that you get an indication of "minor." That's because within that array are indications of things like a failed device in a redundant subsystem, such as a drive in an array. This is unique to the Insight agents. As we get real Insight agents for some of the other HP systems (already available for Integrity SuperDome on HP-UX), they will have that ability also. Over the next several months, we'll be getting that for the rest of the "rx" Itanium line as well; it arrives a little bit at a time, as newly announced systems get it first and then it trickles down to previously announced platforms like the rx2600. This is one of the efforts sometimes referred to as the "ProLiantization" of Integrity.

hpSIM has the ability to do HTTP polling, but we don't have agents with the ability to be polled via HTTP yet. This is one of the things we're pursuing that will eventually allow you to eliminate SNMP if you wish to do so in the future (an item high on the customer wish list because of the perceived lack of security of SNMP). We won't eliminate the possibility of using SNMP altogether because that is what so many third party products that provide an interface to the Insight agents rely on to get their information.
Eric_76
Regular Advisor

Re: Incorrect Hardware Status

hey Rob
are you able to replicate Gene's issue?
Rob Buxton
Honored Contributor

Re: Incorrect Hardware Status

Eric,
Yes, I see the problem Gene is reporting. There's a change in behaviour, and I thought it would be possible to rejig the polling tasks in HPSIM to replicate the old behaviour.
I shut down SNMP plus the Web Agents on a Server, and the HW Status changed within about 5 minutes on CIM 7. On HPSIM the HW status never changed, despite my removing all but SNMP from the polling tasks; the system just went to UnManaged after the device identification task ran.

I need to reread what David has said, but I'm not sure how you can address the problem Gene has raised. If SNMP stops on a Server there's no easy way to pick it up with HPSIM.
Rob Buxton
Honored Contributor

Re: Incorrect Hardware Status

Gene,
I've not tried this, but you might want to follow it up.
There are two Hardware Polling tasks, Server and Non-Server. You'd need to remove ping etc. from both because, when SNMP stops on the Server, the Server moves to the non-Server list (as it is no longer recognised), where ping still reports it as okay.
Mike Angley
Advisor

Re: Incorrect Hardware Status

I do not know if it would help or not, but I run a "cold start" trap that tells me when the agents have restarted, such as in a reboot. This alerts me when servers that do a daily reboot have come back online. Perhaps you can run something like this with an automatic task (batch file) that attempts to restart SNMP and the agents every 5 minutes or so. That way, if the restart was successful, you would get the cold start trap, alerting you that the server needs to be checked.
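
A minimal sketch of that kind of scheduled check, written in Python rather than as a batch file. It assumes the Windows `sc` and `net` commands are available on the server (on W2K, `sc` may need to come from the Resource Kit) and that the SNMP service is registered under the name "SNMP"; the function names are made up for illustration, and the Insight agent service names would have to be filled in for your own build.

```python
import subprocess


def is_running(sc_output: str) -> bool:
    """Return True if `sc query <service>` output reports STATE ... RUNNING."""
    for line in sc_output.splitlines():
        if line.strip().startswith("STATE"):
            return "RUNNING" in line
    return False  # no STATE line: service missing or query failed


def ensure_running(service: str) -> bool:
    """Start the service if it is stopped.

    Returns True if a start was attempted, False if it was already running.
    A successful start of SNMP should make the agent send its cold start trap.
    """
    query = subprocess.run(["sc", "query", service], capture_output=True, text=True)
    if is_running(query.stdout):
        return False
    subprocess.run(["net", "start", service], capture_output=True, text=True)
    return True
```

Scheduled every 5 minutes or so, calling ensure_running("SNMP") does nothing while the service is healthy, and a cold start trap arriving outside the normal reboot window then tells you the service had stopped.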
Gene Nordahl
Valued Contributor

Re: Incorrect Hardware Status

Rob,
I gave that a shot, but still no luck. All my hardware polling tasks are now only polling via snmp. Yet system status is still normal with snmp stopped. I even ran a device identification to change the system type to unmanaged, and still no change in status. It seems to me if it was only polling via snmp and snmp on a system was stopped, the system would go critical???

Mike,
I may have to resort to that, but I'd rather not add additional jobs on my servers to check snmp/agent service state when it seems to me HP SIM should give us some indication, when doing snmp polling (like IM7 does), that something is wrong.

Rob Buxton
Honored Contributor

Re: Incorrect Hardware Status

There's some consistency in behaviour: the CMS doesn't report the Server as critical in IM or SIM when SNMP is stopped.

My only guess is that when IM tries the SNMP connection it must get something back from the Server that indicates it is still alive.

Gene Nordahl
Valued Contributor

Re: Incorrect Hardware Status

I would guess IM is also doing a ping (an upgrade over CIM). If I remember right, in win32 CIM the server would go black if the agents weren't running. Guess that dates me a little on how long I've been using IM products for monitoring... :-)
Gene Nordahl
Valued Contributor

Re: Incorrect Hardware Status

Update.
Found out that you had better not uncheck ping from your status polling. If you do and a server goes down, the HW status of the server remains normal (it does not go critical until you poll it using ping or run a device ID). But hey, with the server turned off at least the Agents and SNMP finally went into an unknown state.

So it appears that HW critical status is entirely driven by ping. I wonder what would happen if we had to block ICMP (ping) traffic on our network. All servers critical?
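
To summarise what the tests in this thread seem to show, here is a small model of the two behaviours. It only restates the observations above and is not HP SIM's or IM7's actual logic; the function names and status strings are made up for the sketch, and the IM7 "Critical on failed ping" branch is an assumption based on the guess that IM7 also pings.

```python
def observed_sim_status(polled, ping_ok, last_status="Normal"):
    """Overall HW status as HP SIM appeared to behave in these tests.

    Only a failed ping (when ping is one of the polled protocols) moved
    the status to Critical; a stopped SNMP service changed nothing.
    """
    if "ping" in polled and not ping_ok:
        return "Critical"
    return last_status


def observed_im7_status(ping_ok, snmp_ok):
    """Overall HW status as IM7 appeared to behave (Unknown is the blue dot)."""
    if not ping_ok:
        return "Critical"  # assumption: IM7 also pings
    if not snmp_ok:
        return "Unknown"
    return "Normal"
```

Under this model, a server with SNMP stopped stays Normal in SIM but goes Unknown in IM7, and a powered-off server polled without ping stays Normal in SIM. If the model holds, blocking ICMP while ping is still a polled protocol would indeed turn every server Critical, but that part has not been tested here.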