HW Event Notification

 
Doug_3
Frequent Advisor

HW Event Notification

Hello, I would like to know whether we can get a greater level of detail in our HP-UX EMS hardware notifications for temperature. We page/email when the temp reaches the default setting, but the message is too generic (see below).

Can someone point me to the config file where these settings are maintained, or let me know if they are hardcoded and we cannot get more info from the standard EMS processes?

We are looking for the actual cabinet temp at the time of the notification, what the shutdown temp is, etc.

Thanks in advance,
Doug

Event Time..........: Thu May 31 20:12:26 2007
Severity............: CRITICAL
Monitor.............: dm_core_hw
Event #.............: 33
System..............: IFASHP.spokaneschools.org

Summary:
Processor cabinet intake temperature is too hot
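
For anyone who wants to route or enrich these notifications in a script, here is a minimal Python sketch that splits the message above into fields. It assumes the notification text always follows the "Name....: value" layout shown in this post; the function name parse_ems_event is made up for illustration and is not part of EMS.

```python
import re

SAMPLE_EVENT = """\
Event Time..........: Thu May 31 20:12:26 2007
Severity............: CRITICAL
Monitor.............: dm_core_hw
Event #.............: 33
System..............: IFASHP.spokaneschools.org

Summary:
Processor cabinet intake temperature is too hot
"""

# Matches "Name....: value" lines; the dot padding varies in width.
FIELD_RE = re.compile(r"^(?P<name>[^.:\n]+?)\.*:\s*(?P<value>.*)$")


def parse_ems_event(text):
    """Return a dict of the header fields plus the summary text."""
    fields = {}
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.strip() == "Summary:":
            # Everything after "Summary:" is free text.
            rest = [l.strip() for l in lines[i + 1:] if l.strip()]
            fields["Summary"] = " ".join(rest)
            break
        match = FIELD_RE.match(line)
        if match:
            fields[match.group("name").strip()] = match.group("value").strip()
    return fields


event = parse_ems_event(SAMPLE_EVENT)
print(event["Severity"], event["Event #"], event["Summary"])
```

This only restructures what EMS already sends; it cannot add a temperature value that is not in the message.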
A. Clay Stephenson
Acclaimed Contributor

Re: HW Event Notification

I cringe every time I see a question like yours because the real answer is to fix the problem --- inadequate cooling. Some models do allow querying the temperature but the vast majority can only issue fixed status messages. Moreover, there is no general approach that will work across all models AND you are ignoring other components such as disk arrays, network switches, and other peripherals.

There is an instrument that is designed to do this task; I think it is called a "thermometer".
You can find digital thermometers that have a serial interface, and some are even web-enabled. Most will allow for multiple temperature probes. This is the approach I would take and then you can remotely measure temperature regardless of the equipment.

Of course, the real answer is to have adequate N + 1 cooling so that you can lose an entire HVAC unit and your equipment doesn't fail.

If it ain't broke, I can fix that.
Scot Bean
Honored Contributor

Re: HW Event Notification

The overtemp settings are generally hard coded in the machine firmware. I would not recommend trying to change them, unless you want to fry your box.

If you tell us the model this is, someone could maybe find a spec.

You can also see a bit more detail via the console interface to the firmware/support processor (Ctrl-B) via the 'PS' (power status) command. It tells you which threshold you are at.

Event #33 is the first warning threshold. If you get even hotter the machine should shut itself off.
Doug_3
Frequent Advisor

Re: HW Event Notification

Thanks, but that was not what I was asking. I want to know what the internal chassis temp is set to when STM/EMS generates an event triggering whatever actions we have set in the configuration. I also want to know if the temp reading is hard coded or if we can include that in the EMS notification.

We do have N+1 cooling and we have temp gauges on other hardware as well as HVAC notifications.

Thanks anyways.
OldSchool
Honored Contributor

Re: HW Event Notification

The firmware/hardware *generates* the event, AFAIK. EMS simply reports it. There is nothing that can be configured.

man 1m dm_core_hw for more
Scot Bean
Honored Contributor

Re: HW Event Notification

If you share with us the model of the machine, someone may be able to look up the specs.
Doug_3
Frequent Advisor

Re: HW Event Notification

Thank you,
rp7400 A3639C
Scot Bean
Honored Contributor
Solution

Re: HW Event Notification

Looks like the specs for rp7400 are warning at 35C, shutdown (ungraceful) at 40C.

Of course these temps are inside the cabinet, NOT the computer room air. Also, these temps are probably only accurate to +/- 2 degrees C or so, so they can vary.
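
If you have an independent reading (for example from a probe placed near the cabinet intake), a rough Python sketch like the one below could compare it against the figures quoted above. The 35C and 40C values and the 2 degree margin come from this thread; the function name and labels are made up for illustration, and nothing here reads a sensor on the server itself.

```python
WARN_C = 35.0       # warning threshold quoted in this thread for the rp7400
SHUTDOWN_C = 40.0   # ungraceful shutdown threshold quoted in this thread
TOLERANCE_C = 2.0   # rough measurement tolerance mentioned in this thread


def classify(temp_c, warn=WARN_C, shutdown=SHUTDOWN_C, tol=TOLERANCE_C):
    """Label a reading, treating anything within `tol` below a threshold
    as already at risk of crossing it."""
    if temp_c >= shutdown - tol:
        return "shutdown-risk"
    if temp_c >= warn - tol:
        return "warning-risk"
    return "ok"


for reading in (30.0, 33.0, 38.0):
    print(reading, classify(reading))
```

Subtracting the tolerance errs on the side of alerting early, since the box may see a higher value than your probe does.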
Bill Hassell
Honored Contributor

Re: HW Event Notification

The computer hardware has a two-stage temperature sensor: too warm (warning) and way too hot (critical). There is no thermometer, no readout, nothing but these two levels. Your computer may have shut itself down but everything else is frying. Your computer room is way, way too hot if either message is reported -- and if no one is within a few minutes of the computer room so they can hit the panic button to shut down all power to the room, damage has already occurred. Proper temperature control seems to be a very low priority until it is too late.

I have personally witnessed over $100,000 in damage when an air conditioner (just one) was turned off by a timer Sunday afternoon and the internal temperature went to an estimated 140 degrees. Four disk drives were destroyed, a tape drive, all the networking and several computers without overtemp shutdown were damaged beyond reliable repair. This company ignored requests for separate, dual air conditioners and instead spliced some ductwork off the building system into the computer room. The $100k was just hardware -- downtime was several weeks.


Bill Hassell, sysadmin
Paul Clark_9
Advisor

Re: HW Event Notification

Scot Bean,

Would it be possible to find out the temp thresholds for an rp8420? We have a similar problem occurring at the moment and need to understand this a bit more.

Regards
Torsten.
Acclaimed Contributor

Re: HW Event Notification

@Paul:

Operating Temperature for rp8420 is 41° to 95° F (5° to 35° C)

You can check the inlet temperature from MP (DE command).

Hope this helps!
Regards
Torsten.

__________________________________________________
There are only 10 types of people in the world -
those who understand binary, and those who don't.

__________________________________________________
No support by private messages. Please ask the forum!

If you feel this was helpful please click the KUDOS! thumb below!   

Re: HW Event Notification

Torsten,

Would you also happen to know where the temperature margins lie, i.e. when it reaches Critical and then Emergency? I'm guessing the Critical warning comes in at 35 degrees C, but at what temperature would the Emergency kick in and the machine shut down?

This is for an rp8420.

Regards

Matt / Paul
Torsten.
Acclaimed Contributor

Re: HW Event Notification

First check the temperature measured by the system:

MP:CM> de

Display status of the selected MP entity (for use by trained personnel only)

B - BPS (Bulk Power Supplies)
U - CLU (Cabinet Utilities: Fans, Intrusion, Clock's etc.)
A - PACI (Partition Console Interface)
G - MP (Management Processor)
P - PM (Power Management)
H - Cell Board Controller (PDHC)
Select device: u

Cabinet 0 Utilities Status
FW Revision : 8.005 built Sep 26 2006 at 16:11:20

PWR SBY MP RUN REM ATT FLT
Front Panel LED State : * * * . . . .

Inlet Air Temperature : 20 deg C


You should consider everything above 30° C as critical, IMHO.

From the system point of view, an internal temp below 35° C is normal. Above this point it will create warnings and speed up the fans. This sounds like a plane taking off.

The values are similar to the values mentioned above. Be aware of some tolerances while measuring (done inside the box!).

BTW, the system will perform a "reboot -h (-q?)".
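
If you save the MP session to a log file, a small Python sketch like this could pull the inlet reading out of the DE output shown above and compare it with the 5 to 35° C operating range quoted earlier for the rp8420. How the console text gets captured is left open, and the function names are made up for illustration.

```python
import re

# Matches the line "Inlet Air Temperature : 20 deg C" from the DE output.
INLET_RE = re.compile(r"Inlet Air Temperature\s*:\s*(-?\d+)\s*deg\s*C")


def inlet_temp_c(console_text):
    """Return the inlet temperature in deg C, or None if the line is absent."""
    match = INLET_RE.search(console_text)
    return int(match.group(1)) if match else None


def within_operating_range(temp_c, low=5, high=35):
    """Check a reading against the operating range quoted in this thread."""
    return low <= temp_c <= high


sample = (
    "Cabinet 0 Utilities Status\n"
    "FW Revision : 8.005 built Sep 26 2006 at 16:11:20\n"
    "Inlet Air Temperature : 20 deg C\n"
)
temp = inlet_temp_c(sample)
print(temp, within_operating_range(temp))
```

Returning None when the line is missing lets a caller tell "no reading" apart from a real value.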


Hope this helps!
Regards
Torsten.

Bill Hassell
Honored Contributor

Re: HW Event Notification

> We have a similar problem occurring at the moment and need to understand this a bit more.

Unless the computer is generating a bogus alert, *ANY* warning is too much, requiring immediate action (within minutes). There is nothing to understand except that the computer is way too hot. The next action steps are easy. Start an immediate shutdown of non-essential systems and peripherals, then an emergency redesign of the cooling systems with redundancy and electronic monitoring of water, temperature and power. With rp8420's running $200k to over $1 million (each), it doesn't take too much to understand that the equipment is at serious risk. It's sort of like wanting to understand whether a fire in the computer room has reached 1000 degrees or 1500 degrees.


Bill Hassell, sysadmin
Doug_3
Frequent Advisor

Re: HW Event Notification

thank you.