HPE 9000 and HPE e3000 Servers

Rui Vilao
Regular Advisor

Over temperature alarms handling & supported platforms

Greetings,

I had a look at some previous posts on this subject and learned a lot!
Great reference!

However, I am still missing two points.

1. On which HW platforms is this overtemp warning supported? Our customer
has RP7405 and RP5405.

2. Is it possible to trigger some command when the temperature is back to normal?

I have tried the following entry in /etc/envd.conf

TEMP_NORMAL:y
/opt/ASG/normal.sh

... but it does not work when I test it with:

# echo '\000\c' > /var/run/envd_diag


Any help/suggestion is highly appreciated.

TIA.

Kind Regards,

Rui.
"We should never stop learning"_________ rui.vilao@rocketmail.com
Paula J Frazer-Campbell
Honored Contributor

Re: Over temperature alarms handling & supported platforms

Hi

You can use the /etc/envd.conf file to warn you of overtemps.

The /sysadmin/team file just contains an information line.


Paths should be given in full:


OVERTEMP_CRIT:y
/usr/bin/mailx -s "WARNING SERVER GETTING WARM" 07956610410@one2one.net
OVERTEMP_EMERG:y
/usr/sbin/reboot -qh

FANFAIL_CRIT:y
/usr/bin/mailx -s "WARNING SERVER FAN PROBLEMS" 07956610410@one2one.net
FANFAIL_EMERG:y
/usr/sbin/reboot -qh

The file /sysadmin/team just contains a few lines of info text.
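
Note that mailx reads the message body from standard input, so each mail action needs some text to send; a short file such as /sysadmin/team could presumably serve as that input. As an alternative, here is a minimal sketch of a wrapper script that an envd.conf action line could call. The script path, the recipient, and the message format are all assumptions for illustration, not something taken from a working setup.

```shell
#!/bin/sh
# Hypothetical wrapper for an envd.conf action line, for example:
#   OVERTEMP_CRIT:y
#   /usr/local/bin/env_alert.sh OVERTEMP_CRIT
# The script name, recipient, and message format are assumptions.

RECIPIENT="${RECIPIENT:-root}"

# Build a one-line message body for the given event name.
build_body() {
    event="$1"
    echo "$(uname -n): envd reported ${event} at $(date '+%Y-%m-%d %H:%M:%S')"
}

# Mail the body; mailx reads the message text from standard input.
send_alert() {
    event="$1"
    build_body "$event" | mailx -s "WARNING ${event}" "$RECIPIENT"
}

# Only send when an event name is passed on the command line.
if [ $# -ge 1 ]; then
    send_alert "$1"
fi
```

Passing the event name as an argument keeps one script usable for every event type in envd.conf.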

Paula
If you can spell SysAdmin then you is one - anon
Zeev Schultz
Honored Contributor
Solution

Re: Over temperature alarms handling & supported platforms

1) It's hardware platform dependent. L-class (rp54xx) and N-class (rp74xx) are capable of
this.
2) Not by usual means. It reports to syslog (and to whatever is defined in envd.conf) but doesn't report when the status is back to normal.
However...

NORMAL 0x0 = 000
OVERTEMP_CRIT 0x1 = 001
OVERTEMP_EMERG 0x2 = 002
FANFAIL_CRIT 0x4 = 004
FANFAIL_EMERG 0x5 = 005

To simulate:
OVERTEMP_CRIT event:
# echo '\001\c' > /var/run/envd_diag

After any tests, set the condition back to NORMAL:
TEMP_NORMAL event:
# echo '\000\c' > /var/run/envd_diag

envd_diag is a FIFO file (works like stdout?).
So I'd trace envd (use tusc -vpf) and
learn its behaviour, but it's hardly a workaround :)
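
For repeated testing, the simulation commands above can be wrapped in a small helper. This is only a sketch: the codes and the FIFO path are the ones listed in this post, while the function names are made up for illustration.

```shell
#!/bin/sh
# Sketch: write a one-byte envd test code to the diagnostic FIFO.
# The codes and the FIFO path come from the post above; the function
# names are made up for illustration.

ENVD_DIAG="${ENVD_DIAG:-/var/run/envd_diag}"

# Map an event name to its three-digit octal code.
event_code() {
    case "$1" in
        NORMAL)         echo 000 ;;
        OVERTEMP_CRIT)  echo 001 ;;
        OVERTEMP_EMERG) echo 002 ;;
        FANFAIL_CRIT)   echo 004 ;;
        FANFAIL_EMERG)  echo 005 ;;
        *)              return 1 ;;
    esac
}

# Write the single byte for the named event, with no trailing newline.
simulate_event() {
    code=$(event_code "$1") || { echo "unknown event: $1" >&2; return 1; }
    printf "\\${code}" > "$ENVD_DIAG"
}
```

Usage would be `simulate_event OVERTEMP_CRIT` followed by `simulate_event NORMAL` once the test is done. printf is used here instead of echo '\001\c' because echo handles escape sequences differently from shell to shell.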

Zeev
So computers don't think yet. At least not chess computers. - Seymour Cray
Rui Vilao
Regular Advisor

Re: Over temperature alarms handling & supported platforms

Many thanks to Paula & Zeev.

If I understand correctly, it is not possible to run a script defined in envd.conf when the temperature condition is back to normal...

TIA,

Kind Regards,

Rui.
"We should never stop learning"_________ rui.vilao@rocketmail.com
Tobias Hartlieb_1
Occasional Advisor

Re: Over temperature alarms handling & supported platforms

Hi,

I'm not sure, but maybe there is a workaround. I have never tested it, but maybe you would like to.
These events (temperature hot, temperature back to normal) are usually also detected (I think via envd) and reported by EMS (Event Monitoring System, part of the OnlineDiagnostics, free-of-charge software).
Within EMS, the dm_core_hw monitor is responsible for the temperature (as well as fan states, etc.).
To my knowledge, the "back to normal temperature" event is of "informational" severity only, and thus by default only gets written to /var/opt/resmon/log/event.log. However, by using /etc/opt/resmon/lbin/monconfig, one could configure dm_core_hw to additionally send "informational" events to another text log file, or to any email address...
Check out http://www.docs.hp.com/hpux/diag/index.html#EMS%20Hardware%20Monitors%20(for%20HP%209000) for additional information on EMS.
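
If the event ends up in a text log, a script could pick it up from there. Below is an untested sketch of that idea: it prints log lines that were added since the last run and that match a pattern. The log path is the default one mentioned above; the state file location and the function name are assumptions, and the pattern has to be worked out from the real log entries, since the event.log format is not shown in this thread.

```shell
#!/bin/sh
# Sketch: report new log lines that match a pattern since the last run.
# The log path is from the post above; the state file and the pattern
# are placeholders, since the event.log format is not shown here.

EVENT_LOG="${EVENT_LOG:-/var/opt/resmon/log/event.log}"
STATE_FILE="${STATE_FILE:-/var/tmp/event_log.lines}"

# Print lines added to the log since the last call that match the pattern.
new_matches() {
    pattern="$1"
    seen=0
    [ -f "$STATE_FILE" ] && seen=$(cat "$STATE_FILE")
    total=$(wc -l < "$EVENT_LOG" | tr -d ' ')
    # If the log was truncated or rotated, start again from the top.
    [ "$total" -lt "$seen" ] && seen=0
    echo "$total" > "$STATE_FILE"
    # Only look at lines between the old and the new line count.
    awk -v seen="$seen" -v total="$total" -v pat="$pattern" \
        'NR > seen && NR <= total && $0 ~ pat' "$EVENT_LOG"
}
```

Run from cron, something like `new_matches 'YOUR_PATTERN' | while read -r line; do /opt/ASG/normal.sh; done` would then call the "back to normal" script once per matching line.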

Regards.

Tobias
Bill Hassell
Honored Contributor

Re: Over temperature alarms handling & supported platforms

I would be VERY concerned about using the computers in an overtemp condition. While the rp-class computers do indeed have an overtemp sensor, this is for emergencies. When an overtemp condition exists, the rest of the system (disks, tape drives, network devices, etc.) may already be damaged. The rp machines will turn themselves off before they are damaged, but you may lose disks and other peripherals. The problem with heat damage is that it is cumulative: reliability drops and intermittent failures begin to increase, leading to loss of data and/or system crashes.

The cost of adequate air conditioning AND an overtemp power disconnect is insignificant when compared to the downtime, troubleshooting and repair of equipment damaged by overtemp conditions. Using the computer as a substitute for a high-temp power disconnect is not a good idea.


Bill Hassell, sysadmin