HPE 9000 and HPE e3000 Servers

rp2430 reboot after overtemp detected by envd

 
thomas gourdin
Occasional Contributor

rp2430 reboot after overtemp detected by envd

Hi,

On our critical servers we have had many temperature problems. When envd detects an overtemp condition it executes a reboot, but I get no alarm beforehand. The room temperature is fine (22 °C) and all the cabinet fans are OK.

OLDsyslog.log :

Oct 3 04:23:33 poseidon /usr/sbin/envd[1910]: ***** OVERTEMP_EMERG WARNING *****

But in STM, all the fans seem to be good.
I think it's the blower fan, but I can't test it.

See the zipped logs in the attachment.

Thanks

Thomas GOURDIN
System & Database Administrator
LATelec - LATecoere Gro
Michael Steele_2
Honored Contributor

Re: rp2430 reboot after overtemp detected by envd

STM is not the GSP.

Call HP for a HW call.

Rule of thumb, any GSP alert over 11 or 12 is a HW call.
Support Fatherhood - Stop Family Law
John Waller
Esteemed Contributor

Re: rp2430 reboot after overtemp detected by envd

I don't know whether you have got to the bottom of this yet. You mentioned that the temperature was 22 degrees, but was that the temperature at 04:22 in the morning, when the server failed? I have known some companies to run their environmental controls on timers that switch off when the building is empty.
You have not mentioned which version of the O/S you are running, but under HP-UX 11.00, EMS sometimes reports temperature problems with servers. EMS is normally installed via the Support Bundle (OnlineDiag).
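To check whether the overtemp events cluster at a particular time of night, the envd lines in syslog can be tallied by hour. The following is only a rough sketch: it assumes every envd warning has the same shape as the single line quoted in the original post, the function names are made up for illustration, and the second sample line is a hypothetical filler entry, not taken from the real logs.

```python
import re
from collections import Counter

# Pattern modelled on the one envd line quoted in this thread (assumption:
# other envd warnings follow the same "***** <EVENT> WARNING *****" layout).
ENVD_LINE = re.compile(
    r"^(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+"
    r"(?P<hour>\d{2}):(?P<minute>\d{2}):(?P<second>\d{2})\s+"
    r"(?P<host>\S+)\s+\S*envd\[\d+\]:\s+\*+\s*(?P<event>[A-Z_]+)\s+WARNING"
)


def envd_events(lines):
    """Return (month, day, hour, event) for each envd warning line."""
    events = []
    for line in lines:
        m = ENVD_LINE.match(line)
        if m:
            events.append((m["month"], int(m["day"]), int(m["hour"]), m["event"]))
    return events


def events_by_hour(lines):
    """Count envd warnings per hour of the day."""
    return Counter(hour for _, _, hour, _ in envd_events(lines))


sample = [
    # Line quoted in the original post.
    "Oct 3 04:23:33 poseidon /usr/sbin/envd[1910]: ***** OVERTEMP_EMERG WARNING *****",
    # Hypothetical unrelated entry, included only to show it is skipped.
    "Oct 3 04:25:01 poseidon vmunix: some unrelated message",
]

print(envd_events(sample))
print(events_by_hour(sample))
```

In practice the lines would come from reading syslog.log and OLDsyslog.log; if most hits fall in the same early-morning hours, a timer on the air conditioning becomes a more plausible suspect.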
Alexander M. Ermes
Honored Contributor

Re: rp2430 reboot after overtemp detected by envd

Hi there.
At the time of the alarm, the temperature must have been higher than usual.
We see the same kind of reaction from our old V2500 and other HP servers.
The cause could be the air conditioning powering down, or something similar.

Sample extract from /etc/envd.conf:

OVERTEMP_CRIT:y
/usr/bin/mailx -s OverTemp-Detected abc@xyz.com

OVERTEMP_EMERG:y
/usr/bin/mailx -s OverTemp-Detected-reboot abc@xyz.com
/usr/sbin/reboot -qh

FANFAIL_CRIT:y
/usr/bin/mailx -s Fan-fail-detected abc@xyz.com


FANFAIL_EMERG:y
/usr/bin/mailx -s Fan-fail-detected-reboot abc@xyz.com
/usr/sbin/reboot -qh
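To see at a glance which events in such a file end in a reboot, the extract can be parsed into an event-to-commands mapping. This is a minimal sketch, assuming the layout shown in the extract above (an `EVENT:y` or `EVENT:n` header followed by its command lines) and assuming `#` starts a comment; `parse_envd_conf` and `events_that_reboot` are hypothetical helper names, not part of any HP tool.

```python
import re

# Header lines such as "OVERTEMP_EMERG:y" (layout taken from the extract above).
EVENT_HEADER = re.compile(r"^(?P<event>[A-Z_]+):(?P<enabled>[yn])$")


def parse_envd_conf(text):
    """Map each event name to its enabled flag and list of command lines."""
    actions = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # assumption: '#' marks a comment
            continue
        m = EVENT_HEADER.match(line)
        if m:
            current = m["event"]
            actions[current] = {"enabled": m["enabled"] == "y", "commands": []}
        elif current is not None:
            actions[current]["commands"].append(line)
    return actions


def events_that_reboot(actions):
    """Return enabled events whose command list invokes a reboot binary."""
    return sorted(
        event
        for event, entry in actions.items()
        if entry["enabled"]
        and any(cmd.split()[0].endswith("reboot") for cmd in entry["commands"])
    )


sample_conf = """\
OVERTEMP_CRIT:y
/usr/bin/mailx -s OverTemp-Detected abc@xyz.com

OVERTEMP_EMERG:y
/usr/bin/mailx -s OverTemp-Detected-reboot abc@xyz.com
/usr/sbin/reboot -qh

FANFAIL_CRIT:y
/usr/bin/mailx -s Fan-fail-detected abc@xyz.com

FANFAIL_EMERG:y
/usr/bin/mailx -s Fan-fail-detected-reboot abc@xyz.com
/usr/sbin/reboot -qh
"""

print(events_that_reboot(parse_envd_conf(sample_conf)))
```

Only the first word of each command is checked, so a mail subject that merely contains the word "reboot" is not counted as a reboot action.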


Rgds
Alexander M. Ermes

.. and all these memories are going to vanish like tears in the rain! final words from Rutger Hauer in "Blade Runner"
Stefan Stechemesser
Honored Contributor

Re: rp2430 reboot after overtemp detected by envd

Hi,

This really looks like a hardware problem. Unfortunately, you did not attach the complete error log. The question is: what happened BEFORE envd shut down the system?
The bootlog.txt reports multiple fan failures, unfortunately without timestamps. So the next question is: did the diagnostic really capture the current bootlog? It is better to look it up directly on the GSP. Press CTRL-B on the console, log in by pressing Enter twice, then type "sl" and choose the error log. If the fan failures are current, it may be that you either have two bad fans (fan 0 and fan 3), or the platform monitor only thought the fans were running too slowly and set their status to failed. Because your machine is running now, it seems clear that the fans are not really defective. The fan failures may also be a consequence of the overtemp, because the fans run faster at higher temperatures and the chance of a fan failure increases.
You should check this with your local response center.
Philippe LESPINE
Occasional Visitor

Re: rp2430 reboot after overtemp detected by envd

Hi everybody,

I know that STM is not the GSP.
My system runs HP-UX 11.11.

The GSP said "insufficient number of fans", but this information is too generic.
With STM, syslog and other logs, I determined the exact problem: blower 0 failure.
I checked this with my local response center, and someone came to change the power fans and cpu/cpu fans. It works for the moment.
After upgrading the Support Tools Manager, we tested the system and all the fans seem to be good. Thanks to all.

Have a nice day,

Thomas GOURDIN
System & Database administrator
LATelec - LATecoere Group