
System Temperature is at non-recoverable level.

 
SOLVED
Amit_Mohabanshi
Occasional Advisor

System Temperature is at non-recoverable level.

Hi All,

We received the warning message below from EMS.

 

****************************

EMS Eevnt: "MAJORWARNING (3)" for Resource: "/system/events/ia64_corehw/core_hw" (Threshold: >= " 3") Execute the following command to obtain event details: /opt/resmon/bin/resdata -R 212271106 -r /system/events/ia64_corehw/core_hw -n 212271105 -a 

 

****************************

 

 

After checking the error details, we found that this is related to the system temperature.

 

****************************

CURRENT MONITOR DATA:

Event Time..........: Mon Sep 26 14:13:14 2011
Severity............: MAJORWARNING
Monitor.............: ia64_corehw
Event #.............: 101011
System..............: drctweet

Summary:
System Temperature is at non-recoverable level.


Description of Error:

The system temperature is not within normal operating range. It is higher
than required operating range.

Probable Cause / Recommended Action:

Something may be blocking the cooling intakes of the fans. Check for
obstruction.
One or more fans may be operating at lower speed than normal. Check the
fan performance.

Check for problems with the room air conditioning.

If the problem is not fixed, the operating temperature may become
non-recoverable, in which case there are chances that the hardware may be
damaged. At that temperature level, the action specified in the envd
configfile will be taken - which may be to shutdown the system
automatically.

For information on the sensor that generated this event, refer to FRU ID
in Event Details section.

Additional Event Data:
System IP Address...: 192.168.5.101
Event Id............: 0x4e80c0ba00000000
Monitor Version.....: B.01.00
Event Class.........: System
Client Configuration File...........:
/var/stm/config/tools/monitor/default_ia64_corehw.clcfg
Client Configuration File Version...: A.01.00
Qualification criteria met.
Number of events..: 1
Associated OS error log entry id(s):
None
Additional System Data:
System Model Number.............: 9000/800/rp4440
EMS Version.....................: A.04.00
STM Version.....................: A.43.00
System Serial Number............: USE45358BL
Latest information on this event:
http://docs.hp.com/hpux/content/hardware/ems/ia64_corehw.htm#101011

v-v-v-v-v-v-v-v-v-v-v-v-v D E T A I L S v-v-v-v-v-v-v-v-v-v-v-v-v


Event Details :

Event Date .............: Mon Sep 26 14:13:10 2011
Sensor Number ..........: 0xd7
Sensor Type ............: Temperature
Sensor Class ...........: Threshold based
Sensor Reading/Offset...: 0x07 (Offset)
Event Type.............: Assertion
Entity ID ..............: 23
Generic Message.........:
Temperature : Upper non-critical - going high
Entity FRU Id Info......:
system chassis (Sensor ID: FP Ambient Temp)
****************************

 

Here we need to check the fan performance status, but I am not able to run the print_manifest command. Could anyone please tell me how I can check the system temperature or the current fan parameters?

 

Please help me find a possible resolution for this problem, as it could lead to a major hardware failure. Any suggestions will be appreciated. Please let me know if you require any other details.

 

----

Thanks

Amit Mohabanshi

 

6 REPLIES
James R. Ferguson
Acclaimed Contributor

Re: System Temperature is at non-recoverable level.


Amit_Mohabanshi wrote:

Here we need to check the fan performance status, but I am not able to run the print_manifest command. Could anyone please tell me how I can check the system temperature or the current fan parameters?

 

Please help me find a possible resolution for this problem, as it could lead to a major hardware failure. Any suggestions will be appreciated. Please let me know if you require any other details.

 


Walk to the server or have someone nearby do that.  Why do you think that a 'print_manifest' is going to help you, assuming that the server hasn't already shut down?  You have a serious problem with your server or perhaps with the whole environment in which it resides!

 

Regards!

 

...JRF...

Amit_Mohabanshi
Occasional Advisor

Re: System Temperature is at non-recoverable level.

Hi James,

Thanks for the reply, but I can only access the server remotely. The server has not gone down since we received this message and has been running without further issues. We have not seen this warning repeated in syslog. Even so, I want to make sure that the system temperature is now normal and that all fans are running at normal speed, which is required for the components to function properly and for the health of the server.

Please help me find the current system temperature status and fan speed. I tried SAM, but with no luck. Is there any other command or way to find it?

------
Thanks & Regards
Amit Mohabanshi
Torsten.
Acclaimed Contributor

Re: System Temperature is at non-recoverable level.

Connect to the MP and run "ps":

 

 

MP:CM> ps

 

You can also use the System Management Homepage:

 

http://your-server:2301

 


Hope this helps!
Regards
Torsten.

__________________________________________________
There are only 10 types of people in the world -
those who understand binary, and those who don't.

223848
Frequent Advisor

Re: System Temperature is at non-recoverable level.

Hi,

Check the logs from the MP; in MP CM mode you can run PS to get the details.

 

From the MP, go to the VFP and check whether the LED status is green or amber.

 

Other than this, I don't know whether you can get any temperature alert.

Amit_Mohabanshi
Occasional Advisor

Re: System Temperature is at non-recoverable level.

Hi,

I am aware of the MP route, but I cannot access the MP; I only have access to the server itself. Is there any way to check this from the server? If there is, please let me know. I suspect the problem has been resolved at the server end, as I haven't received any further warning messages. Even so, I want to verify it for my own confirmation.

 

I haven't found that message repeated later in syslog. Shall I assume that the system is now operating at normal temperature?

Please let me know how to proceed.

 

Thank you all for your support.

 

---

Thanks & Regards

Amit Mohabanshi

Matti_Kurkela
Honored Contributor
Solution

Re: System Temperature is at non-recoverable level.

According to the envd man page, the warning message and the corresponding action (if specified) will happen once, and only once, per level transition. So, unless you see an explicit message from envd saying "Temperature is back to the normal machine operating range", you should assume the server is still running too hot.
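
As a rough illustration of that rule, here is a minimal shell sketch that looks at the newest envd line in a syslog file and reports whether it is the recovery message. It rests on assumptions not confirmed in this thread: that envd lines in syslog contain the string "envd", and that the recovery message contains the phrase "back to the normal" as quoted above. The function name last_envd_state is made up for this example; /var/adm/syslog/syslog.log is the usual HP-UX syslog location.

```shell
#!/bin/sh
# Sketch only: classify the most recent envd message in a syslog file.
# ASSUMPTIONS: envd lines contain the string "envd"; the recovery
# message contains "back to the normal". Adjust both to match the
# messages your system actually logs.

# Usual HP-UX syslog location; pass another path to override.
SYSLOG_DEFAULT=/var/adm/syslog/syslog.log

last_envd_state() {
    log=${1:-$SYSLOG_DEFAULT}
    if [ ! -r "$log" ]; then
        echo "unknown: cannot read $log"
        return 1
    fi
    # Remember the newest line that mentions envd.
    last=$(awk '/envd/ { line = $0 } END { if (line != "") print line }' "$log")
    case "$last" in
        "")                     echo "unknown: no envd messages found" ;;
        *"back to the normal"*) echo "normal: $last" ;;
        *)                      echo "assume-hot: $last" ;;
    esac
}
```

Calling last_envd_state with no argument reads the default log. An "unknown" result does not mean the server is cool; it only means this log holds no envd line, for example because the log was rotated after the event.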

 

In that event, you should assume that either something is blocking the server's airflow or there is a problem in the server room cooling. In both these cases, it's appropriate to alert someone who can physically check the server ASAP and make sure it's OK. Maybe the server room maintenance person simply needs to go and move some spare parts box or whatever that has been accidentally left against the air intake of this server and is blocking the airflow. Or maybe the maintenance person feels a heatwave as soon as s/he opens the server room door, and finds that the server room HVAC equipment has failed. That would be a very bad situation for all the servers in that server room.

 

If you have other systems whose state you can check, you may be able to estimate the severity of the situation. If other systems in the same physical location are also sending temperature alerts, it's probably an HVAC failure and requires immediate action to minimize damage. But if this is the only server that is too hot, it must still be checked: it might be airflow blockage, an airflow management error (a "hot spot" in the server room environment) or something else.
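
If you can collect copies of the syslog files from the neighbouring systems, a comparison along these lines could be scripted. This is only a sketch: the function name hosts_with_temp_alerts is hypothetical, and it assumes that the relevant lines contain both "envd" and the word "temperature", which may not match the exact wording on every system.

```shell
#!/bin/sh
# Sketch only: count how many collected syslog copies contain envd
# temperature messages.
# ASSUMPTION: such lines contain "envd" and the word "temperature".

hosts_with_temp_alerts() {
    total=0
    hot=0
    for f in "$@"; do
        total=$((total + 1))
        # Two-stage match: envd lines first, then the temperature keyword.
        if grep 'envd' "$f" 2>/dev/null | grep -i 'temperature' >/dev/null; then
            hot=$((hot + 1))
            echo "temperature message found in: $f"
        fi
    done
    echo "$hot of $total logs show temperature messages"
}
```

Run it as hosts_with_temp_alerts followed by the paths of the collected log copies. Several hits from the same room point towards the cooling rather than a single server, but a result of zero proves nothing if the other systems do not run envd or log differently.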

 

Checking the envd messages in syslog is the way to check the situation from within the server OS. If that does not provide enough information, you should contact someone who is able to access the MP. But since there is a possibility that the entire server room may be overheating, and seeing the MP information will not solve the issue if the server actually is overheating, getting someone to physically check the server and its surroundings should be the most urgent action.

 

MK