System Temperature is at non-recoverable level
09-27-2011 02:43 PM - edited 09-27-2011 02:46 PM
Hi All,
We received the warning message below from EMS.
****************************
EMS Event: "MAJORWARNING (3)" for Resource: "/system/events/ia64_corehw/core_hw" (Threshold: >= " 3") Execute the following command to obtain event details: /opt/resmon/bin/resdata -R 212271106 -r /system/events/ia64_corehw/core_hw -n 212271105 -a
****************************
After checking the error details, we found that this is related to system temperature.
****************************
CURRENT MONITOR DATA:
Event Time..........: Mon Sep 26 14:13:14 2011
Severity............: MAJORWARNING
Monitor.............: ia64_corehw
Event #.............: 101011
System..............: drctweet
Summary:
System Temperature is at non-recoverable level.
Description of Error:
The system temperature is not within normal operating range. It is higher
than required operating range.
Probable Cause / Recommended Action:
Something may be blocking the cooling intakes of the fans. Check for
obstruction.
One or more fans may be operating at lower speed than normal. Check the
fan performance.
Check for problems with the room air conditioning.
If the problem is not fixed, the operating temperature may become
non-recoverable, in which case there are chances that the hardware may be
damaged. At that temperature level, the action specified in the envd
configfile will be taken - which may be to shutdown the system
automatically.
For information on the sensor that generated this event, refer to FRU ID
in Event Details section.
Additional Event Data:
System IP Address...: 192.168.5.101
Event Id............: 0x4e80c0ba00000000
Monitor Version.....: B.01.00
Event Class.........: System
Client Configuration File...........:
/var/stm/config/tools/monitor/default_ia64_corehw.clcfg
Client Configuration File Version...: A.01.00
Qualification criteria met.
Number of events..: 1
Associated OS error log entry id(s):
None
Additional System Data:
System Model Number.............: 9000/800/rp4440
EMS Version.....................: A.04.00
STM Version.....................: A.43.00
System Serial Number............: USE45358BL
Latest information on this event:
http://docs.hp.com/hpux/content/hardware/ems/ia64_corehw.htm#101011
v-v-v-v-v-v-v-v-v-v-v-v-v D E T A I L S v-v-v-v-v-v-v-v-v-v-v-v-v
Event Details :
Event Date .............: Mon Sep 26 14:13:10 2011
Sensor Number ..........: 0xd7
Sensor Type ............: Temperature
Sensor Class ...........: Threshold based
Sensor Reading/Offset...: 0x07 (Offset)
Event Type.............: Assertion
Entity ID ..............: 23
Generic Message.........:
Temperature : Upper non-critical - going high
Entity FRU Id Info......:
system chassis (Sensor ID: FP Ambient Temp)
****************************
We need to check the fan performance status, but I am not able to execute the print_manifest command. Could anyone please tell me how I can check the system temperature or the current fan parameters?
Please help me find a possible resolution for this problem, as it could lead to a major hardware failure. All suggestions will be appreciated. Please let me know if you require any other details.
----
Thanks
Amit Mohabanshi
09-27-2011 03:17 PM
Re: System Temperature is at non-recoverable level.
@Amit_Mohabanshi wrote: We need to check the fan performance status, but I am not able to execute the print_manifest command. Could anyone please tell me how I can check the system temperature or the current fan parameters?
Please help me find a possible resolution for this problem, as it could lead to a major hardware failure. All suggestions will be appreciated. Please let me know if you require any other details.
Walk to the server, or have someone nearby do that, assuming that the server hasn't already shut down. Why do you think that a 'print_manifest' is going to help you? You have a serious problem with your server, or perhaps with the whole environment in which it resides!
Regards!
...JRF...
09-27-2011 03:57 PM
Re: System Temperature is at non-recoverable level.
Thanks for the reply, but I can only access the server remotely. The server has not gone down since we received this message and has been running without any further issue. We have not seen this warning message repeated in syslog. Even so, I want to make sure that the system temperature is now normal and that all fans are running at normal speed, which is required for the components to function properly and for the health of the server.
Please help me find the current system temperature status as well as the fan speed. I tried SAM, but had no luck. Is there any other command or way to find it?
------
Thanks & Regards
Amit Mohabanshi
09-27-2011 09:55 PM
Re: System Temperature is at non-recoverable level.
Connect to the MP and run "ps":
MP:CM> ps
You can also use the System Management Homepage:
http://your-server:2301
Hope this helps!
Regards
Torsten.
__________________________________________________
There are only 10 types of people in the world -
those who understand binary, and those who don't.
__________________________________________________
09-27-2011 11:21 PM
Re: System Temperature is at non-recoverable level.
Hi,
Check the logs from the MP. In MP CM mode, you can run PS to get the details.
From the MP, go to the VFP and check whether the LED status is green or amber.
Other than this, I don't know whether there is any other way to get temperature alert information.
09-28-2011 03:12 AM - edited 09-28-2011 03:14 AM
Re: System Temperature is at non-recoverable level.
Hi,
I am aware of the MP route, but I am not able to access the MP; I only have access to the server itself. Is there any way to check this from the server? If there is, please let me know. I suspect the problem has been resolved at the server end, as I haven't received any further warning messages, but I still want to verify this for my own confirmation.
I haven't found that message repeated later in syslog. Can I assume that the system is now operating at normal temperature?
Please let me know how to proceed.
Thank you all for your support.
---
Thanks & Regards
Amit Mohabanshi
09-28-2011 07:38 AM
Solution
According to the envd man page, the warning message and the corresponding action (if specified) will happen once, and only once, per level transition. So, unless you see an explicit message from envd saying "Temperature is back to the normal machine operating range", you should assume the server is still running too hot.
In that event, you should assume that either something is blocking the server's airflow or there is a problem in the server room cooling. In both these cases, it's appropriate to alert someone who can physically check the server ASAP and make sure it's OK. Maybe the server room maintenance person simply needs to go and move some spare parts box or whatever that has been accidentally left against the air intake of this server and is blocking the airflow. Or maybe the maintenance person feels a heatwave as soon as s/he opens the server room door, and finds that the server room HVAC equipment has failed. That would be a very bad situation for all the servers in that server room.
If you have other systems whose state you can check, you may be able to estimate the severity of the situation. If other systems in the same physical location are also sending temperature alerts, it's probably an HVAC failure and requires immediate action to minimize damage. If this is the only server that is too hot, it must still be checked: it might be an airflow blockage, an airflow management error (a "hot spot" in the server room environment) or something else.
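The reasoning above can be written down as a rough decision rule. This is only a sketch: the function name `triage` and the host name `otherhost` are made up for illustration, and the input (which hosts have reported a temperature alert) has to be collected by hand or by whatever monitoring you already have.

```python
def triage(alerts):
    """Rough severity estimate from temperature alerts per host.

    alerts maps a hostname to True if that host has reported a
    temperature alert, False otherwise (hypothetical input format).
    """
    hot = sorted(host for host, alerting in alerts.items() if alerting)
    if not hot:
        return "no temperature alerts seen"
    if len(hot) > 1:
        # Several machines in the same room running hot points at the room.
        return "several hosts alerting (%s): suspect room cooling (HVAC)" % ", ".join(hot)
    # A single hot machine still needs a physical check.
    return "only %s alerting: suspect blocked airflow or a local hot spot" % hot[0]


if __name__ == "__main__":
    # "otherhost" is a placeholder for another server in the same room.
    print(triage({"drctweet": True, "otherhost": False}))
```

Either result still means someone should physically look at the server; the rule only helps decide how widely to raise the alarm.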
Checking the envd messages in syslog is the way to check the situation from within the server OS. If that does not provide enough information, you should contact someone who is able to access the MP. But since there is a possibility that the entire server room may be overheating, and seeing the MP information will not solve the issue if the server actually is overheating, getting someone to physically check the server and its surroundings should be the most urgent action.
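As a minimal sketch of that syslog check, the script below walks through syslog lines in order and remembers the last temperature state that envd reported. The "back to normal" phrase is the one quoted above; treating any other envd line that mentions "temperature" as an abnormal transition is an assumption, as is the helper name `last_envd_state`, so compare the result against the actual log text before trusting it.

```python
import os

# Phrase quoted from the envd message for a return to normal range.
NORMAL_PHRASE = "temperature is back to the normal machine operating range"

# Usual HP-UX syslog location; adjust if your system logs elsewhere.
SYSLOG_PATH = "/var/adm/syslog/syslog.log"


def last_envd_state(lines):
    """Return "normal", "abnormal", or None (no envd temperature lines)."""
    state = None
    for line in lines:
        lowered = line.lower()
        if "envd" not in lowered:
            continue
        if NORMAL_PHRASE in lowered:
            state = "normal"
        elif "temperature" in lowered:
            # Assumption: any other envd temperature message is a
            # transition to a warmer-than-normal level.
            state = "abnormal"
    return state


if __name__ == "__main__":
    if os.path.exists(SYSLOG_PATH):
        with open(SYSLOG_PATH, errors="replace") as log:
            print(last_envd_state(log))
    else:
        print("syslog not found at", SYSLOG_PATH)
```

A result of None only means no envd temperature lines were found in that file; it does not prove the temperature is normal, since older messages may have been rotated out of the current log.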