Re: critical temperature warning difference ia64 and PA-RISC ?

 
Franky Leeuwerck_2
Super Advisor

critical temperature warning difference ia64 and PA-RISC ?

Hi,

I used to work with HP-UX PA-RISC systems, and whenever there was an air-conditioning problem, a clear 'OVERTEMP_CRIT WARNING' message appeared in the syslog.

Now I'm mostly dealing with HP-UX Itanium systems, and in the same scenario the syslog only shows this message:

EMS [4471]: EMS Event Notification
Value: "MAJORWARNING (3)" for Resource: "/system/events/ia64_corehw/core_hw" (Threshold: >= " 3")
event details: /opt/resmon/bin/resdata -R 293011458 -r /system/events/ia64_corehw/core_hw -n 293011457 -a

I now know that I can retrieve more detail from the ELM message:

>-- Event Monitoring Service Event Notification --<

/system/events/ia64_corehw/core_hw is >= 3.
Its current value is MAJORWARNING(3).
Event data from monitor:

Event Time..........: Thu Nov 23 18:34:19 2006
Severity............: MAJORWARNING
Monitor.............: ia64_corehw
Event #.............: 101011
System..............: myHost.mydomain
Summary:
System Temperature is at non-recoverable level.


Is there a way to get those 'old', clear messages back into the syslog on HP-UX Itanium systems?
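As a possible stopgap, the fields of the notification shown above could be condensed into a single readable line. The following is only a minimal sketch: it assumes Python is available on the host and that the notification text always follows the "Name....: value" layout shown above. The names `SAMPLE`, `parse_ems_notification` and `one_line` are hypothetical and not part of EMS.

```python
import re

# Matches lines such as "Severity............: MAJORWARNING"
# (a field name, a run of dots, a colon, then the value).
FIELD_RE = re.compile(r"^\s*([^.:\n]+?)\s*\.+:\s*(.*)$")

# Sample text copied from the notification quoted in this post.
SAMPLE = """\
>-- Event Monitoring Service Event Notification --<

/system/events/ia64_corehw/core_hw is >= 3.
Its current value is MAJORWARNING(3).
Event data from monitor:

Event Time..........: Thu Nov 23 18:34:19 2006
Severity............: MAJORWARNING
Monitor.............: ia64_corehw
Event #.............: 101011
System..............: myHost.mydomain
Summary:
System Temperature is at non-recoverable level.
"""


def parse_ems_notification(text):
    """Return a dict of the dotted fields plus the first Summary line."""
    fields = {}
    lines = text.splitlines()
    for i, line in enumerate(lines):
        match = FIELD_RE.match(line)
        if match:
            fields[match.group(1).strip()] = match.group(2).strip()
        elif line.strip() == "Summary:":
            # The summary text is the first non-empty line after "Summary:".
            for following in lines[i + 1:]:
                if following.strip():
                    fields["Summary"] = following.strip()
                    break
    return fields


def one_line(fields):
    """Build a single log-friendly line from the parsed fields."""
    return "{}: {} (monitor={}, event={}, host={})".format(
        fields.get("Severity", "UNKNOWN"),
        fields.get("Summary", "no summary"),
        fields.get("Monitor", "?"),
        fields.get("Event #", "?"),
        fields.get("System", "?"),
    )


if __name__ == "__main__":
    print(one_line(parse_ems_notification(SAMPLE)))
```

The resulting line could then be handed to Python's standard `syslog` module or to the `logger` command. How the notification text reaches such a script (mail, file, or something else) depends on the local EMS configuration and is not covered here.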

Regards
Franky
14 REPLIES
Calandrello
Trusted Contributor

Re: critical temperature warning difference ia64 and PA-RISC ?

Friend,
If the system temperature reaches a certain threshold, the server will go into halt mode.
Andrew Merritt_2
Honored Contributor
Solution

Re: critical temperature warning difference ia64 and PA-RISC ?

Hi Franky,
On PA systems, you will get an EMS event 33 from dm_core_hw when the first temperature threshold (Low) is reached, and you will get the OVERTEMP_CRIT message from envd in syslog at around the same time.

You then get an EMS event 34 when the next threshold (Mid) is reached, and envd should initiate a shutdown (according to the configuration in /etc/envd.conf).

If the system is still running for some reason and the last level (Hi) is reached, it will be powered off.

On IPF systems, things have changed. At the Low threshold, you should still see OVERTEMP_CRIT in syslog. You will also get an EMS event, from either fpl_em or ia64_corehw (it depends on whether the system is cell-based or not). At the Mid overtemp threshold, the system will be shut down by the firmware, not by envd, and no events or messages get logged. This is a 'soft' shutdown. If the High threshold is reached, it is again a hard power off.
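For quick reference, the behaviour described above can be written down as a small lookup table. This is only a sketch that restates the description in this reply; the names `OVERTEMP_BEHAVIOUR` and `describe` are made up for illustration, and the PA "Hi" level is listed as "High" so that both platforms use the same keys.

```python
# Restates the threshold behaviour described in this reply.
# Keys are (platform, threshold level); the layout is illustrative only.
OVERTEMP_BEHAVIOUR = {
    ("PA", "Low"): "EMS event 33 from dm_core_hw; envd logs OVERTEMP_CRIT to syslog",
    ("PA", "Mid"): "EMS event 34; envd initiates a shutdown as configured in /etc/envd.conf",
    ("PA", "High"): "system is powered off",
    ("IPF", "Low"): "OVERTEMP_CRIT in syslog; EMS event from fpl_em or ia64_corehw",
    ("IPF", "Mid"): "firmware performs a soft shutdown; no events or messages are logged",
    ("IPF", "High"): "hard power off",
}


def describe(platform, level):
    """Look up what happens on a platform at a given overtemp threshold."""
    key = (platform.upper(), level.capitalize())
    return OVERTEMP_BEHAVIOUR.get(key, "unknown platform or threshold")


if __name__ == "__main__":
    for platform, level in sorted(OVERTEMP_BEHAVIOUR):
        print(platform, level, "->", describe(platform, level))
```

The table makes the practical difference easy to spot: on IPF the Mid threshold leaves nothing in the logs, so the Low-threshold message is the only warning that can be captured.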

What system type do you have, and which version of the OnlineDiags (STM)? There were some problems where envd was not notified of the Low threshold (OVERTEMP_CRIT) on non-cellular systems, but that should be fixed in current versions of the OnlineDiags.

Andrew


A. Clay Stephenson
Acclaimed Contributor

Re: critical temperature warning difference ia64 and PA-RISC ?

NOTE: You are asking the fundamentally wrong question because computers are lousy thermometers (and clocks). Go buy yourself a digital thermometer that has either a serial interface or a network connection. Even inexpensive models these days actually have a web interface. This will allow you to write your monitoring script one time and never have to worry about changes in software or models.

Even better, always have at least N + 1 cooling capacity so that you can tolerate the failure of any one unit without problems. The problem with relying upon a warning scheme is getting someone to actually shut down the equipment in a timely manner. The computer may shut itself down, but what about other devices such as disk and tape drives, which will continue to run even if the temperature excursion is extreme?

... and even better still is to have a main breaker equipped with an auxiliary trip coil, connected to a thermal switch that will disconnect all power should a preset value be exceeded. This keeps you out of trouble should more than one of your HVAC units fail.



If it ain't broke, I can fix that.
Andrew Merritt_2
Honored Contributor

Re: critical temperature warning difference ia64 and PA-RISC ?

A. Clay, I understand the point you are making, but you're not answering the question that was asked.

The question was about the apparent differences in warnings for overtemp conditions. It was NOT about different temperatures at which the warnings appear.

Andrew
Franky Leeuwerck_2
Super Advisor

Re: critical temperature warning difference ia64 and PA-RISC ?

Hi everyone,

Thanks for your answers, all. Clay too, even though that was not the question.

Franky
A. Clay Stephenson
Acclaimed Contributor

Re: critical temperature warning difference ia64 and PA-RISC ?

and the correct answer is still "Big Woo" because by the time any of you have done anything about this, it's almost certainly too late and permanent damage has already been done. One of the common exercises in an intermediate electronics lab is to measure the characteristics of a semiconductor device such as a junction transistor, subject the device to an intentional over-temperature condition, and then measure those same characteristics after the device has cooled back down to the original temperature. The characteristics have been permanently (and irreversibly) changed. That's why measuring does little good, and even monitoring schemes with a (real) thermometer are of limited value: by the time someone gets there, it's often much too late and potentially millions of dollars of damage have been done.
Franky Leeuwerck_2
Super Advisor

Re: critical temperature warning difference ia64 and PA-RISC ?

I'll come back to this topic tomorrow; right now I urgently need to pick up the kids.

See you..
Franky Leeuwerck_2
Super Advisor

Re: critical temperature warning difference ia64 and PA-RISC ?

Hi Clay,

Thanks for the input.
I agree that in most cases logging temperature errors makes little sense, because the time available to intervene is too short.
However, these systems are located on the other side of the planet, and the warnings help us work out afterwards what was going on.

Franky
Franky Leeuwerck_2
Super Advisor

Re: critical temperature warning difference ia64 and PA-RISC ?

Andrew,

The STM version we run is:
Support Tools Manager, Version C.46.05, Product Number B4708AA


Franky