Operating System - HP-UX

Re: critical temperature warning difference ia64 and PA-RISC ?

 
Andrew Merritt_2
Honored Contributor

Re: critical temperature warning difference ia64 and PA-RISC ?

Hi Franky,
C.46.05 is a pretty old version (HWE0409), and you should update to a recent version (the current version is HWE0609, C.54.00).

The fix for envd not reporting OVERTEMP_CRIT was included in the HWE0603 (C.51.00) release, so any earlier release (such as the one you have) will show the problem on some systems; I think it was non-cell systems that were affected, but my memory is a little hazy. See JAGaf79895 in http://www.docs.hp.com/en/diag/ems/emr_0603_1123.htm
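As a quick sanity check, a version-string comparison along these lines tells you whether an installed revision predates the C.51.00 fix. This is just an illustrative sketch, not an HP tool; only the revision strings come from this thread:

```python
# Sketch: decide whether an EMS/diagnostics revision string such as
# "C.46.05" predates the HWE0603 (C.51.00) release that fixed the
# envd OVERTEMP_CRIT reporting problem (JAGaf79895).

FIXED_IN = "C.51.00"

def revision_tuple(rev):
    """Split 'C.46.05' into ('C', 46, 5) for ordered comparison."""
    letter, major, minor = rev.split(".")
    return (letter, int(major), int(minor))

def has_overtemp_fix(installed_rev):
    return revision_tuple(installed_rev) >= revision_tuple(FIXED_IN)

print(has_overtemp_fix("C.46.05"))  # False -- update recommended
print(has_overtemp_fix("C.54.00"))  # True
```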

Andrew

Re: critical temperature warning difference ia64 and PA-RISC ?

A.Clay,

Your comment on Nov 28 at 16.45 is particularly interesting to me, as I am trying to persuade the management here of the risks they are taking by not sorting out the air conditioning. Is there any documentation, or are there test results, regarding the long-term damage that high temperatures can do to systems?
Bill Hassell
Honored Contributor

Re: critical temperature warning difference ia64 and PA-RISC ?

Matthew writes:

> Your comment on Nov 28 at 16.45 is particularly interesting to me, as I am trying to persuade the management here of the risks they are taking by not sorting out the air conditioning. Is there any documentation, or are there test results, regarding the long-term damage that high temperatures can do to systems?

The documentation is exactly the same as what happens when your building burns down. Is there some sort of documentation stating that items in the building will be damaged? How much?

The situation is self-evident. Once the temperature goes over 95 degrees F (35 C), bad things will happen. They start with mysterious firewall and router issues that are intermittent. Then tape drives start having errors or jamming with tangled tapes. Blade servers start crashing because they already run very hot.

And after you bring the temperature down, the problems continue -- electronic components do not heal themselves; they remain damaged forever. The damage will appear as intermittent crashes and failures that diagnostics seldom locate, because the components are only slightly damaged. And MOST important: some equipment is much better at tolerating high temperatures than others. High-end HP servers have multi-speed and redundant fans to keep things cool, but your network equipment, big disk arrays, or other servers may not have similar protection. So the HP server will protect itself, but that is little comfort when the equipment around it has been damaged.
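If you want to watch for this yourself, a minimal overtemp watcher with hysteresis might look like the sketch below. The readings here are simulated stand-ins for whatever sensor interface your platform exposes, and the 35 C / 32 C thresholds just follow the figure above:

```python
ALARM_C = 35.0   # the "bad things start to happen" threshold above (95 F)
CLEAR_C = 32.0   # lower clear point so the alarm doesn't flap

def watch(readings_c):
    """Yield alarm/clear messages for a stream of inlet temperatures (C).
    In real use, readings_c would come from your platform's sensors."""
    alarming = False
    for t in readings_c:
        if not alarming and t >= ALARM_C:
            alarming = True
            yield f"ALARM: inlet temperature {t:.1f} C >= {ALARM_C} C"
        elif alarming and t <= CLEAR_C:
            alarming = False
            yield f"clear: inlet temperature back to {t:.1f} C"

# Simulated readings hovering around the threshold: one alarm, one clear,
# not a flood of messages as the value wobbles around 35 C.
for msg in watch([33.0, 35.2, 34.8, 35.1, 31.5, 33.0]):
    print(msg)
```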

Ask your management if they would like their paycheck calculated on machines that intermittently drop a digit or change the sign of a few numbers. Or to have their bank give them a statement where just a few checks are missing and a few others from some other accounts have been added.

When will this happen? When the computer room is too warm for humans to inhabit, generally above 95 F as mentioned before. Serious damage can also occur when equipment is placed too close together, so that inlet temperatures are well over 100 F in certain spots. Add up the total cost to replace everything in the computer room. Would it be $100K, a million dollars, maybe more? And even if the insurance company pays for the damaged equipment, how long would the data center be useless -- days? weeks?

Quibbling over 50 thousand dollars to provide reliable cooling and prevent loss of a million dollars' worth of computer equipment is sheer folly.
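Put as rough expected-value arithmetic (the probability figure below is an assumption; the dollar figures are the round numbers from above):

```python
# Rough expected-loss comparison, using the round numbers from the post.
cooling_cost      = 50_000      # one-time spend on reliable cooling
replacement_cost  = 1_000_000   # replace everything in the room (upper figure)
downtime_cost     = 100_000     # assumed: lost business while rebuilding
p_loss_no_cooling = 0.25        # assumed yearly chance of a serious heat event

expected_loss = p_loss_no_cooling * (replacement_cost + downtime_cost)
print(f"Expected yearly loss without cooling: ${expected_loss:,.0f}")
print(f"Cost of fixing the cooling:           ${cooling_cost:,.0f}")
```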


Bill Hassell, sysadmin

Re: critical temperature warning difference ia64 and PA-RISC ?

Bill,

I think you well know that management don't tend to take kindly to sarcasm; they just want the facts, and that's what I'm after really - technical facts.

Such as: having had the temperature at an average of 35 deg. C for a number of weeks, what kind of long-term damage could we expect after the room is back to normal?
E.g. tape drives, hard drives, interfaces, motherboards, CPUs, etc.

If anyone can point me in the direction of some sort of tests and their results, I would be most grateful.
Andrew Young_2
Honored Contributor

Re: critical temperature warning difference ia64 and PA-RISC ?

Hi.

Firstly, Bill: unfortunately the internal temperature of most kit these days runs close to 30C in any case, and that's with decent cooling. Hot spots, however, are already common.

You are right, however, that it's not good to trust the temperature readings. LTO3 drives often report temperatures above the tape's maximum recommended value because, due to a design fault, the temperature sensor is on the opposite side of the unit in an area of restricted airflow.

We did some unintentional testing in our computer room: once the airflow stopped, temperatures rose to 40C (a 19C climb) within 5 minutes. No alarms were triggered, though, because the room temperature sensor was in the outflow duct rather than in the room itself, and since there was no circulation...
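A rate-of-rise check would have caught that even with a badly placed or slow-moving sensor, since it alarms on the climb rather than the absolute reading. A minimal sketch; the one-minute sampling and the 5 C window threshold are assumptions:

```python
from collections import deque

# Alarm if temperature climbs 5 C or more across a 5-sample window --
# the incident described above was a 19 C rise in 5 minutes.
WINDOW_SAMPLES = 5          # one sample per minute
RISE_ALARM_C = 5.0

def rate_of_rise_alarm(samples_c):
    """samples_c: iterable of per-minute readings; yields alarm messages."""
    window = deque(maxlen=WINDOW_SAMPLES)
    for t in samples_c:
        window.append(t)
        if len(window) == WINDOW_SAMPLES and window[-1] - window[0] >= RISE_ALARM_C:
            yield (f"rate-of-rise alarm: {window[0]:.1f} C -> {window[-1]:.1f} C "
                   f"in {WINDOW_SAMPLES - 1} min")

# Example: the kind of climb described in the incident above.
for msg in rate_of_rise_alarm([21, 25, 29, 33, 37, 40]):
    print(msg)
```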

Because of the high humidity in this area, we have had situations where summer temperatures in equipment rooms have risen to in excess of 70C after air conditioners have frozen over. At those temperatures, solder joints on printed circuit boards can start to fail.

Overtemp situations also carry an increased risk of outright failure. We had a high incidence of dead-on-arrival stock, which we ascribed purely to the temperatures inside the delivery vehicles. All equipment is now shipped overnight for arrival during the early morning hours or, if it's local, in the late afternoon.

The biggest heat-related problems are not so much with semiconductor technology, which these days is designed to tolerate higher temperatures over short periods, but with magnetic and mechanical equipment: disk drives develop bad sectors, platters start to warp, magnetic tapes stretch, and bearings seize. Having said that, we do experience higher failure rates on equipment that has been over temperature.
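For Matthew's earlier question about hard numbers: the standard reliability-engineering model for heat-accelerated failure is the Arrhenius equation, which is where the old rule of thumb comes from that failure rates roughly double for every 10C rise. A sketch, assuming a typical activation energy of 0.7 eV (the real value is component-specific):

```python
import math

BOLTZMANN_EV = 8.617e-5   # Boltzmann constant in eV/K

def arrhenius_af(t_use_c, t_stress_c, ea_ev=0.7):
    """Arrhenius acceleration factor: how much faster failures accumulate
    at t_stress_c than at t_use_c. ea_ev (activation energy) is an assumed
    typical value; real components vary."""
    t_use = t_use_c + 273.15
    t_stress = t_stress_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV) * (1.0 / t_use - 1.0 / t_stress))

# Running at 35 C instead of 25 C roughly doubles the failure rate:
print(f"{arrhenius_af(25, 35):.1f}x")   # ~2.4x
```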

We have had a recommendation from a vendor to allow a cooling-off period after high-heat shutdowns before starting the equipment up again. This also applies to relocating equipment. It's not always feasible, but it is something to consider.

HTH

Andrew Y
Si hoc legere scis, nimis eruditionis habes ("If you can read this, you have too much education")