System Administration

DL585 G1 System Overheating Issue

 
digriz_1
Occasional Visitor

DL585 G1 System Overheating Issue

We recently upgraded a number of DL585 G1 servers (running RHEL 4.7) to PSP 8.30 (hp-health-8.3.0.43-30.x86_64).

Now several of them are reporting system overheating problems, for example:


messages:May 10 17:03:29 xxxxxxxx hpasmd[12590]: WARNING: hpasmd: System Overheating (Zone 5, Location CPU, Temperature 111C)
messages:May 10 17:03:29 xxxxxxxx hpasmd[12590]: CRITICAL: hpasmd: Automatic Operating System Shutdown Initiated Due to Overheat Condition

The temperature seems too high to be real, so I assume it is either a software or a hardware bug.

Does anyone have any ideas?
Will upgrading to PSP 8.40 solve this issue? There is nothing in the 8.40 release notes related to it.
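
For anyone who wants to pull these events out of syslog across many hosts, here is a minimal Python sketch. The regular expression is built only from the two hpasmd lines quoted above, so it is an assumption that other hp-health releases word the message the same way. The function name `parse_overheat` is made up for illustration.

```python
import re

# Pattern derived from the WARNING line quoted in this post.
# Assumption: other hp-health releases use the same wording.
OVERHEAT_RE = re.compile(
    r"hpasmd\[\d+\]: WARNING: hpasmd: System Overheating "
    r"\(Zone (?P<zone>\d+), Location (?P<location>[^,]+), "
    r"Temperature (?P<temp_c>\d+)C\)"
)


def parse_overheat(line):
    """Return zone, location and temperature from one log line, or None."""
    match = OVERHEAT_RE.search(line)
    if match is None:
        return None
    return {
        "zone": int(match.group("zone")),
        "location": match.group("location").strip(),
        "temp_c": int(match.group("temp_c")),
    }


# The WARNING line from the post, copied as-is (hostname already masked).
SAMPLE = (
    "messages:May 10 17:03:29 xxxxxxxx hpasmd[12590]: WARNING: hpasmd: "
    "System Overheating (Zone 5, Location CPU, Temperature 111C)"
)

print(parse_overheat(SAMPLE))
```

Running it over the output of a grep for hpasmd on each server would give a per-host list of zones and reported temperatures to compare.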
15 REPLIES
Michal Kapalka (mikap)
Honored Contributor

Re: DL585 G1 System Overheating Issue

Hi,

If the same issue appears on all the servers, it could be a bug in the PSP layer, but if it is only on one machine, I would recommend running a hardware health check.

Upgrading to a newer version of PSP might help; sometimes not all bug fixes are mentioned in the release notes.

mikap
digriz_1
Occasional Visitor

Re: DL585 G1 System Overheating Issue

We have 64 DL585 G1 servers. 46 have been upgraded to PSP 8.30, and 6 of those have had this overheating problem, across 2 different data centers. We never had an overheating problem before the upgrade.
SERKAN AKÇIN
Occasional Contributor

Re: DL585 G1 System Overheating Issue

Hi,

We have 4 DL585 servers.
2 of them run RHEL 4.7 and have no problem.
But
2 of them run RHEL 5.4 and are giving temperature errors.

We did not have temperature problems before the upgrade.

HP says we must replace the mainboard, but I think the problem started after the firmware and PSP updates.

Did you find a solution?
Steven E. Protter
Exalted Contributor

Re: DL585 G1 System Overheating Issue

Shalom,

Use the web based PSP interface ( http://hostname:2301 ) to check actual temp.

These servers are pretty old, and you may have poor airflow or bad fans. All of these issues can cause overheating.

Clearly the software is more sensitive.

If this were a very severe problem, the system would have failed long ago. Still, run through the checklist, give them the eyeball check, make sure the fan openings are not covered with dust accumulation.

SEP
Gerardo Arceri
Trusted Contributor

Re: DL585 G1 System Overheating Issue

Please upgrade the firmware on the servers. We hit a similar bug a couple of years ago and a firmware update fixed it.
IF YOU ARE SURE THAT THE SYSTEM IS NOT OVERHEATING, use RBSU (Bios Setup) to disable Thermal Shutdown.
kiheiman
Occasional Visitor

Re: DL585 G1 System Overheating Issue

We are seeing CPU over-heating events on a number of DL585 G1 Linux servers. The problem started about 6 months ago and seems to come in spurts. After the server powers down, it continues to generate over-heating events in the IML, but the CPU heat sinks are not hot. The fix is to pull both AC power cords and boot the server back up. We have tried replacing motherboards, moving CPU modules around, etc. The problem appears to happen with both iLO driver versions 8.40 and 8.50 (not HP supported). It is also happening with Linux versions 4.7 and 4.8. A number of the servers are on iLO firmware version 1.8x, but I do not see anything in the fix information suggesting that older versions would cause an over-heating event.
Alzhy
Honored Contributor

Re: DL585 G1 System Overheating Issue

Update your firmware.
Use hpsum to check and update the various firmware on your G1.
Hakuna Matata.
kiheiman
Occasional Visitor

Re: DL585 G1 System Overheating Issue

Here is some more info on our situation.

Servers with the latest version of the motherboard firmware and iLO firmware are crashing. Servers with old versions of iLO firmware are not crashing. Servers running RHEL 4.7 are not crashing. The server crashes did not start until we upgraded some servers to RHEL 4.8. We are not running a PSP other than the 8.40 iLO driver. The smoking gun, based upon forum comments, seems to point to any server running a RHEL release newer than 4.7.
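
One way to test which factor actually lines up with the crashes is to tabulate crash counts per OS release and per firmware level. Below is a minimal Python sketch of that idea. The inventory records are made up purely for illustration and are not data from this thread; the field names (`rhel`, `ilo_fw`, `crashed`) and the function `crash_counts_by` are also hypothetical.

```python
from collections import defaultdict


def crash_counts_by(records, field):
    """Return {field value: (crashed, total)} for a list of server records."""
    counts = defaultdict(lambda: [0, 0])
    for rec in records:
        bucket = counts[rec[field]]
        bucket[1] += 1          # total servers with this value
        if rec["crashed"]:
            bucket[0] += 1      # servers with this value that crashed
    return {value: tuple(pair) for value, pair in counts.items()}


# Made-up inventory for illustration only; not data from this thread.
inventory = [
    {"host": "srv01", "rhel": "4.7", "ilo_fw": "old", "crashed": False},
    {"host": "srv02", "rhel": "4.8", "ilo_fw": "new", "crashed": True},
    {"host": "srv03", "rhel": "4.8", "ilo_fw": "old", "crashed": False},
    {"host": "srv04", "rhel": "4.7", "ilo_fw": "new", "crashed": False},
]

for field in ("rhel", "ilo_fw"):
    print(field, crash_counts_by(inventory, field))
```

With a real inventory, a factor where nearly all the crashes fall into one bucket is the more likely suspect; if the OS release and the firmware level always change together, this table cannot separate them.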
SERKAN AKÇIN
Occasional Contributor

Re: DL585 G1 System Overheating Issue

We replaced the heatsinks on all 4 processors and the problem is gone.

HP says that G1 servers produced in 2004 and 2005 have a heatsink metal alloy problem. We replaced them and the crashes are gone.

digriz_1
Occasional Visitor

Re: DL585 G1 System Overheating Issue

As I started this thread, I thought I would give an update.

We have tried various combinations of PSP and BIOS/iLO firmware upgrades with no success. The servers still overheat, crash, and refuse to boot until they have cooled down, although physical inspection shows no heat issues.

What we have done is "Disable Thermal Shutdown" in Bios Setup and this has stopped the crashes without issue
kiheiman
Occasional Visitor

Re: DL585 G1 System Overheating Issue

On a couple of servers, we have disabled the high temperature shutdown in the RBSU and that has stopped the servers from crashing. On those servers, we now see a high temperature alert in the IML and the server continues to work. I guess the temporary fix is to disable the temperature shutdown option in all of our 585G1 servers, but it would be nice to know why this is happening. HP is going to analyze our sosreport to see if they can find anything.
Alzhy
Honored Contributor

Re: DL585 G1 System Overheating Issue

"What we have done is "Disable Thermal Shutdown" in Bios Setup and this has stopped the crashes without issue"


VERY dangerous, amigo... It could fry your server.

Press HP on what the issue is, and whether it is a bad hp-health release or a firmware that needs an update to match your new PSP bits.

Hakuna Matata.
Dave.
Valued Contributor

Re: DL585 G1 System Overheating Issue

Did you ever get this resolved? Did you upgrade to PSP 8.40 or later?

I found this update to the hp-health driver and wondered if it applies:

"Fixed a problem where if ambient temperature is very low, hpasmlited can incorrectly determine the system is over a temperature threshold and improperly shut the system down."
Link for RHEL 5 x86:
http://h20000.www2.hp.com/bizsupport/TechSupport/SoftwareDescription.jsp?lang=en&cc=us&prodTypeId=15351&prodSeriesId=398220&swItem=MTX-2f2a4208241845c485f7a847e6&prodNameId=3288126&swEnvOID=4006&swLang=8&taskId=135&mode=4&idx=2
Please let us know the outcome, Dave.
Regards, Dave
Dave.
Valued Contributor

Re: DL585 G1 System Overheating Issue

Please ignore my question on how you got it resolved. I missed the answer already posted.

But I would be interested if the hp-health driver update fix I posted has been used by anyone successfully or not.
Regards, Dave
kiheiman2
Occasional Visitor

Re: DL585 G1 System Overheating Issue

An update on our current high temp alerts. We had one server that was generating alerts about every day (thermal shutdown had been disabled on it). We started the process for upgrading to the 8.40 version of the System Health Agent (current version was 8.50). By the time we did the upgrade, the alerts had been gone for one month. Very strange. The previously referenced HP advisory about "the alerts being generated because the server is too cold" is most interesting. We had been looking to see if any server cabinet was getting too hot.