
Re: DL380 g6 shutdown from overheating (Zone 19, Location CPU)

 
oakshade
Occasional Advisor

DL380 g6 shutdown from overheating (Zone 19, Location CPU)

I hope this is posted in the right place.

This DL380 G6 server is equipped with dual X5650 CPUs, 64 GB RAM, a 4-disk RAID 10 array, and dual 750 W power supplies.

The machine just shut down unexpectedly. ILO2 reports:

System Overheating (Zone 19, Location CPU, Temperature 73C)
Informational.
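
For anyone who wants to track these events outside the iLO web page, here is a minimal Python sketch that pulls the zone, location, and temperature out of a message shaped like the one quoted above. The pattern is based only on that single line, so other iLO 2 firmware versions may word it differently, and the name `parse_overheat_event` is made up for illustration.

```python
import re

# Pattern modeled on the one event line quoted in this post.
# Assumption: other firmware versions may format the message differently.
EVENT_RE = re.compile(
    r"System Overheating \(Zone (?P<zone>\d+), "
    r"Location (?P<location>[^,]+), "
    r"Temperature (?P<temp_c>\d+)C\)"
)


def parse_overheat_event(line):
    """Return zone, location and temperature from an overheating message,
    or None if the line does not match the assumed format."""
    match = EVENT_RE.search(line)
    if match is None:
        return None
    return {
        "zone": int(match.group("zone")),
        "location": match.group("location").strip(),
        "temp_c": int(match.group("temp_c")),
    }


if __name__ == "__main__":
    print(parse_overheat_event(
        "System Overheating (Zone 19, Location CPU, Temperature 73C)"
    ))
```

Lines that do not match, such as the later "Power loss due to overheating" message, simply return `None`.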

The machine attempted to start back up but kept starting and then shutting down within a second or two. After I let it sit a while, it booted and ran for 12 or so hours before repeating the same routine. This time it will not boot, and ILO2 reports:

Power loss due to overheating.  Attempting to restore power.

...then it immediately shuts back down.  The health light is amber indicating a temperature caution.

If it has been sitting for an hour with the power off and still reports overheating when I try to boot cold, then it must be a faulty reading. The machine is cold.

I noticed that when it booted and ran OK after the first failure, temp 19 was around 30C. Just before the second failure, I heard the fans start to spin up (75%) and knew it was going to happen again, so I recorded temp 19 at around 63C and climbing; then it shut down.

I suspect a faulty temp sensor but have no way of determining which component is going bad. Could this be in the CPU, and if so, which one? Is zone 19 on the MB, or where? I have found no documentation to tell me.

Has anyone else seen this failure, and what did you do to fix it?

Any help would be appreciated!

 

9 REPLIES
Erdogan Temur
HPE Pro

Re: DL380 g6 shutdown from overheating (Zone 19, Location CPU)

Hi,

Please install the latest BIOS and iLO firmware versions.

http://h20566.www2.hpe.com/hpsc/doc/public/display?sp4ts.oid=3884082&docId=emr_na-c03659941&lang=de-ch&cc=ch

SUM Link.

https://downloads.hpe.com/pub/softlib2/software1/cd/p1040529012/v71197/firmware-10.10-0.zip

BR.

 

Kind Regards,
Erdogan.
No support by private messages. Please ask the forum!


oakshade
Occasional Advisor

Re: DL380 g6 shutdown from overheating (Zone 19, Location CPU)

@Erdogan Temur

Thanks for the reply. It sounds like this could be on the path to solving my problem; however, at this point I am not able to boot the machine to update the system ROM. It only starts for a few moments when the switch is pushed, then shuts down immediately. It seems that whatever false condition (the temp zone 19 reading) created the problem has become a fixed state.

I suspect the RBSU Thermal shutdown setting is set to enabled, so the unit just shuts down. Quite a pickle of a situation.

I believe the state is a false reading because the machine could not read an actual temp of 70C (shutdown threshold) for temp zone 19 when it is cold.
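
To spell out that reasoning, here is a rough Python sketch of the sanity check I am applying in my head. Only the 70C shutdown threshold comes from this thread. The ambient temperature, the margin, the one-hour cool-down, and the function names are all assumptions picked for illustration, not anything the server or iLO actually exposes.

```python
# Only the 70C threshold comes from this thread; the rest are assumed values.
SHUTDOWN_THRESHOLD_C = 70
ASSUMED_AMBIENT_C = 25   # assumed room temperature, not measured
ASSUMED_MARGIN_C = 15    # assumed tolerance above ambient for a cold machine


def would_trip_shutdown(reading_c, threshold_c=SHUTDOWN_THRESHOLD_C):
    """True if the reading is at or above the shutdown threshold."""
    return reading_c >= threshold_c


def looks_like_false_reading(reading_c, hours_powered_off,
                             ambient_c=ASSUMED_AMBIENT_C,
                             margin_c=ASSUMED_MARGIN_C,
                             min_cooldown_hours=1.0):
    """Flag a reading as implausible when the machine has been powered off
    long enough to cool, yet the sensor still reports well above ambient."""
    if hours_powered_off < min_cooldown_hours:
        # Not enough cool-down time to judge either way.
        return False
    return reading_c > ambient_c + margin_c


if __name__ == "__main__":
    # 73C reported at a cold start after two hours without power.
    print(would_trip_shutdown(73), looks_like_false_reading(73, 2))
```

With these assumed numbers, a 73C reading after a couple of hours without power is flagged as implausible, while the 30C reading seen during normal running is not.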

Do you have any suggestions to get me past this problem: either to actually get the machine to boot, or to disable the RBSU Thermal shutdown?

I thought about removing one of the CPUs, thinking it might be a false metric produced by the CPU, since the location was identified as CPU, but I'm not sure which one zone 19 is.

Thanks for your help

Erdogan Temur
HPE Pro

Re: DL380 g6 shutdown from overheating (Zone 19, Location CPU)

Hi,

Can you create an hpsreport with the following tool and attach it?

I can then analyze the report and give you the right answer.

http://hpsreports.glb.itcs.hpe.com/HPSreports/

BR.

 


oakshade
Occasional Advisor

Re: DL380 g6 shutdown from overheating (Zone 19, Location CPU)

@Erdogan Temur

Sorry, I should have mentioned that this system is currently "running" ESXi 5.5, and it will not start. So unless I can run the tool from a Windows machine on my network, no, I cannot create an hpsreport.

Also, I found my system ROM is reported by ILO2 to be:

P62 07/02/2013, which I believe is 2013.07.02  - cp021344.scexe

with backup system ROM:  05/05/2011
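
In case it helps anyone matching ROM dates against firmware package versions, here is a small Python sketch of the conversion I did by hand. It assumes ILO2 shows the date as MM/DD/YYYY, which is my reading of it and not something I have confirmed in documentation. The function name `rom_date_to_version` is just for illustration.

```python
from datetime import datetime


def rom_date_to_version(date_text):
    """Convert a ROM date like '07/02/2013' to the dotted form '2013.07.02'.

    Assumption: the input is MM/DD/YYYY, as ILO2 appears to display it.
    """
    parsed = datetime.strptime(date_text, "%m/%d/%Y")
    return parsed.strftime("%Y.%m.%d")


if __name__ == "__main__":
    print(rom_date_to_version("07/02/2013"))  # current system ROM
    print(rom_date_to_version("05/05/2011"))  # backup system ROM
```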

Is there a way to revert the machine to the backup ROM to see if it will boot from it?

Keep in mind that I can currently only use ILO2 for any type of software diagnostics or configuration, which I know is very limited, so my options seem to be manually moving parts in and out or something else physical to get beyond where it is stuck now. It will not boot.

Again, thanks for the help

 

 

oakshade
Occasional Advisor

Re: DL380 g6 shutdown from overheating (Zone 19, Location CPU)

Just to clarify what is happening to this server for the observant troubleshooters out there who may have run across the same or a similar failure: this DL380 G6 will not boot, but shows both a red health light (blinking) and an amber overtemp light (solid) on the front. This would seem to indicate an overtemp condition, but the machine is cold, so it has to be a false signal.

From a dead cold start, where the machine has been sitting without power for a few hours, when I push the power button you can hear the fans start to spin up for about a second, then it shuts back down and you can observe the lights described above. It tries that cycle about 3 times, then stops.

ILO2 reports nothing.

This machine runs ESXi 5.5 with a variety of VMs for the company system. It ran almost continuously for 3 years, was rarely shut down, and had no problems; then boom, in a matter of 24 hours, system hosed.

I am almost at the stage where I believe that this may be a hardware failure of the motherboard and/or one of the CPUs, and I may have to just bite it and replace one or all. That would not be a good option, as I am not sure what other can of worms I would be opening.

I do not know what happens to my existing 4-disk RAID 10 array when I replace the existing MB and RBSU/BIOS. Will the P410 recognize it as an existing array, or will I lose all my data and have to reconstruct? What are my options?

In my personal experience, motherboards rarely fail from old age unless there is a common design flaw, and then it is usually known and documented, but I suppose it happens.

Anyone with a suggestion, please feel free to give me an opinion or advice. It would be most appreciated.

Thanks all

Erdogan Temur
HPE Pro

Re: DL380 g6 shutdown from overheating (Zone 19, Location CPU)

Hi,

Did you renew the processor thermal grease?

It is very important that the iLO and BIOS firmware are up to date to solve this problem. Did you perform the SUM update?

Can you upload the vm-support log?

BR.

 


oakshade
Occasional Advisor

Re: DL380 g6 shutdown from overheating (Zone 19, Location CPU)

@Erdogan Temur

Again, thank you for your interest and help.

Checking and renewing the thermal paste was one of the first things that I did. I did not think it would change anything, and I was right; however, at least I checked that off my troubleshooting list early on.

As for updating the ILO, I installed the latest ILO2 firmware (2.29) about a month ago. I would love to update the system ROM as you suggest, but I cannot get past the fact that the machine will not boot in order to update the ROM. I feel like I am missing something in our discussion. You seem to believe that I can update the ROM, so please tell me how to do it without booting.

Maybe you are saying: "Do whatever you have to in order to bring the machine up, then update the ILO and system ROM to prevent the problem from occurring again." That would make sense of this conversation. Please tell me if that is not correct.

At this point I have a refurbed replacement motherboard on order to replace it and hope that it will boot after that.

I know these machines do not have a great deal of value at this point, so why should I sweat the details....? but I have lived with this beast for a few years now, and it is the devil that I know.

BTW, thank you for the link to the 10.10.0 ROM update. However, it does not appear to contain cp021344.scexe (2013.07.02), which would be needed just to bring it up to what is already installed in the machine.

Erdogan Temur
HPE Pro

Re: DL380 g6 shutdown from overheating (Zone 19, Location CPU)

Hi @oakshade

Perform the following steps as well. If the problem persists, send the device in for repair.

Step 1
1. Shut down the server and disconnect the hard drives.
2. Power off the server and reset the power supplies.
3. Locate the system maintenance switch on the system board and set the following switch configuration:
S1 = ON
S2 = OFF
S6 = ON *
NOTE: * When system maintenance switch position 6 is set to ON, the system is prepared to erase all system configuration settings from both CMOS and NVRAM.

Step 2
1. Install the SPP 2014.06 update.
2. Power off the server and remove the power supplies.
3. Wait 5 minutes.
4. Plug in the power supplies and start the server.

BR.


oakshade
Occasional Advisor

Re: DL380 g6 shutdown from overheating (Zone 19, Location CPU)

@Erdogan Temur

The problem persists. I set the switches and followed your procedure precisely as you suggested, 3 times just to be sure, but alas, nothing has changed. Then I tried every combination of power on/off, switch settings, and so on, but it will not move beyond the 1-second fan spin, red flashing health light, and amber overtemp light before it shuts back down. I even tried setting the number 5 switch on, which is supposed to boot from the backup ROM, but it is the same story.

The interesting thing is that the ILO2 detected a change in switch 1 and reported the security to be off, but registered no detection of a change in either 6 or 5.  That does not inspire confidence.

I'm afraid I will have to replace parts (MB) until I have some movement.  I may need some advice there.

BTW, when I do get it to boot, how do I get the SPP 2014.06 you suggested?

Thanks for the help Erdogan