
"Critical Temperature Threshold Exceeded" Before Boot in 20C Environment

 
Occasional Contributor

I am attempting to upgrade some GPUs in a WS460c Expansion Blade and am observing a rather strange error condition: if I install a Tesla T4 card and insert the blade into the chassis, it fails before ever booting, with a "Critical Temperature Threshold Exceeded" error in the IML. It refuses to power on via iLO, OneView, or the physical button, and the red light on the blade itself flashes while the expansion blade stays green. Removing the card and placing the blade back in the enclosure with either no cards in the PCIe slots, or with one or two of the MXM carriers (with GPUs), works fine: it boots as normal with no issues. The issue only occurs when the T4 card is installed.

The strange thing is that this card has no external connectors. I don't know how it would even be communicating temperature to the BIOS, unless something is trying to read a temperature from the card and getting an invalid value back. Maybe something is loading via UEFI? In any event, there are no thermal problems in the environment: the facility keeps the air temperature at a stable 18-20C, and the expansion blade itself reads well below any caution temperature in the OA web UI when I check it while the WS460c blade is showing as failed. In normal operation, all temperatures on the WS460c show well below caution in the OA view as well. It is also strange that the IML does not identify the location or ID of the temperature sensor that is causing the issue, but it has been a long time since I have seen a blade fail with this error. (I think the last one was a G6?)

All components have been updated to the 2020.09 SPP baseline prior to installing the card.

Is there a way to identify what sensor is triggering the fault code?

Or does anyone know if there is some sort of special flash or update I need to do on a T4 card to prevent it from triggering this issue?

Failing that, is there any way to disable temp checking for PCIe cards somehow...?

Thanks for any help with this one!

1 REPLY
HPE Pro

Re: "Critical Temperature Threshold Exceeded" Before Boot in 20C Environment

Hello,

The system board has a sensor that reads errors from all the installed components.

You can check again after replacing the GPU. If the issue still persists, I would suggest logging a case with HPE and sharing the appropriate logs for further analysis, since you have already done the basic troubleshooting.
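
On the question of identifying which sensor is involved, one possible starting point is to compare each temperature reading that iLO reports against its own thresholds. The following is only a minimal sketch, assuming the blade's iLO firmware exposes the standard Redfish Thermal resource (commonly at a path like /redfish/v1/Chassis/1/Thermal, which is an assumption here and not confirmed for this blade). The function name, the 5-degree margin, and the sample data are all made up for illustration; only the property names (Temperatures, Name, ReadingCelsius, UpperThresholdCritical, UpperThresholdFatal) come from the Redfish Thermal schema.

```python
def sensors_near_threshold(thermal, margin_c=5.0):
    """Return (name, reading, limit) for sensors within margin_c of a limit.

    `thermal` is a dict shaped like a Redfish Thermal resource.
    This helper is hypothetical, not part of any HPE tool.
    """
    flagged = []
    for sensor in thermal.get("Temperatures", []):
        reading = sensor.get("ReadingCelsius")
        # Prefer the critical limit, fall back to the fatal limit.
        # Assumption: a limit of 0 or null means "no limit published".
        limit = (sensor.get("UpperThresholdCritical")
                 or sensor.get("UpperThresholdFatal"))
        # Skip sensors that are absent or have no usable limit.
        if reading is None or not limit:
            continue
        if reading >= limit - margin_c:
            flagged.append((sensor.get("Name", "unknown"), reading, limit))
    return flagged


# Invented sample data, for illustration only (not from a real blade).
SAMPLE_THERMAL = {
    "Temperatures": [
        {"Name": "Inlet (example)", "ReadingCelsius": 19,
         "UpperThresholdCritical": 42, "UpperThresholdFatal": 46},
        {"Name": "PCIe zone (example)", "ReadingCelsius": 98,
         "UpperThresholdCritical": 100, "UpperThresholdFatal": 0},
        {"Name": "No limit (example)", "ReadingCelsius": 30,
         "UpperThresholdCritical": 0, "UpperThresholdFatal": 0},
        {"Name": "Absent (example)", "ReadingCelsius": None,
         "UpperThresholdCritical": 0},
    ]
}

for name, reading, limit in sensors_near_threshold(SAMPLE_THERMAL):
    print(f"{name}: {reading}C (limit {limit}C)")
```

In practice the dict would be the JSON body returned by an authenticated GET against the iLO, and a sensor that reports an implausible reading with the T4 installed but not without it would be worth mentioning in the support case.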

Regards,


I am an HPE employee