DL380 G8 critical fault

 
vesil400
Occasional Advisor

Hello

A ProLiant DL380 G8 had powered off mysteriously once or twice at about 00:00 at night. It booted back up all right just by pressing the power button. Now it has happened again and the server will not start anymore. It does try for a fraction of a second but instantly cuts power, and the system LED flashes red, which means a critical system fault, right?

iLO is the only thing that is working. I cannot switch to the backup BIOS because the boot process has to begin before the switch is possible. I tried removing everything from the board, but it still only powers on for 1/4 sec.

I downloaded the IML log from iLO and have a code that needs decoding (see below, error 98). It is also interesting that the system time apparently got reset to 1/1/1970 on the last "98" error after the final crash.

I have tried removing the CMOS battery, shorting the pins, setting jumper no. 6 to on, and removing every component from the mainboard, but it still shuts down in 1/4 sec.

Currently I have the mainboard loose in my hands and am inspecting it visually. I don't see anything suspicious.

"ID","Severity","Class","Last Update","Initial Update","Count","Description",
"98","Critical","Power","01/01/1970 00:01","12/28/2020 21:54","12","System Power Fault Detected (XR: 14 00 MID: FF 0D FC 0E C0 FF FF 2F 2F 0C 0C 00 9C 20 00 01 03 15 00 00 00 00 00 00 00 00 00 00 00 00 00 00)",
"97","Repaired","Power","11/25/2020 23:54","11/25/2020 23:53","1","System Power Supplies Not Redundant",
"96","Repaired","Power","11/25/2020 23:54","11/25/2020 23:53","1","System Power Supply: Input Power Loss or Unplugged Power Cord, Verify Power Supply Input (Power Supply 2)",
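
A minimal Python sketch for reading an export like the one above and flagging entries whose "Last Update" is earlier than their "Initial Update", which is what the 01/01/1970 timestamp on entry 98 looks like. The MM/DD/YYYY time format is inferred from the sample rows, and the function names are made up for illustration; this is not an HPE tool.

```python
import csv
import io
from datetime import datetime

# The three IML rows quoted in the post, copied verbatim.
IML_EXPORT = '''\
"ID","Severity","Class","Last Update","Initial Update","Count","Description",
"98","Critical","Power","01/01/1970 00:01","12/28/2020 21:54","12","System Power Fault Detected (XR: 14 00 MID: FF 0D FC 0E C0 FF FF 2F 2F 0C 0C 00 9C 20 00 01 03 15 00 00 00 00 00 00 00 00 00 00 00 00 00 00)",
"97","Repaired","Power","11/25/2020 23:54","11/25/2020 23:53","1","System Power Supplies Not Redundant",
"96","Repaired","Power","11/25/2020 23:54","11/25/2020 23:53","1","System Power Supply: Input Power Loss or Unplugged Power Cord, Verify Power Supply Input (Power Supply 2)",
'''

# Assumption: month/day/year, inferred from "12/28/2020" in the sample.
TIME_FORMAT = "%m/%d/%Y %H:%M"


def parse_iml(text):
    """Parse the CSV export into a list of dicts with typed fields."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row.pop("", None)  # the trailing comma on each line adds an empty column
        row["Last Update"] = datetime.strptime(row["Last Update"], TIME_FORMAT)
        row["Initial Update"] = datetime.strptime(row["Initial Update"], TIME_FORMAT)
        row["Count"] = int(row["Count"])
        rows.append(row)
    return rows


def clock_reset_entries(rows):
    """Return entries last updated 'before' they were first logged."""
    return [r for r in rows if r["Last Update"] < r["Initial Update"]]


for entry in clock_reset_entries(parse_iml(IML_EXPORT)):
    print(entry["ID"], entry["Severity"], entry["Last Update"], entry["Count"])
```

An entry flagged this way only shows that the clock was not valid when the event was last recorded; it says nothing about the cause of the fault itself.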

vesil400
Occasional Advisor

Re: DL380 G8 critical fault

Situation update:

Had the motherboard out for inspection, and all looked well. The power supply still cut power immediately on startup.

I took another 12V power supply, wired it to the incoming PSU rails, and force-fed the board 12V for a couple of minutes. After this it started staying on by itself. I took the feed wires away, and the board now seems to have fixed itself, at least for now. The server is up and running; I will see if it holds.

Since nothing really seems to have been wrong or broken, I'm confused as to why this happened. It would still be nice to get that error code decoded, for future reference.

vesil400
Occasional Advisor

Re: DL380 G8 critical fault

The server was up for maybe 2 days but has now shut down again with the error:

System Power Fault Detected (XR: 14 00 MID: FF 0D FC 0E C0 FF FF 2F 2F 0C 0C 00 9C 20 00 01 03 15 00 00 00 00 00 00 00 00 00 00 00 00 00 00)

What is this?
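
The following Python sketch does not decode the error. It only splits the XR and MID fields into raw bytes so that repeated occurrences can be compared byte for byte, for example to confirm that the later shutdown logged the same payload as IML entry 98. The function name and the regular expression are illustrative assumptions, and the meaning of the individual bytes is not known from this thread.

```python
import re

# Description text as it appears in IML entry 98 and in the later shutdown.
IML_ENTRY = "System Power Fault Detected (XR: 14 00 MID: FF 0D FC 0E C0 FF FF 2F 2F 0C 0C 00 9C 20 00 01 03 15 00 00 00 00 00 00 00 00 00 00 00 00 00 00)"
LATER_SHUTDOWN = "System Power Fault Detected (XR: 14 00 MID: FF 0D FC 0E C0 FF FF 2F 2F 0C 0C 00 9C 20 00 01 03 15 00 00 00 00 00 00 00 00 00 00 00 00 00 00)"

# Two runs of space-separated hex pairs, labelled XR and MID.
FAULT_PATTERN = re.compile(
    r"XR:\s*((?:[0-9A-F]{2}\s+)+)MID:\s*((?:[0-9A-F]{2}\s*)+)\)"
)


def parse_power_fault(description):
    """Return (xr_bytes, mid_bytes), or None if the fields are missing."""
    match = FAULT_PATTERN.search(description)
    if match is None:
        return None
    # bytes.fromhex skips the spaces between the hex pairs.
    return bytes.fromhex(match.group(1)), bytes.fromhex(match.group(2))


first = parse_power_fault(IML_ENTRY)
second = parse_power_fault(LATER_SHUTDOWN)
print("same payload:", first == second)
print("XR bytes:", first[0].hex(" "))
```

If a later fault carries different bytes, the differing positions are a useful detail to include in a support case.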

vesil400
Occasional Advisor

Re: DL380 G8 critical fault

I restarted the server, and now system fans 5 and 6 are running at 78% and 97% speed; they were at 27% before.

The 20-VR P1 Mem sensor now reports 96 degrees C, a massive increase since the last boot and probably a faulty reading.

Mem VR sensors 19, 21 and 22 are at 23C. So this error might have something to do with the 20-VR P1 Mem sensor or its voltage regulator?
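
The comparison above can be sketched in Python: flag any sensor that disagrees with the median of its peers by more than a threshold. The readings are the ones quoted in this post; the 30 degree threshold and the function name are arbitrary assumptions for illustration, not HPE limits.

```python
from statistics import median

# Mem VR sensor readings from the post, in degrees C, keyed by sensor number.
VR_MEM_READINGS = {19: 23.0, 20: 96.0, 21: 23.0, 22: 23.0}


def implausible_sensors(readings, max_delta_c=30.0):
    """Return {sensor: delta} for sensors far from the median of the others.

    max_delta_c is an assumed threshold, chosen only for this example.
    """
    flagged = {}
    for sensor_id, value in readings.items():
        peers = [v for k, v in readings.items() if k != sensor_id]
        delta = value - median(peers)
        if abs(delta) > max_delta_c:
            flagged[sensor_id] = delta
    return flagged


print(implausible_sensors(VR_MEM_READINGS))
```

A flagged sensor only disagrees with its neighbours. Whether the reading is false or that regulator really is hot has to be checked on the hardware.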

Please, does anyone know what the error code means?

 

Shrey27
HPE Pro

Re: DL380 G8 critical fault

Hello Vesil400,

 

The error code mentioned doesn't necessarily correspond to a PSU failure; the error points towards an issue with the system board of the server.

 

However, my suggestion is to bring the server down to a minimum configuration, with one processor, one memory module and no PCI cards connected, and then check the server status.

 

In case the server continues to work in minimum configuration then start adding the components back to the server and check if any particular component causes the failure.

 

Also, please ensure that the server is updated with the latest Service Pack for ProLiant and that the firmware, including BIOS, iLO, etc., is all on the latest versions.

 

Are there any non-HPE parts being used in the server? Please confirm that as well.

 

Thanks


I work for HPE

vesil400
Occasional Advisor

Re: DL380 G8 critical fault

Hello

Yes, I have tried the minimum configuration. It didn't work either, with one CPU, one RAM stick and no risers.

Currently the server starts and then fails at random intervals, maybe instantly or after 1 day.

Is there any info on what the 20-VR P1 Mem temp sensor looks like and where it is located? It is obviously giving a false reading.

Shrey27
HPE Pro

Re: DL380 G8 critical fault

Hello Vesil400,

The sensors would normally be part of the system board.

Based on the error message you have been receiving and since the server is not stable, this looks like an issue on the system board.

I would suggest opening a case with technical support and providing the hardware logs for detailed analysis, and for part replacement within warranty if need be.

Thanks


I work for HPE

PZel
Trusted Contributor

Re: DL380 G8 critical fault

It can also be a problem with a defective SID (System Insight Display, the display that tells you what is wrong).

Please disconnect the small flat cable to the SID. If the server then keeps running (it works without the SID), the SID is probably defective. It can be replaced with SPN 662515-001 (DL380 with 3.5" disks) or 662516-001 (DL380 with SFF/2.5" disks).

PZ
vesil400
Occasional Advisor

Re: DL380 G8 critical fault

[Attached image: dl.png]

vesil400
Occasional Advisor

Re: DL380 G8 critical fault

The server has been running for 2 days without a crash now. It is probably random, but I'm happy it's working, at least for a while.

Lacking other obvious problems, what about the VR P1 Mem temp sensor? Is it capable of shutting down the system without logging any error for doing so? The sensor appears to be defective: it sometimes reads about 60-70 degrees and is currently reading 90+.

All the heatsinks are cold to the touch, and the ambient temperature is actually under 10C.