ProLiant Servers (ML,DL,SL)

Re: ML350 G4 will not boot, health light and CPU1 ERR, VRM1 FAILURE

 
SduvallBG
Occasional Contributor

ML350 G4 will not boot, health light and CPU1 ERR, VRM1 FAILURE

This December I decided to add a second processor and more RAM to our older ProLiant ML350 G4 server before installing WS2008 x64 on it.

I powered it down, took it out of the server closet, and moved it to my office. I then proceeded to install the 2nd VRM and CPU in it. At that point I found out my new CPU had been shipped with the wrong heatsink, so I locked everything back down and pressed the power button, and the server said my CPU and VRM had failed. I didn't even touch them! I then removed the installed CPU, put in the new CPU I had gotten for slot 2, plugged in the new VRM, and put the existing heatsink on top of the new CPU. I powered up and it said this new CPU and VRM were bad too.

Luckily that night I was able to load our server up on a VM and continue processing but now I'm looking at this server trying to figure out what is wrong.

Currently the server does not POST. The VRM1 and CPU1 lights are on and the server health light is red. This server is not under warranty and it is a single-core server, but I would still like to use it if possible. I can't imagine two CPUs and VRMs going bad... something else has to be wrong. Any ideas?
gregersenj
Honored Contributor

Always:

A computer must have an operational primary CPU.

So CPU 1 (0) and its VRM must be OK.

Speaking ProLiant:
If a ProLiant detects a CPU as defective, it will ASR and disable the defective CPU.

When you replace the defective CPU, you must manually mark it as repaired in the IML, and reboot the server to activate the new CPU.

In your case:
The server has deactivated CPU 1 (0).
So you need to mark it repaired in the IML, but you can't do that, because it is the primary CPU that has failed.

Solution:
Put a known-good CPU and VRM in socket 1 (0).
Clear the CMOS using the switch on the system I/O board.
Check inside the lid to see which switch to use.
I believe it's SW6, but better check.

Always:
The first thing to do before adding or upgrading a CPU:
You must upgrade the BIOS.
If you have an old BIOS, it might not have the microcode for the new CPU.

BR
/jag

SduvallBG
Occasional Contributor

Before I tried installing the new CPU I did upgrade to the latest BIOS revision.

I then started to install the new CPU. I simply lifted the locking mechanism, installed the CPU, and started to put the heatsink in when I realized it was the wrong size. I then removed the new CPU and tried to reboot, and it threw the CPU fail code on the old CPU...

I then removed the old CPU, put in the new one and used the existing heatsink on top of it (and applied thermal grease), and it said that one was bad too.

I understand components can go bad... but this is just unusual.

This server has been running for about 5 years in the closet with fewer than 10 reboots... perhaps powering it down and moving it damaged something.

But why is the new CPU I just bought throwing the CPU fail code too?
gregersenj
Honored Contributor

Yes.

If CPU 1 (0) has been marked bad, it doesn't matter how many new CPUs you put in: it will remain "bad" and disabled until you mark it as repaired!

The server does not figure out by itself that you have replaced the CPU. You must tell it, by marking the failed CPU as repaired.

So if CPU 1 or VRM 1 has gone bad, you must replace the faulty part and mark it as repaired.
And the only way to do so is to use the switch.
This is an issue with the primary CPU only.
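
To make the point concrete, here is a minimal toy model in Python of the behaviour described above. It is only a sketch: the names (`ToyServer`, `replace_part`, `clear_nvram`) are made up for illustration and do not correspond to any real HP tool or firmware interface. It simply assumes that the "failed" flag is sticky and that swapping the part does not reset it.

```python
from dataclasses import dataclass


@dataclass
class ToyServer:
    """Toy model of socket 1 only; not a real firmware interface."""
    part_ok: bool = True          # is the CPU/VRM physically fitted in socket 1 healthy?
    marked_failed: bool = False   # sticky flag recorded when a failure is detected

    def power_on(self) -> bool:
        """Return True if the server would get through POST."""
        if not self.part_ok:
            self.marked_failed = True   # failure is recorded and the CPU disabled
        # The primary CPU must be usable, otherwise nothing can run POST.
        return not self.marked_failed

    def replace_part(self) -> None:
        """Fit a known-good CPU and VRM. The sticky flag is NOT touched."""
        self.part_ok = True

    def clear_nvram(self) -> None:
        """Stand-in for the maintenance-switch reset discussed in this thread."""
        self.marked_failed = False


server = ToyServer(part_ok=False)
first_boot = server.power_on()    # primary CPU fails and gets flagged
server.replace_part()
after_swap = server.power_on()    # good part fitted, but the flag is still set
server.clear_nvram()
after_clear = server.power_on()   # flag cleared and part is good
print(first_boot, after_swap, after_clear)
```

In this model a good replacement part alone is not enough: the second power-on still fails, and only the reset lets the third one succeed.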

BR
/jag

SduvallBG
Occasional Contributor

I've already set switch 6 on the system maintenance switch to on... it doesn't seem to work. I even tried it again.

I flip the switch to on, plug in the new CPU and VRM, and power up the server. CPU FAIL and VRM FAIL are lit up, NB ALERT flashes once, it sounds like the server power cycles, and then NB ALERT flashes again... over and over.

I have a hard time believing one or both of these CPUs and VRMs are bad. I think something else is wrong with this... but it seems like I can't clear the NVRAM/CMOS...
gregersenj
Honored Contributor

To clear the CMOS:
Set SW6 to on and power on for 10 seconds.
Then it's cleared.
Power off and put SW6 back to its default position.
Power on to check.
Have you checked the lid to confirm that it's SW6?

I totally agree with you that two defective sets is very rare.

Did you clear the CMOS with both sets?
Install set 1, clear the CMOS.
Test.
Install set 2, clear the CMOS.
Test.

Are you sure you are using the correct CPU + VRM? Are the part numbers the same?

If you have done the above,
then you do have a problem.
Possibly:
System I/O board
A bad original set and an incompatible set.

Troubleshooting would be easier if you had a full known-good set and then tested all components one by one.
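
The reasoning behind these swap tests can be written down as a small Python sketch. This is a rough heuristic under stated assumptions, not a diagnostic tool: `Trial` and `interpret` are hypothetical names, the sample results are made up, and it assumes a test only counts if the CMOS was cleared first.

```python
from typing import NamedTuple


class Trial(NamedTuple):
    cpu_vrm_set: str      # label for the CPU+VRM pair fitted in socket 1
    nvram_cleared: bool   # was the CMOS/NVRAM cleared before this test?
    booted: bool          # did the server get through POST?


def interpret(trials):
    """Turn a list of swap-test results into a rough verdict (heuristic only)."""
    if any(t.booted for t in trials):
        # At least one set works, so the board and power are probably fine.
        good = {t.cpu_vrm_set for t in trials if t.booted}
        bad = sorted({t.cpu_vrm_set for t in trials
                      if t.nvram_cleared and not t.booted} - good)
        return "board OK; suspect sets: " + (", ".join(bad) or "none")
    # Only tests run after a clear count; otherwise a stale flag may be to blame.
    conclusive = {t.cpu_vrm_set for t in trials if t.nvram_cleared}
    if len(conclusive) < 2:
        return "inconclusive: retest each set after clearing NVRAM"
    # Several different sets all failing points at what they have in common.
    return "suspect shared parts: system board or power"


# Made-up example: both sets fail even after a clear.
trials = [
    Trial("set 1", True, False),
    Trial("set 2", True, False),
]
verdict = interpret(trials)
print(verdict)
```

The more independent sets fail after a proper clear, the less likely it is that the sets themselves are the problem.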

Note:
When you remove the heatsink, the cooling paste must be replaced!
It's OK for testing, but before putting the server into production, the old paste must be cleaned away and new paste applied.

BR
/jag

cliftyman
New Member

This is SduvallBG... couldn't log in to my old account so I created a new one....

I just bought a matching pair... VRM and CPU... I installed them, cleared the BIOS and then rebooted, and this server still doesn't work.

Something else is wrong with this... whether it's the board near the power supply (not sure what it is) or the power supply itself, I'm not sure.

I really want to get this server up and running but I don't want to replace the motherboard unless I have to.
Mark Cassidy
Occasional Visitor

Not sure if this will be of any help, but I had a very similar issue recently with a client's server, and after a few hours of fiddling around and swapping CPUs over it turned out to be a faulty motherboard.

This customer's system was also running dual CPUs, and the board was behaving as if one of the CPUs had failed. The trigger for me was when I swapped it back to a single CPU and tried each CPU in the single slot. I figured it couldn't be that both CPUs had failed, so I pulled a board from a spare system that I had and it powered up like a dream.

Interestingly enough, this client had been having a lot of issues with the server being quite noisy. The system fan had been kicking in at high speed quite a lot, detecting the system was over temp, but I had cleaned it out very thoroughly and airflow was fine. After replacing the motherboard it's running perfectly.

If you have a spare board you can put your hands on, it might be worth a try....

Mark.
gregersenj
Honored Contributor

Yes

This is meant for both of you.

Two bad CPUs or two bad VRMs is not the most likely explanation.

I agree on that.

But regarding adding CPUs or having a failed CPU (including the VRM):

There are 3 issues (traps).

1. There must be a functional CPU in socket 1.

2. CPUs must be supported; a BIOS upgrade might be required before adding them,
due to new microcode.

3. A ProLiant disables failed CPUs.
To get a CPU activated after replacement, it must be marked as repaired in the IML, then the server must be rebooted.


So if you get a defective CPU 1,
you need to clear the CMOS, which also clears the IML, and replace the CPU. And hopefully your BIOS has the microcode for the new CPU.

And yes, it can all be down to a failed system I/O board.
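
The three traps listed above can also be written as a quick pre-flight check. This is just an illustrative Python sketch: `upgrade_blockers` and its parameters are hypothetical names, and the answers have to come from your own inspection of the server, not from any tool.

```python
def upgrade_blockers(socket1_cpu_ok, bios_supports_cpu, socket_flagged_failed):
    """Return the traps from the list above that still apply (empty list = none)."""
    blockers = []
    if not socket1_cpu_ok:
        blockers.append("no functional CPU/VRM in socket 1")
    if not bios_supports_cpu:
        blockers.append("BIOS may lack microcode for the new CPU: upgrade first")
    if socket_flagged_failed:
        blockers.append("CPU still flagged as failed: mark repaired or clear CMOS")
    return blockers


# Hypothetical case: good parts and a current BIOS, but the old failure flag remains.
remaining = upgrade_blockers(True, True, True)
print(remaining)
```

If the list comes back empty and the server still refuses to POST, that is when the system I/O board becomes the main suspect.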

BR
/jag

cliftyman
New Member

The CPU was OEM... this BIOS version should support it.

I did flip switch 6 on the maintenance switch too... to clear the BIOS.

I have a feeling I have several good CPUs and VRMs... I think it's a power regulator, power supply or motherboard issue.