
ML350 G5 resetting

mastsrl
Frequent Advisor


Hello,

I'm having a serious problem with an ML350 G5 server (45 months old). Around a month ago it started to reset spontaneously and to drop drives from the arrays (marking them as disconnected, removed, or having a cable connection problem; some drives were marked as not present or offline all the time). It's important to note that the server runs 24x7 and no changes were made at all; the last maintenance occurred 10 months before.

 

It started out doing it once per day, then at random intervals of roughly 1 to 4 hours, sometimes as soon as 30 minutes after a reboot...

It's not a software problem, as it's a hard reset (no BSOD or anything of the sort), and I've tested with Ubuntu booted from a pendrive and it still does it.

 

I updated all the firmware to the latest available and it made no difference.

The drive issue resolved itself: I switched cables around between ports/plugs, swapped the positions of the ailing HDDs, and it stopped.

 

This is the current setup:

E5620

8GB 4x2GB

BBWC for e200i

E200 w/128MB BBWC

12 SFF SAS HDD (8 on e200i, 4 on e200)

2 NC110T NICs

1 x 1KW PSU

 

I've done multiple tests but can't isolate the problem successfully.

 

All tests bypass the installed OS to rule out software problems (Ubuntu 11.04 64-bit booted from a pendrive), and are always run with the drives powered (power connector plugged into the drive/backplane).

 

Summary of tests and results (X for fail):

  • Tested with a compatible PSU from a ML370 G5 companion server: X
  • Changed CPU: X
  • Changed PSU slot: X
  • Cleared NVRAM with DIP switch: X
  • drive cage connected to e200i, NICs plugged, E200 removed (but drives powered): X (9hrs ~)
  • HDDs on e200 only, NICs plugged: X (~1:34hrs)
  • HDDs on e200 only, NICs unplugged: X (36hrs)
  • HDDs on e200i, NICs unplugged, SC44G added for backup: X (3hrs)
  • all HDDs SAS cables unplugged, no e200,  NICs plugged: PASS (ran for more than 96hrs until test stop)
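
The controller/NIC runs in the list above can be cross-checked mechanically. What follows is only a minimal sketch in Python: the factor names (`sas_cabled`, `nics_plugged`, `drives_on`) are labels made up for this illustration, the PSU/CPU/NVRAM runs are left out because their drive configuration isn't stated, and the SC44G is ignored. It looks for any factor that has the same value in every failing run and a different value in the passing run.

```python
# Each run from the list above, reduced to a few factors.
# Factor names are illustrative labels, not HP terminology.
runs = [
    {"sas_cabled": True,  "nics_plugged": True,  "drives_on": "e200i", "failed": True},   # ~9 hrs
    {"sas_cabled": True,  "nics_plugged": True,  "drives_on": "e200",  "failed": True},   # ~1:34 hrs
    {"sas_cabled": True,  "nics_plugged": False, "drives_on": "e200",  "failed": True},   # 36 hrs
    {"sas_cabled": True,  "nics_plugged": False, "drives_on": "e200i", "failed": True},   # 3 hrs
    {"sas_cabled": False, "nics_plugged": True,  "drives_on": None,    "failed": False},  # 96+ hrs
]


def suspect_factors(runs):
    """Return factors that hold one value in all failing runs and never in a passing run."""
    failing = [r for r in runs if r["failed"]]
    passing = [r for r in runs if not r["failed"]]
    factors = [k for k in runs[0] if k != "failed"]
    suspects = []
    for factor in factors:
        fail_values = {r[factor] for r in failing}
        pass_values = {r[factor] for r in passing}
        # A suspect is constant across failures and absent from every pass.
        if len(fail_values) == 1 and not (fail_values & pass_values):
            suspects.append(factor)
    return suspects


print(suspect_factors(runs))  # ['sas_cabled']
```

With only one passing run this is thin evidence, but it does show that neither the NICs nor the choice of controller separates the failures from the pass; only having the SAS cables connected does.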

 

The iLO 2 log shows a weird event when the resets occur. It doesn't log a server reset like it normally would, nor an ASR event (ASR was disabled for these tests), nor a thermal event. Instead it logs:

Server power removed
Server power restored

at the EXACT same time

 

I also discovered, when I booted from a DOS pendrive, that pressing Ctrl+Alt+Del made the server do a small power dip (it turned off and back on within a fraction of a second). When it rebooted it marked one HDD as dropped from an array and started rebuilding, and it logged the exact same event as before in the iLO log.

I haven't been able to reproduce it with Ctrl+Alt+Del since.

 

Insight Diagnostics reports some HDD errors, self-test failures and above-threshold errors, and asks me to replace some drives (but they are not marked as bad and ACU says they're healthy, so I don't know what to believe).

It also shows a corrupted WWN on the e200i(blank).

 

One oddity that happened only once: I ran the Insight Diagnostics full unattended test overnight (hot night, no AC), and the following morning the server was off, with both health LEDs blinking orange and the motherboard flagging a failed CPU2 VRM (there is no CPU2...). I found a thread about that problem, but it apparently affected rev 2 PSUs; this server has a rev 6 yellow-dot PSU.

It didn't do it again, even when running heavy Linpack overnight with the fans at 90% speed.

 

I have no idea what's going on. The tests are all inconclusive and waste a ton of my time, and since the server is way out of warranty I have close to no support from HP.

6 REPLIES
BaneWalken
Visitor

Re: ML350 G5 resetting

There are a few steps you can take.

The first is to test whether the DIMMs are working properly, by testing them pair by pair.

But based on what you have said (automatic reboots, internal health and PSU health LEDs amber), there is a chance that your system board is faulty. Do you happen to have another server of the same model as the faulty one?

If so, you can swap the motherboard and see if it works.

All I can suggest is to isolate it part by part, since Insight Diagnostics and the other diagnostic tools you have used are giving weird answers.
mastsrl
Frequent Advisor

Re: ML350 G5 resetting

I haven't really swapped the DIMMs, because all the tests I'm running use heavy stress tools that exercise the memory (Linpack, Prime95, Insight Diagnostics) and none has ever reported a single RAM error. (The other possibility is an electrical failure in the DIMMs that brings the server down, but that's unlikely.)

 

No other server remotely like this one available.

 

This is the problem: I can't reliably isolate the PSU, the power backplane or the motherboard due to the bizarre nature of the errors.

 

It essentially boils down to: connect a SAS cable to ANY E200 controller (e200i or E200) -> resets

mastsrl
Frequent Advisor

Re: ML350 G5 resetting

I swapped the RAM for a couple of sticks I had here (never used), and it restarted all the same with the HDDs connected to the e200i.

 

Any ideas from HP?

Bernard Luksich
Occasional Advisor

Re: ML350 G5 resetting

 

This looks like an old thread but we had exactly the same problem and it took us weeks to locate the root cause.  So we want to share the solution in case others have something similar.

 

As reported above, the machine had been running for a few years (ESXi 4.0) with no failures, faults, or resets.

 

Then about four weeks ago, the machine just "reset" itself and went through a cold restart.  The only information that the Integrated Lights Out Management reported was:

 

  1. Power restored to iLO 2.
  2. On-board clock set; was previously [NOT SET]

Also reported in the Integrated Management Log

 

  1. POST Error:  1792 - Drive array reports valid data found in array accelerator

We won't go through all the other things we tried that we received from HP and the different Forums.  All to no avail.

 

Finally, we got down to the "power chain" to the machine.  The machine has redundant power supply modules, each one connected to a different UPS system.

 

We removed one of the redundant power supplies and its connection to its UPS.  At that point the machine ran perfectly with no restarts.

 

We then returned that power supply to service BUT connected it directly to the AC mains (bypassing the UPS).  The machine has continued to run with no restarts.

 

At this time the machine has one power module connected to a UPS, and the second power module is connected directly to the mains.  No problems.

 

So it appears that the one UPS must have some sort of problem that is triggering BOTH power modules to briefly turn off completely.  Not what you would expect with a redundant power supply configuration.  And just to be detailed, the power supplies are running in "Balanced Mode" so they share the power load.

 

It is not clear how one UPS could do this.  Other devices connected to the UPS in question have had no problems during this period, so we would not have expected this behavior.

 

If you are running into this situation, take a look at the complete power chain from the wall to the power supply module.

 

That worked for us.  If the problem does return, we will update this post.

 

Thanks.

 

mastsrl
Frequent Advisor

Re: ML350 G5 resetting

Bernard,

Unfortunately it doesn't apply: the restarts were happening at the customer site, and then the server was moved to our lab and further testing was done here, both on direct power and on a line-interactive UPS.

It also happens with only one PSU plugged in.

 

We ended up decommissioning the server. There was no response from HP, and throwing money at useless spares is not a solution.

Bernard Luksich
Occasional Advisor

Re: ML350 G5 resetting

Sorry about your problem with that server.  We are basically doing the same thing: the server is in the process of being decommissioned even though we seem to have the problem resolved.

 

The difference between your situation and ours may be that your iLO log showed "Server power removed" followed by "Server power restored".  We never had an entry for the loss of power; only the power restore was logged.

 

That is what made us think that power to the whole server complex was suddenly gone, which meant the lights-out management was unable to log the loss since it also had no power.  The iLO clock not being set was another indicator of a complete loss of power.

 

In any case, we just wanted to have this recorded somewhere, since no online data or tech note pointed to a possible UPS issue bringing down a server, especially one with a redundant power supply.

 

Hope you have some good experiences with your current server batch.