ProLiant Servers (ML,DL,SL)

Sparking Motherboard - Frequent HDD Failure Rate - ML350 G5

 
SOLVED
snoopsmsc
Occasional Visitor

Sparking Motherboard - Frequent HDD Failure Rate - ML350 G5

I have two ML350 G5 (model 412645-B21) servers.  They were originally purchased as:

  • Single processor
  • 2x 250 GB SATA 7200 RPM disks
  • Integrated Storage Controller (E200?) with BBU
  • 4 GB RAM

These servers are being used as ESX hosts, and so have gone through several rounds of upgrades.  Now, the servers are:

  • Dual processor (plus power module)
  • 2x 250 GB SATA 7200 RPM disks
  • Integrated Storage Controller with BBU
  • 24 GB RAM

The latest upgrade performed was a memory upgrade from 16 GB to 24 GB RAM, done in either late September or early October 2011.  The first server went off without a hitch.  However, after installing the memory in the second server and powering it on, I saw sparks generated on the motherboard.  The server was beeping at me, so I shut it off, put the old memory back, and tried to turn it on.  It POSTed with 16 GB RAM.

 

I thought after seeing sparks I was toast, and would have to give the bad news to the boss.  Since it POSTed with the old memory, I thought I'd try the new memory again.  Lo and behold, it POSTed with 24 GB RAM.

 

I thought to myself that this couldn't be good, but I decided to let sleeping dogs lie.

 

(Let me say, I've done a LOT of hardware upgrades in my day.  I'm not a hardcore hardware guru, but I know what I am doing, and didn't drop any screws or otherwise short the motherboard out.  The memory was properly seated when I first powered the server on, and the memory pairs were installed in the proper banks.)

 

Then, in mid-October, both hard drives in the sparking server died.  The failures weren't simultaneous (they were a few days apart), so I convinced myself it was a coincidence, rebuilt the OS, and thought I was out of the woods.

 

However, in late January 2012, another drive died.  I was starting to get suspicious.

 

Now, just this week, it killed another drive.  Four failed drives, all within 6 months...  Clearly this is too many failures for a single machine (the other server hasn't experienced any drive failures).

 

These drives have all been confirmed as failed using Seagate's SeaTools running on a separate machine.

 

I feel at this point it is foolhardy to continue throwing drives into this machine, given its appetite for HDD blood.  I've got to address the issue and advise my boss on replacing the failed parts, or replacing the entire machine.

 

I've run HP Insight diagnostics on the machine, which found nothing.  My gut feeling is to replace the motherboard, but I suppose it could also be an issue with the power supply.  I would never hear the end of it if we replace the motherboard, and it gets fried by a failing power supply.
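One lower-level check than Insight Diagnostics, if the server's management processor exposes sensor readings (e.g. via a tool like ipmitool's `sensor` command), is to compare each voltage rail against its thresholds. This is only a sketch of the idea: the rail names, readings, and thresholds below are fabricated for illustration, and real sensor output has more columns.

```python
# Sketch: flag out-of-range voltage rails from "ipmitool sensor"-style output.
# The sample rails, readings, and limits below are made up for illustration.
SAMPLE = """\
3.3V Rail | 3.312  | Volts | ok | 2.970  | 3.630
5V Rail   | 5.460  | Volts | nc | 4.500  | 5.250
12V Rail  | 12.100 | Volts | ok | 10.800 | 13.200
"""

def out_of_range(sensor_text):
    """Return the names of rails whose reading falls outside [low, high]."""
    bad = []
    for line in sensor_text.strip().splitlines():
        name, value, unit, status, low, high = [f.strip() for f in line.split("|")]
        if not (float(low) <= float(value) <= float(high)):
            bad.append(name)
    return bad

print(out_of_range(SAMPLE))  # the fabricated 5V reading is above its upper limit
```

A rail that drifts out of spec under load would implicate the power supply rather than the motherboard, which is exactly the distinction I'm worried about.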

 

One last thing I should mention...  Aside from destroying every drive I put in it, the server seems to work fine.  No strange reboots, OS stops, beeping at POST, etc.

 

So, after this torturously long story, my questions are:

  • Is there a more accurate/detailed/low-level method for testing the motherboard and power supply than simply using HP Insight Diagnostics?
  • If I were to send the server (which is out of warranty) to HP support, how would they proceed to determine the issue (or would they simply replace it)?
  • Should I also consider replacing the drive cage/storage backplane?  How about the power supply backplane?
  • Are there any other questions I have not asked that are relevant to resolving my issue (answers to such questions are much appreciated)?

I would consider myself much more OS/systems oriented than hardware oriented, so any help from the hardware geeks is greatly appreciated.

 

Thanks,

Monty

7 REPLIES
Michael A. McKenney
Respected Contributor

Re: Frequent HDD Failure Rate - ML350 G5

I have an ML350 G5 with a Carepaq.  Firmware took out 5 server boards and 12 FC2142SR fiber cards in 4 years.  You are dropping drives?  I would not use SATA in any server.  SATA will not last 24x7x365.

Johan Guldmyr
Honored Contributor

Re: Frequent HDD Failure Rate - ML350 G5

"These drives have all been confirmed as failed using Seagate's SeaTools running on a separate machine."

 

Isn't that enough? So the drive failed - but why?

 

It is quite common for SATA drives to fail if they are over-used.

 

I don't have the documents to hand, but there is a white paper for the EVA that says FATA drives (Fibre Channel SATA drives) are only meant to be used at about 30% duty, or 13 hours a day, 5 days a week - not 24/7.

 

If you run a server that has activity 24/7 you should not be using SATA drives.

 

So, after this torturously long story, my questions are:

  • Is there a more accurate/detailed/low-level method for testing the motherboard and power supply than simply using HP Insight Diagnostics?
  • ### Probably not for the embedded SATA controller.
  • If I were to send the server (which is out of warranty) to HP support, how would they proceed to determine the issue (or would they simply replace it)?
  • ### HP would most likely not replace the whole server. If it's out of warranty you'd have a 'trade' call where, at least in theory, you should only pay for man-hours and the parts that are actually needed to fix the problem. Can get quite expensive.
  • Should I also consider replacing the drive cage/storage backplane?  How about the power supply backplane?
  • ### Instead of spending money on replacing things, I'd get a more robust type of hard drive and an HP Smart Array controller that's compatible with your server.
snoopsmsc
Occasional Visitor

Re: Frequent HDD Failure Rate - ML350 G5

Your points are duly noted.  My preference would also be enterprise-level drives.

 

However, this ignores the point that my other two servers (there is one more server at a remote location that I didn't mention in my story) have never dropped a single drive, and they are also running SATA drives.

 

The reason that SATA has been acceptable for me on these machines is that all of our critical storage is on a NetApp array running 12 SAS drives, which these servers access via NFS. I found it to be a bit excessive to use expensive disk for volumes that don't require enhanced performance or warranty levels. We're on a fairly small budget, and I don't think I'd be able to sell 2x HDDs at $400 each for three servers, especially after having sold the boss on a pair of $10k storage devices. These SATA drives are only providing boot and swap for the ESX hosts.

 

Furthermore, the disks in these servers are the ES.2 series drives from Seagate - supposed to be more enterprise-level.  I will concede a lower MTBF on SATA drives in general, but for one server to drop 4 drives in 6 months, while the other two servers haven't dropped a single drive in a combined 8 years, seems excessive.

 

Also, consider the workstations in your own environments... I know in mine, there are many users who leave their computers on 24x7. During the 4 years with my current employer, I cannot think of a single workstation that has incurred an HDD failure. Sure, I've seen failed SATA drives, but not at this rate.

 

SATA drives may be more prone to failure, but they aren't that bad - I'm not convinced that an average drive lifetime of 1.5 months is in any way normal or to be expected, regardless of utilization/load.  While I'll concede that SAS or Fibre Channel drives may be a more robust, enterprise-level option, SATA should work, and shouldn't be failing this frequently.

 

Finally, don't forget the sparking.  In my experience, a sparking motherboard has never led to anything good. ;)

 

For the sake of argument, let's assume that these 4 drives didn't die of natural causes, and there is some sort of voltage regulation issue (or other out-of-range electrical fluctuation) in one of the other components in the server, causing the drives to fail.

 

If the motherboard has issues, it is reasonable to assume that a shiny new storage controller might be fried by the motherboard.

 

If it is the power supply that has issues, enterprise-level drives should be just as susceptible to electrical fluctuations as consumer drives (this is just an assumption, but I can't find any information to the contrary).

 

I didn't dig TOO much into the hard drive diagnostics, but I really feel the failures are with the electronic components in the drives, not mechanical issues. No weird noises are coming from these drives. Now that I've said that, though, I'm curious. I'll see if I can get more information about the failures.
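Since I said I'd dig into the failure details: one quick check (assuming the drives still respond at all) is the SMART attribute table from smartmontools' `smartctl -A /dev/sdX`, looking at a few failure-relevant attributes.  The rows below are fabricated sample output, not readings from my drives; the parsing is just a sketch of the idea.

```python
# Sketch: pull a few failure-relevant SMART attributes from a "smartctl -A"
# style attribute table.  These sample rows are fabricated for illustration.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   098   098   000    Old_age   Always       -       2
199 UDMA_CRC_Error_Count    0x003e   200   198   000    Old_age   Always       -       341
"""

WATCH = {"Reallocated_Sector_Ct", "Reported_Uncorrect", "UDMA_CRC_Error_Count"}

def raw_values(table):
    """Map attribute name -> raw value for the attributes we care about."""
    out = {}
    for line in table.strip().splitlines():
        fields = line.split()
        name, raw = fields[1], fields[-1]
        if name in WATCH:
            out[name] = int(raw)
    return out

print(raw_values(SAMPLE))
```

A climbing UDMA_CRC_Error_Count with near-zero reallocated sectors would hint at the electrical path (cable, backplane, controller) rather than the platters - which would fit my theory that the server, not the drives, is the problem.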

 

Long story short, I am having a hard time accepting that this server is not at least partially responsible for these drive failures. I don't believe that these drives would have failed so quickly in either of the other two servers. I need to figure out whether to diagnose this server (and how to do so), replace all of the components that may be the problem (motherboard, power supply, backplanes), or just replace the thing.

 

Lastly, I apologize for the length of this. My bosses are always complaining that my emails are too long, and considering that you are taking time to offer some advice, I should apologize for wasting your time with my ramblings. :) However, your help is appreciated.

Johan Guldmyr
Honored Contributor

Re: Frequent HDD Failure Rate - ML350 G5

"Also, consider the workstations in your own environments... I know in mine, there are many users who leave their computers on 24x7. During the 4 years with my current employer, I cannot think of a single workstation that has incurred an HDD failure. Sure, I've seen failed SATA drives, but not at this rate."

 

In a larger array with so-called enterprise FATA drives under heavy load (so not in laptops, where they perhaps go into sleep mode or aren't actually used most of the time), I've seen drives accumulate errors until the array eventually failed them.  And that happened to a disk about once a week.

 

But I suppose the way your server is set up the disks aren't actually used that much.

 

It would be interesting to see whether that Seagate tool could give a more detailed reason for which tests the disks failed.

 

And yes, because your post was so long I missed the part about sparking ;)

 

Looks like the E200i is the default controller.

 

I thought you were using an integrated SATA controller without RAID :p

 

It's a bit confusing that it's an integrated 'E200i', but the service and maintenance guide has a spare part for the E200 controller, so it's not really integrated?

 

HP has a tool (HP array diagnostic utility - ADU) that gives _lots_ of information and produces a report.

 

Power supply backplanes are usually passive components, meaning they shouldn't often fail themselves, but it does happen.

snoopsmsc
Occasional Visitor

Re: Sparking Motherboard - Frequent HDD Failure Rate - ML350 G5

I apologize about my server description yesterday. I was trying to get out of the office, and to be honest, was too lazy to look up the controller in the server.

 

I have confirmed that it is an E200i, not an E200.  For what it's worth, it has 128 MB cache with BBU.  It seems the E200 was an option, but it is an add-in card, as opposed to the E200i, which is onboard.

 

Another bit of detail I didn't include earlier (because I had come to a perhaps premature conclusion that the problem was strictly a hardware problem) is that the disks in the servers were configured as RAID 1.

 

In my head, the in-server SATA drives were just start-up drives - the heavy lifting is done by the network storage arrays.

 

Also, a little better description of the sparking: I would say it was more of an "exploding resistor" spark than a simple electrical arc bridging two contacts.  I could smell it afterwards.

 

My instincts tell me to simply replace the motherboard, but I'm nervous that the damaged motherboard caused damage in the other components, such as the backplanes or PSU.  If I throw in a new motherboard and it gets fried, I'll never hear the end of it.

 

Suggestions?

 

Edit: I ran the Insight Diagnostics, which came up empty-handed as far as conclusions.  I'll take a look at the ADU.

Johan Guldmyr
Honored Contributor
Solution

Re: Sparking Motherboard - Frequent HDD Failure Rate - ML350 G5

I suppose you could inspect the components.  Damage from a spark or a burning smell should be visible on the system board, shouldn't it?  But be careful: sometimes components look broken or badly manufactured when they are not.

snoopsmsc
Occasional Visitor

Re: Sparking Motherboard - Frequent HDD Failure Rate - ML350 G5

Ok.  So here's the resolution...

 

So I had run my eyes over the motherboard a couple of times, but I was mainly looking for blown resistors, since I was inside the case at the time of the sparks (leading me to think that the cause of the sparks was some sort of short).

 

Turns out, after the fourth or fifth pass, I noticed that one of the capacitors by the memory banks (PCB # C575) was bulging a bit.  Further, after blowing a light coat of dust off the motherboard, I noticed there was also some slight leakage from this capacitor.

 

This might also explain another oddity of this server - the memory modules in this server have always run much hotter than in the other identically configured servers.

 

Anyway, now that the problem has been identified, I know where to go from here.  Thanks for your help!