ProLiant Servers (ML,DL,SL)
ML 350 g5 seems to overheat

Since May 14 our ML 350 G5 running CentOS 4.7 has had the following problem: occasionally the Internal Health light turns red and the server hangs completely. Pings from other hosts fail. The only way to regain control of the server is to hold the power button for a few seconds and then boot again. The server starts with no problem and every light is green. Then, hours or days later, the red light is back and the server dies again. Since May 14 this has happened about 10 times; sometimes the server reboots itself, other times it just dies.
It looks like an overheating problem.
Any ideas? Is it a hardware malfunction, or a cracker who has gained control of the server?
Johan Guldmyr
Honored Contributor

Re: ML 350 g5 seems to overheat

Are you monitoring the temperature?
Is it in fact warm?
Fans not spinning/spinning poorly?

HP has diagnostics you can run, both via the ProLiant Support Pack/System Management Homepage and from the SmartStart CD. A memory test (memtest86, for example) might also be worth a shot.

Have you checked the IML (Integrated Management Log) through the iLO?

Would you be able to look for error LEDs on the system board?

Are there any errors that might explain it in the OS log files?
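
For the OS log check, a minimal Python sketch along these lines could save some scrolling. It is only an illustration: the keyword list and the function name `suspicious_lines` are assumptions, not anything HP or CentOS ships, and a machine that locks up hard may well leave nothing in the logs at all.

```python
import re

# Hypothetical keyword list; adjust it to whatever the kernel and drivers
# on the server actually write when something goes wrong.
PATTERN = re.compile(
    r"thermal|temperature|overheat|machine check|\bmce\b|panic",
    re.IGNORECASE,
)


def suspicious_lines(lines):
    """Return the log lines that mention any of the keywords above."""
    return [line.rstrip("\n") for line in lines if PATTERN.search(line)]


# Example use (reading /var/log/messages normally requires root):
#
#     with open("/var/log/messages", errors="replace") as log:
#         for hit in suspicious_lines(log):
#             print(hit)
```

An empty result does not prove the hardware is healthy; it only means none of these words were logged before the hang.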

Re: ML 350 g5 seems to overheat

Thank you for this answer.
My technician was there yesterday morning when the server hung again. He told me the server felt hot, but the fans seemed to be working normally. The server booted again with no problem, and so far today there has been no critical situation.
I've never checked the IML log; is that possible under CentOS too? I looked through the various Linux logs (secure, boot, error) but didn't find anything meaningful.
Tomorrow morning the technician will try the HP diagnostic tools. I hope we find an answer.

Johan Guldmyr
Honored Contributor

Re: ML 350 g5 seems to overheat

Hi, the iLO (Integrated Lights-Out) is separate from the OS. It has its own NIC on the server and its own IP address, different from the server's. It is for remote management (for example, you can remotely restart the server from there and see POST errors). If it's not connected, I would advise you to connect it; it can be quite useful in these scenarios.
holger holst
Occasional Advisor

Re: ML 350 g5 seems to overheat

I have the same server and, since April 26, the same problem but worse. The internal health light turned red and never recovered. The server is still under warranty, and a service company found that ALL 8 HDDs are dead (the technician tried the disks in another server without success). The main problem is overheating in the HDD area!
However, on HP's advice, in the first case they replaced only the power supply and backplane; HP closed the case ... but the server still didn't work.
Second case opened: after checking that the HDDs weren't fakes and reviewing the test protocols etc., the HP L2 engineers set an action plan to replace 4 (?) of the 8 HDDs and to replace the power supply and backplane in one go (again).
The initial report and the report after 1 day show the system failed on all HDDs. After 1 day, the ADU analysis from L2 said: "report shows that batteries are fully loaded and parity is still under process and might take some more time to complete. logical drive status is fine and no action required. pls do send more adu report tomorrow."
After 3 wasted days of testing ... the HDDs failed, heating up like a stove.
BTW, that is now 26 days without a server...
New case # again ??
Action plan 2: "Current status is that we have replaced 4 HDDs that were showing failure and replaced... all HDDs are showing normal green light. However, according to the engineer, the HDDs are still running much hotter than usual.
I (the new case handler) have now checked with the L2 engineer again and we are now replacing the power supply and power supply backplane (at the same time), which should solve the issue with HDD overheating!"

... OOPS ... sounds familiar?!
So I have now been without a server for 29 days. I don't want to blame the guys working on this, because it seems this problem shows up very seldom (although you might have the same problem!!), so the L2 guys are obviously clueless right now. The question is: is there somebody out there who knows this problem and also knows a solution?
gregersenj
Honored Contributor

Re: ML 350 g5 seems to overheat

To both of you:
You must get the temperature in the server under control.

Ensure a proper inlet temperature.
- Below 25 °C.
- Ensure hot (exhaust) air is not looped back.

Ensure proper airflow in the server is maintained.
- Check that all fans are running.
- Check that the air vents are not obstructed.
- Do not keep the server running without the lid on.

> Paolo, you need to check whether it is only the CPU that is suffering. That could be due to a broken heatsink (coolant leaked from the heat pipes).

> Holger, you need to check the inlet temperature, and especially do not run the server without the lid.

BR
/jag

Accept or Kudo

holger holst
Occasional Advisor

Re: ML 350 g5 seems to overheat

Thanks for the advice.
However, the server is in the HP service center right now; I can't believe that the guys there don't have the right environment for the unit.
It seems this heat problem is generated internally by a malfunction and is not related to 'improper' cooling of the unit. I assume proper cooling is a given in the service center!

Re: ML 350 g5 seems to overheat

The iLO port is a powerful tool and I hope to configure it on our server early tomorrow morning (it's not easy to find a good moment to stop the server!).
When we bought this server in 2008 I didn't fully realize the value of such a tool! I have used HP servers for 15 years and never had a hardware problem before ...

Holger's problem seems much worse than ours. After every hang our server starts again in good health; it has now been quiet and working well for 2 days. That's why we suspected a cracker attack. It happens so suddenly!

Maybe it's not the cause, but I see the rear fans are very dirty with dust. Is there a safe way to clean them without doing more damage by sending dust inside the server?

Re: ML 350 g5 seems to overheat

I have now connected the iLO 2. A great tool, but I didn't find out much; here are the lines from the iLO 2 log concerning the first and last critical stops of the server:

Informational iLO 2
05/23/2011 18:04 05/23/2011 18:04 1
Server power restored.

Informational iLO 2
05/23/2011 18:04 05/23/2011 18:04 1
Server power removed.

Informational iLO 2
05/15/2011 09:11 05/15/2011 09:11 1
Server power restored.

Informational iLO 2
05/14/2011 13:50 05/14/2011 13:50 1
Server power removed.

The IML has no errors; the most recent entry is dated August 2009!

In fact, all I can read is that the server was switched off and on. Nothing else. No mention of the red light alert.

Any ideas?
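
One thing the four entries do give is the length of each outage. As a rough Python sketch (the `outages` helper is made up for illustration, the timestamps are read as month/day/year, and for the two entries sharing the 18:04 stamp it is assumed that the removal came first):

```python
from datetime import datetime

# (timestamp, message) pairs transcribed from the iLO 2 log above, oldest first.
EVENTS = [
    ("05/14/2011 13:50", "Server power removed."),
    ("05/15/2011 09:11", "Server power restored."),
    ("05/23/2011 18:04", "Server power removed."),
    ("05/23/2011 18:04", "Server power restored."),
]


def outages(events):
    """Pair each 'removed' entry with the next 'restored' entry.

    Returns a list of (time power was removed, duration without power).
    """
    result = []
    removed_at = None
    for stamp, message in events:
        when = datetime.strptime(stamp, "%m/%d/%Y %H:%M")
        if "removed" in message:
            removed_at = when
        elif "restored" in message and removed_at is not None:
            result.append((removed_at, when - removed_at))
            removed_at = None
    return result


for start, duration in outages(EVENTS):
    print(start, duration)
```

With these entries the first outage lasts 19 hours 21 minutes (overnight, until somebody powered the server back on), while the second is over within the same minute, which fits a spontaneous reboot rather than a hang.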
gregersenj
Honored Contributor

Re: ML 350 g5 seems to overheat

From the iLO you also have access to the IML.

Check the IML. That's where you will find the HW errors.

Fan cleaning: use a vacuum cleaner with a brush, or a brush only, or compressed air.

> Holger, well, let's hope they are pros.


BR
/jag

Re: ML 350 g5 seems to overheat

As I wrote above, I read the IML too, but the last IML warning is from August 2009!! So no error has been reported this year!!
Quite surprising!
Is it a sign that some hacker is having fun with our server?

Re: ML 350 g5 seems to overheat

Thinking it over, I'm really puzzled.
Is it possible that IML service does not report a single caution for 2 years?
Is it possible that IML does not report a critical "red light" situation?
Is it possible that IML itself is not working properly? Maybe disabled in some way?
Johan Guldmyr
Honored Contributor

Re: ML 350 g5 seems to overheat

Have you had any problems in the last two years though? Not all problems are reported in there either.

Re: ML 350 g5 seems to overheat

Absolutely no HW problems since purchase (2008). But there were 4 cautions in the first months.

Instead, I went through the iLO 2 status pages. Everything is reported OK and the fans are declared OK in the summary, but in the detail view fans 7 and 8 (I/O board zone) are reported as "Failed" and the temperature in that zone is 47 °C. See the cut-and-paste below. Is this normal?

Summary:
Fans: Ok; Not Redundant
Temperatures: Ok
VRMs: Ok
Power Supplies: Ok; Not Redundant

Fans:
Location                  Status         Speed
Fan 1: System Zone        Ok             35%
Fan 2: System Zone        Not Installed  n/a
Fan 3: System Zone        Ok             35%
Fan 4: System Zone        Not Installed  n/a
Fan 5: CPU 1              Ok             35%
Fan 6: CPU 2              Not Installed  n/a
Fan 7: I/O Board Zone     Failed         n/a
Fan 8: I/O Board Zone     Failed         n/a

Temperature:
Location                  Status  Reading  Thresholds
Temp 1: Ambient Zone      Ok      27C      Caution: 40C; Critical: 45C
Temp 2: Memory Zone       Ok      56C      Caution: 110C; Critical: 120C
Temp 3: CPU 1             Ok      36C      Caution: 100C; Critical: 100C
Temp 4: CPU 1             Ok      36C      Caution: 100C; Critical: 100C
Temp 5: I/O Board Zone    Ok      47C      Caution: 63C; Critical: 68C
Temp 6: CPU 2             n/a     n/a      Caution: 100C; Critical: 100C
Temp 7: CPU 2             n/a     n/a      Caution: 100C; Critical: 100C
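
For anyone who wants to check a readout like this mechanically instead of by eye, here is a minimal Python sketch. It is not an HP tool: the helper names `failed_fans` and `caution_margins` are invented for the example, and the parsing assumes exactly the single-space text layout embedded below, which was transcribed from the readout above.

```python
import re

# Fan and temperature lines transcribed from the iLO 2 readout above.
FANS = """\
Fan 1: System Zone Ok 35%
Fan 2: System Zone Not Installed n/a
Fan 3: System Zone Ok 35%
Fan 4: System Zone Not Installed n/a
Fan 5: CPU 1 Ok 35%
Fan 6: CPU 2 Not Installed n/a
Fan 7: I/O Board Zone Failed n/a
Fan 8: I/O Board Zone Failed n/a
"""

TEMPS = """\
Temp 1: Ambient Zone Ok 27C Caution: 40C; Critical:45C
Temp 2: Memory Zone Ok 56C Caution: 110C; Critical:120C
Temp 3: CPU 1 Ok 36C Caution: 100C; Critical:100C
Temp 4: CPU 1 Ok 36C Caution: 100C; Critical:100C
Temp 5: I/O Board Zone Ok 47C Caution: 63C; Critical:68C
Temp 6: CPU 2 n/a n/a Caution: 100C; Critical:100C
Temp 7: CPU 2 n/a n/a Caution: 100C; Critical:100C
"""

TEMP_LINE = re.compile(
    r"^Temp (?P<id>\d+): (?P<zone>.+?) (?P<status>Ok|n/a) "
    r"(?P<reading>\d+C|n/a) Caution: (?P<caution>\d+)C; "
    r"Critical:\s*(?P<critical>\d+)C$"
)


def failed_fans(text):
    """Return the labels of all fans whose status column reads 'Failed'."""
    return [line.split(":")[0] for line in text.splitlines() if " Failed " in line]


def caution_margins(text):
    """Map each sensor that has a reading to its distance (in C) below Caution."""
    margins = {}
    for line in text.splitlines():
        m = TEMP_LINE.match(line.strip())
        if m is None or m["reading"] == "n/a":
            continue  # skip sensors without a reading (e.g. the empty CPU 2 socket)
        label = f"Temp {m['id']} ({m['zone']})"
        margins[label] = int(m["caution"]) - int(m["reading"][:-1])
    return margins


print(failed_fans(FANS))
print(caution_margins(TEMPS))
```

On this readout the sketch lists Fan 7 and Fan 8 as failed, and shows the ambient sensor as the one closest to its Caution threshold (13 °C below it), followed by the I/O board zone (16 °C below). Whether that is acceptable for this chassis is a separate question that the numbers alone do not answer.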
gregersenj
Honored Contributor

Re: ML 350 g5 seems to overheat

Sorry, I missed that. I only noticed that you had looked in the iLO event log.

"Is it possible that IML service does not report a single caution for 2 years?"

Yes. ProLiants are rock solid.

"Is it possible that IML does not report a critical "red light" situation?"

Yes, some errors that occur early during POST might not be logged.

"Is it possible that IML itself is not working properly? Maybe disabled in some way?"
Not likely.

I don't know if CentOS is a supported OS.
But if it is, and if the Insight agents are available, it could be helpful to install them. Those tools also log to the IML.

It does make your problem a bit harder to solve.

Ensure proper cooling, and do check the heatsink (try to get a temperature reading on the CPU).

BR
/jag

holger holst
Occasional Advisor

Re: ML 350 g5 seems to overheat

hope so too.
some interesting developments, will keep you updated!
thanks

Re: ML 350 g5 seems to overheat

Thank you for the help and the very useful tips, but the problem is back. Today the iLO recorded 2 "Server power removed"/"Server power restored" events, and this evening, while I was working on the MySQL database from home, the server suddenly went down, this time without coming back. The iLO system status shows these lines:

System Health: Unknown
Internal Health LED: Ok
Server Power: STANDBY (OFF)
UID Light: OFF

and for power

Present power reading: 0 Watts at 20:04:09, 05/27/2011

I tried to restart the server, first with the "Press and Hold" button to turn it off completely, but it doesn't work; the Internal Health LED always shows Ok.
"Momentary Press" is also useless.
So I cannot restart the server remotely. The IML does not report anything, and the iLO log shows only my attempts. It seems quite hopeless!

We have already copied all important data to a new server we bought recently, which will take over next week, but I need to understand what is happening to this server. The last resort will be to send it to HP!
gregersenj
Honored Contributor

Re: ML 350 g5 seems to overheat

Yes, it seems like a good idea to get a techie on it.

BR
/jag

Re: ML 350 g5 seems to overheat

Today, having moved all important data to a new server, I compared the 2 servers: the back of the old one was hotter than the new one's; its power supply was really hot to the touch, while the new server's supply was quite cool; and the old server's fans seemed weaker and hotter than the new ones.
We powered down the server, opened it and gave it a good clean-up; there was dust, but not that much. Then we found that the hottest part was the power supply; maybe its two little fans were worn out? Luckily I had an identical power supply in stock and we swapped it in.
Now the temperatures inside are 4-5 degrees lower than before. Let's wait a few days ...
But the question is: could this hot power supply have caused all those sudden critical "power removed" events?
holger holst
Occasional Advisor

Re: ML 350 g5 seems to overheat

update:
HP finally sent an engineer to check and test my server, so now they believe what the HP partner service had been telling them the whole time.... Now they have replaced the SPS power supply, the drive cage, the DC converter and the fan. The server has now been running for 48 hours without getting hot again ...
The power supply definitely has something to do with the problem...
Will keep you updated!

Re: ML 350 g5 seems to overheat

Three days have passed since we changed the power supply, and our server is absolutely quiet.
Let's wait a few more days before lifting the "red alert", but, adding Holger's trouble to ours, there seems to be one conclusion: the Achilles' heel of this ML 350 G5 server is the power supply!

We had already had a minor problem with it: a few days after we bought the server in October 2008, the power supply became extremely noisy, like an airplane, and we asked HP for a spare part. That's why we had a 2nd supply in stock. Now the old, noisy power supply is in place and so far it is neither noisy nor hot.

The question is: is there a new, revised power supply for these servers?
gregersenj
Honored Contributor

Re: ML 350 g5 seems to overheat

Yes, now that you mention the PSU:
There's an old issue with PSUs, and it could be that the ML350 G5 is affected.

BR
/jag

Re: ML 350 g5 seems to overheat

A week after changing the PSU there has been absolutely no problem with the server. No doubt the PSU was the cause of the trouble.
But when I touch the PSU it still feels hot. It is not as hot as the previous one, but the PSU of our new ML 350 G6 is quite cool.
Now, would it be better to get a new PSU from HP (after 2 1/2 years, should they give it for free)?
I searched the HP forums and found a lot of similar troubles with ML350 G5 power supplies: sudden noise, server hangs ... e.g. http://h30499.www3.hp.com/t5/ProLiant-Servers-ML-DL-SL/NOISY-FAN-on-ML350-G5/m-p/4332257#M86822

I still have one doubt: did HP produce a more reliable PSU for the G5? Or is it just a matter of firmware?

gregersenj
Honored Contributor
Solution

Re: ML 350 g5 seems to overheat

There's a 3-year warranty.
So if the PSU is faulty, you will get a new/refurbished/modified one.

The 3-year warranty runs from the invoice date.

BR
/jag

holger holst
Occasional Advisor

Re: ML 350 g5 seems to overheat

Hello there,
just to keep you informed: my server is finally back (after 8 weeks of repair).

The heat problem is for sure a problem with the power supply! In the returned unit they finally replaced 8 HDDs, the DVD-ROM, tape drive, cable converter, SATA cable, redundant fan, SATA HD cage, PSP backplane and the PSP itself.

On the way to this result they destroyed 4 additional HDDs, 2 backplanes, 2 fans, 2 SPS and another backplane...
YES, there is a 3-year warranty, and yes, I'm lucky: it still runs for 4 months...
So Paolo, you should really hand your server over to HP; they should run full diagnostics and replace everything in doubt.
I hope my server will make it; if it does, bye-bye guys, this is my last mail. Thanks again for your support and compassion.