ML 350 g5 seems to overheat

Since May 14 our ML 350 G5 running CentOS 4.7 has had the following problem: occasionally the Internal Health light turns red and the server locks up completely. Pings from other hosts fail. The only way to regain control of the server is to hold the power button for a few seconds and then boot again. The server starts with no problem and every light is green. Then, hours or days later, the red light is back and the server dies again. Since May 14 there have been about 10 such incidents; sometimes the server reboots itself, other times it just dies.
It looks like an overheating problem.
Any ideas? Is it a hardware malfunction, or a cracker who has gained control of the server?
28 REPLIES
Johan Guldmyr
Honored Contributor

Re: ML 350 g5 seems to overheat

Are you monitoring the temperature?
Is it in fact warm?
Fans not spinning/spinning poorly?

HP has diagnostics you can run, both via the ProLiant Support Pack/System Management Homepage and the SmartStart CD. A memory test (memtest86, for example) might also be worth a shot.

Have you checked the iLO IML log?

Would you be able to look for error LEDs on the system board?

Are there any errors that might explain it in the OS log files?

Re: ML 350 g5 seems to overheat

Thank you for this answer.
My technician was there yesterday morning when the server locked up again, and he told me the server felt hot, but the fans seemed to work normally. The server booted again with no problem and so far today there has been no critical situation.
I've never checked the IML log; is that possible under CentOS too? I looked at the various Linux logs (secure, boot, error) but didn't find anything meaningful.
Tomorrow morning the technician will try the HP diagnostic tools. I hope to find an answer.

Johan Guldmyr
Honored Contributor

Re: ML 350 g5 seems to overheat

Hi, the iLO (Integrated Lights-Out) is separate from the OS. It has its own NIC on the server and a different IP address than the server; it is for remote management (for example, you can remotely restart the server from there and see POST errors). If it's not connected, I would advise you to connect it; it can be quite useful in these scenarios.
holger holst
Occasional Advisor

Re: ML 350 g5 seems to overheat

I have the same server and, since April 26, the same problem but worse. The internal health light turned red and never recovered. The server is still under warranty; a service company discovered that ALL 8 HDDs are dead (the technician tried the disks in another server with no positive result). The main problem is overheating in the HDD area!
However, on HP's advice, in the first case they changed only the power supply and backplane; HP closed the case ... but the server still didn't work.
Second case opened: after checking that the HDDs weren't fakes and reviewing test protocols etc., HP L2 engineers set an action plan to replace 4 (?) of the 8 HDDs and replace the power supply and backplane in one shot (again).
The initial and 1-day reports show the system failed on all HDDs. After 1 day, the ADU analysis from L2: "report shows that batterie are fully loaded and parity is still under process and might take some more time to complete. logical drive stus is fine and no action required. pls do send more adu report tomorrow."
After 3 wasted days of testing ... the HDDs failed, heating up like a stove.
BTW, now 26 days without a server...
A new case number again??
Action plan 2: "current status is that we have replaced 4 HDD's that was showing failure nad replaced... all HDD's are showing normal green light. However, according to Engr. HDDs are still showing very hot than usual.
I (the new case handler) have now checked with L2 Engr. again and we are now replacing Power supply and Power supply backplane (at the same time) which should solve the issue with HDD overheating!

... OOPS ... sounds familiar?!
Well, I have now been without a server for 29 days. I don't want to blame the guys working on this, because it seems this problem shows up very seldom (though you might have the same problem!!), so the L2 guys are obviously clueless right now. The question is: is there somebody out there who knows about this problem and also knows a solution?
gregersenj
Honored Contributor

Re: ML 350 g5 seems to overheat

To both of you:
You must get the temperature in the server under control.

Ensure proper inlet temperature.
- Below 25 °C.
- Ensure hot (exhaust) air is not looped back.

Ensure proper airflow in the server is maintained.
- Check all fans are running.
- Check air vents are not obstructed.
- Do not keep the server running without the lid on.
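
As a rough illustration of the inlet check, here is a minimal Python sketch that flags readings that are not below the 25 °C limit mentioned above. The function name `inlet_too_warm` and the sample readings are made up for illustration; how you actually obtain the temperatures (a thermometer at the front of the chassis, a sensor readout) is left open.

```python
INLET_LIMIT_C = 25.0  # "below 25 °C" limit taken from the advice above


def inlet_too_warm(readings_c, limit_c=INLET_LIMIT_C):
    """Return the (label, temperature) pairs that are NOT below the limit."""
    return [(label, temp) for label, temp in readings_c if temp >= limit_c]


# Hypothetical readings in °C, e.g. noted by hand at the front of the server.
sample = [("morning", 21.5), ("noon", 24.9), ("afternoon", 27.0)]

if __name__ == "__main__":
    for label, temp in inlet_too_warm(sample):
        print(f"{label}: {temp} °C is not below {INLET_LIMIT_C} °C")
```

A reading of exactly 25 °C is treated as too warm here, since the advice says "below".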

> Paolo, you need to check whether only the CPUs are suffering. That could be due to a broken heatsink (coolant leaked from the heat pipes).

> Holger, you need to check the inlet temp, and especially do not run without the lid.

BR
/jag


holger holst
Occasional Advisor

Re: ML 350 g5 seems to overheat

Thanks for the advice.
However, the server is right now in the HP service center; I can't believe that the guys there don't have the right environment for the unit.
It seems this heat problem is generated internally by a malfunction and is not related to 'improper' cooling of the unit. I assume proper cooling of the unit is a given in the service center!

Re: ML 350 g5 seems to overheat

The iLO port is a powerful tool and I hope to configure it on our server early tomorrow morning (it's not easy to find a good moment to stop the server!).
When we bought this server in 2008 I didn't fully realize the value of such a tool! I have used HP servers for 15 years and never had a hardware problem before ...

Holger's problem seems much worse than ours. After every lockup our server starts again in good health; it has now been quiet and working well for 2 days. That's why we suspected a cracker attack. It happens so suddenly!

Maybe it's not the cause, but I see the rear fans are very dirty with dust. Is there a safe way to clean them without doing more damage by sending dust inside the server?

Re: ML 350 g5 seems to overheat

I have now connected the iLO 2. A great tool, but I didn't find out much; here are the lines from the iLO 2 log concerning the first and last critical stops of the server:

Informational iLO 2
05/23/2011 18:04 05/23/2011 18:04 1
Server power restored.

Informational iLO 2
05/23/2011 18:04 05/23/2011 18:04 1
Server power removed.

Informational iLO 2
05/15/2011 09:11 05/15/2011 09:11 1
Server power restored.

Informational iLO 2
05/14/2011 13:50 05/14/2011 13:50 1
Server power removed.

The IML has no errors; the most recent entry is dated August 2009!

In fact, all I can read is that the server was switched off and on. Nothing else. No mention of the red light alert.

Any idea?
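
For anyone who wants to turn iLO event lines like the ones above into outage durations, here is a minimal Python sketch. It assumes the three-line entry layout exactly as pasted (severity and source, then two timestamps and a count, then the description), that the first timestamp is the one of interest, and that the log is listed newest first. The names `parse_entries` and `outages` are made up for illustration and are not part of any HP tool.

```python
from datetime import datetime

# The four entries pasted above, copied verbatim.
LOG_TEXT = """\
Informational iLO 2
05/23/2011 18:04 05/23/2011 18:04 1
Server power restored.

Informational iLO 2
05/23/2011 18:04 05/23/2011 18:04 1
Server power removed.

Informational iLO 2
05/15/2011 09:11 05/15/2011 09:11 1
Server power restored.

Informational iLO 2
05/14/2011 13:50 05/14/2011 13:50 1
Server power removed.
"""


def parse_entries(text):
    """Return (timestamp, message) tuples in the order they appear."""
    entries = []
    for block in text.strip().split("\n\n"):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if len(lines) < 3:
            continue  # skip anything that is not a full three-line entry
        date_str, time_str = lines[1].split()[:2]
        stamp = datetime.strptime(f"{date_str} {time_str}", "%m/%d/%Y %H:%M")
        entries.append((stamp, lines[2]))
    return entries


def outages(entries):
    """Pair each 'power removed' with the next 'power restored'."""
    result = []
    removed_at = None
    # Assumption: the log is newest first, so reversing gives
    # chronological order even when two entries share the same minute.
    for stamp, message in reversed(entries):
        if "removed" in message:
            removed_at = stamp
        elif "restored" in message and removed_at is not None:
            result.append((removed_at, stamp - removed_at))
            removed_at = None
    return result


if __name__ == "__main__":
    for start, duration in outages(parse_entries(LOG_TEXT)):
        print(f"power removed {start:%Y-%m-%d %H:%M}, off for {duration}")
```

On the four entries above this gives two outages: one starting 05/14 at 13:50 that lasted 19 hours 21 minutes, and one on 05/23 at 18:04 where removal and restore fall within the same minute.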
gregersenj
Honored Contributor

Re: ML 350 g5 seems to overheat

From the iLO, you also have access to the IML.

Check the IML. That's where you will find the HW errors.

Fan cleaning: use a vacuum cleaner with a brush, or a brush only, or compressed air.

> Holger, well, let's hope they are pros.


BR
/jag
