
DL140 spontaneous shutdown

 
Derek_72
New Member


We have about 10 DL140s at work, each with a proprietary NIC in it.

When we run serious stress tests on the NICs, the DL140s (no other machine types are affected) spontaneously shut down about 10s into the test. Often this is preceded by one or two bursts where the cooling fans speed up for half a second or so and then slow down again.

After reading about the problems with BMC and IPMI on these machines (which, by the way, are running RedHat Enterprise 3), I upgraded both the BMC firmware and the BIOS.

After the upgrade it now takes between 10 and 15 minutes for the machines to switch off, but switch off they still do. I have also loaded ipmi.o and tried to use shutdown_watchdog (which, thank you HP, gives no messages at all about what, if anything, it is doing, not even in the system log), but this does not seem to make any difference to the shutdown behaviour.

The only other piece of evidence I have is that each time the system shuts down there is a message in the event log saying

Temperature 59 4C 4B

The hex digits following "Temperature" vary from one shutdown to the next.

I also have a logic analyser on the PCI bus and there is no unusual activity, parity or system error reported, just reset being asserted and then the power going off.

The problem only arises when the PCI bus and processors are heavily loaded.

I don't buy the temperature argument as the fans kick in for such a small period of time. Other times the system will shut off without blipping the fans at all.

Is anyone able to decode the hex digits? What do they mean?

Any other ideas about what might be doing this, or how to get it fixed?
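
For anyone trying to read those three bytes, here is a minimal sketch. It assumes the log is printing the three IPMI "event data" bytes of a threshold-type sensor event, in the layout the IPMI specification defines. Whether the DL140's event log really follows that layout is an assumption, and the function name decode_threshold_event is made up for illustration. Under that assumption the first byte says which threshold was crossed and in which direction, the second is the raw sensor reading, and the third is the raw threshold value.

```python
# Hypothetical decoder. ASSUMPTION: the three hex bytes in the log are
# IPMI Event Data 1-3 of a threshold-based sensor event. The DL140 log
# format is not documented in this thread, so treat the result as a guess.

# Low nibble of Event Data 1: which threshold was crossed, and how.
THRESHOLD_OFFSETS = {
    0x0: "lower non-critical, going low",
    0x1: "lower non-critical, going high",
    0x2: "lower critical, going low",
    0x3: "lower critical, going high",
    0x4: "lower non-recoverable, going low",
    0x5: "lower non-recoverable, going high",
    0x6: "upper non-critical, going low",
    0x7: "upper non-critical, going high",
    0x8: "upper critical, going low",
    0x9: "upper critical, going high",
    0xA: "upper non-recoverable, going low",
    0xB: "upper non-recoverable, going high",
}


def decode_threshold_event(text):
    """Decode a string such as '59 4C 4B' into its assumed IPMI fields."""
    data1, data2, data3 = (int(tok, 16) for tok in text.split())
    byte2_kind = (data1 >> 6) & 0x3  # 1 means byte 2 holds the trigger reading
    byte3_kind = (data1 >> 4) & 0x3  # 1 means byte 3 holds the trigger threshold
    offset = data1 & 0xF
    return {
        "event": THRESHOLD_OFFSETS.get(offset, "unknown or reserved"),
        "raw_reading": data2 if byte2_kind == 1 else None,
        "raw_threshold": data3 if byte3_kind == 1 else None,
    }


if __name__ == "__main__":
    print(decode_threshold_event("59 4C 4B"))
```

Read this way, "59 4C 4B" would be an upper critical threshold being crossed, with a raw reading of 76 against a raw threshold of 75. The raw values are sensor counts, not degrees; turning them into a temperature needs the conversion factors from that sensor's data record, which nobody has posted here.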


CA859951
Honored Contributor

Re: DL140 spontaneous shutdown

Derek:

Temperature events can be controlled in the BIOS ASR events. You might try setting the ASR Thermal events and see if the problem goes away. I am surprised you do not see the POST message about a thermal event when the server boots up. You may need to contact HP directly and see if they have a better answer.

G'luck! -john
"Now is the only thing that's real!"
Derek_72
New Member

Re: DL140 spontaneous shutdown

"Temperature events can be controlled in the BIOS ASR events. You might try setting the ASR Thermal events and see if the problem goes away. I am surprised you do not see the POST message about a thermal event when the ..."


How do I configure these things? I see no controls in the BIOS for anything like this.

Thanks
Derek
CA859951
Honored Contributor
Solution

Re: DL140 spontaneous shutdown

Derek:

Very sorry about that... I was working from memory on this. Other ProLiant servers have the option to configure ASR events in the BIOS. The DL140 appears to be like all the other 100 series servers in that the BIOS does not have it. It does appear to have a Watchdog Timer, but I do not think you can control it.
http://h200001.www2.hp.com/bc/docs/support/UCR/SupportManual/TPM_349109-002/TPM_349109-002.pdf
page 42

Are you running the IPMI "Health Driver"?
http://h18023.www1.hp.com/support/files/server/us/download/20414.html
for update 1 or later.

As a temporary fix (test) ... put a room fan in front of the server to move more air through the box. If the load tests run longer or complete altogether, contact HP and let them know. They may be able to work this into their issue tracker and get a fix for you.
"Now is the only thing that's real!"
Derek_72
New Member

Re: DL140 spontaneous shutdown

Processors are really cool to the touch and the machines are in an aircon'd room at 20C.

I tried using the IPMI driver but it didn't seem to make much difference.

Only consistent thing is it keeps dropping messages in the event log about over temp, but the fans change speed for such a small amount of time I can't believe this is real.

I can only conclude that there's some sort of intermittent interference in the temperature measurement circuitry which is fooling the BMC into thinking the processors are in meltdown.
CA859951
Honored Contributor

Re: DL140 spontaneous shutdown

Contact HP Tech Support:
http://welcome.hp.com/country/uk/en/contact_us.html

Let them know of the condition and get a support ticket started. They may have some suggestions off the bat that may help, but be sure to get a ticket started. They may replace the boards under warranty if they know of an issue they have not published.

G'luck! -john
"Now is the only thing that's real!"
CA1204973
New Member

Re: DL140 spontaneous shutdown

Hi Derek,

We've been dealing with this exact same problem with an order of 12 DL140s. However, we do not have a proprietary NIC in our systems. They are HP's standard DL140s with 2 processors and no CD-ROM.

We opened a support ticket with HP but have not received any resolution to the problem. The only additional information I have (that you may know by now) is that the "Critical Temperature Threshold" is only 60 degrees! Too bad we can't change (or even disable) this "feature."

I would greatly appreciate hearing of any success you have with this.

Thanks,
Devin
CA1157687
New Member

Re: DL140 spontaneous shutdown

This appears to be the same problem I am having.

As long as the jobs being run have disk IO, there is no problem. But I have software which loads a design into memory and then does computing. This runs the CPUs at a higher load. I can crash a DL140 consistently.

This is using HP memory (brand new).

Has anyone gotten a fix?

My company policy is to use HP whenever possible and I need to get 8 more machines.

Thanks
CA1204973
New Member

Re: DL140 spontaneous shutdown

Hi Joel (and Derek),

It looks like HP released a BMC firmware update on 11/25 that might help.
http://h18007.www1.hp.com/support/files/server/us/download/22197.html
(I don't know why it isn't listed on the DL140 "Software and Drivers" page ...)
I just updated our systems today, so hopefully we'll see if it helps. Let me know if you already applied it or if you see any improvement.

Thanks,
Devin

ps. If you haven't already seen it, there's a BIOS update from 11/25 also.
CA1157687
New Member

Re: DL140 spontaneous shutdown

Thank you Devin. This appears to have fixed the problem.

I just wish HP could have had this fixed back in August.