Weird restart system after restarting HP utilities

CAS_2
Valued Contributor

Hi

 

After restarting some services such as hpilo, hp-health and hpsmd, our HP ProLiant DL380 G6 restarted abruptly, i.e., the system was power reset (no shutdown). The log in the iLO shows:

 

Severity      Class Last Update      Initial Update   Count Description
Informational iLO 2 03/06/2013 17:41 03/06/2013 17:41 1     Server power restored.
Informational iLO 2 03/06/2013 17:41 03/06/2013 17:41 1     Server power removed.
Caution       iLO 2 03/06/2013 17:41 03/06/2013 17:41 1     Server reset.
Informational iLO 2 03/06/2013 17:41 03/06/2013 17:41 1     BMC IPMI Watchdog Timer Timeout: Action=System Power Reset.

 

Do you think the system power reset was caused by the restart of those services?

The people who restarted the services swear that restarting those services does not cause a system power reset.

 

Thanks in advance

1 REPLY
Matti_Kurkela
Honored Contributor

Re: Weird restart system after restarting HP utilities

I would say that the reset was caused by not restarting the appropriate services, or at least not restarting them quickly enough.

 

This log event is probably the key here:

BMC IPMI Watchdog Timer Timeout: Action=System Power Reset.

 

A "watchdog timer" that shows up in the iLO log is a hardware element that forces the server to reset if the OS seems to be hung. It is a very simple hardware timer that counts down from a set value and resets the server when it reaches zero. An accompanying software component (probably in the hp-health driver) periodically resets the timer back to the start value, so it should never reach zero as long as the OS and the software component remain functional.

 

If the software component is prevented from running for any reason for a long enough time, the hardware timer will reach zero and will trigger a system power reset.
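
To make the mechanism concrete, here is a toy Python model of such a countdown timer. It is only a sketch of the general idea, not HP's or iLO's actual implementation: the class WatchdogTimer, the function run, the 600-second timeout and the 60-second refresh interval are all made-up values for illustration.

```python
class WatchdogTimer:
    """Toy model of a hardware countdown watchdog (illustrative only)."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s      # start value of the countdown
        self.remaining_s = timeout_s    # current countdown value
        self.expired = False            # latched once the timer hits zero

    def pet(self):
        """Software component resets the countdown to its start value."""
        if not self.expired:
            self.remaining_s = self.timeout_s

    def tick(self, seconds=1):
        """Advance time; latch 'expired' when the countdown reaches zero."""
        if self.expired:
            return
        self.remaining_s = max(0, self.remaining_s - seconds)
        if self.remaining_s == 0:
            self.expired = True


def run(timeout_s, pet_interval_s, agent_stops_at_s, total_s):
    """Return the second at which the watchdog fires, or None if it never does.

    The hypothetical agent pets the timer every pet_interval_s seconds
    until agent_stops_at_s, then stops running (service down or OS hung).
    """
    wd = WatchdogTimer(timeout_s)
    for now in range(1, total_s + 1):
        wd.tick(1)
        if wd.expired:
            return now  # this is where the hardware would reset the server
        if now < agent_stops_at_s and now % pet_interval_s == 0:
            wd.pet()
    return None
```

With these assumed numbers, run(600, 60, 10**9, 3600) never fires because the agent keeps refreshing the timer, while run(600, 60, 300, 3600) fires 600 seconds after the last refresh.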

 

In older ProLiant models, this was known as ASR (Automatic Server Recovery) and the default time was about 10 minutes if I recall correctly.

 

There probably is a way for the software component to "disarm" the timer when the software component is intentionally unloaded, but for some reason that did not happen. Anyway, even if the timer could not be disarmed, there should have been several minutes of time to restart the services before the watchdog timer triggered a reset. Maybe the person who actually did the work was unaware of the existence of a watchdog timer and, for example, took a coffee break while the service was down?
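
The time budget described above can be estimated with a little arithmetic. The Python sketch below is only an illustration: the function names are hypothetical, the 600-second timeout comes from my recollection of the old ASR default, and the 60-second refresh interval is a pure assumption. Check the real values for your hp-health version.

```python
def safe_downtime_s(timeout_s, pet_interval_s):
    """Downtime that is safe even if the last refresh happened almost
    one full refresh interval before the service was stopped."""
    return max(0, timeout_s - pet_interval_s)


def restart_outcome(timeout_s, pet_interval_s, downtime_s, disarmed):
    """Classify what an armed or disarmed watchdog does during a service restart."""
    if disarmed:
        return "no reset"          # a disarmed timer is not counting down
    if downtime_s < safe_downtime_s(timeout_s, pet_interval_s):
        return "no reset"          # timer cannot have reached zero yet
    if downtime_s >= timeout_s:
        return "reset"             # timer must have reached zero
    return "reset possible"        # depends on when the last refresh happened


# Assumed values: 10-minute timeout, refresh once a minute.
for downtime in (120, 570, 900):
    print(downtime, restart_outcome(600, 60, downtime, disarmed=False))
```

Under these assumptions a quick two-minute restart is harmless, but leaving the service down for longer than the timeout guarantees a reset unless the timer was disarmed first.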

 

I understand that you're working with second-hand information ("The people who restarted the services swear..."), but knowing the exact actions taken and the sequence of events would be important here.

 

The operating system and hp-health version information might be good to know too, in case your particular version has known bugs related to the watchdog timer.

MK