
HP Proliant DL360 G4 iLO feature

Nickolay_1
Occasional Contributor


Hello!
Is there any way to run a delayed shutdown of the server from the iLO interface?
I need it to emulate watchdog functionality, because there is no working watchdog timer in the DL360.
6 REPLIES
Stephen Kebbell
Honored Contributor
Solution

Re: HP Proliant DL360 G4 iLO feature

Hi,

What do you want to do? Do you want the server to restart automatically after the OS has hung or crashed? All ProLiants have Automatic Server Recovery (ASR), configurable in the BIOS. You also need the system management driver installed in your OS.

Regards,
Stephen
Nickolay_1
Occasional Contributor

Re: HP Proliant DL360 G4 iLO feature

I downloaded and installed hpasm-7.3.0c-67.rhel4.i386.rpm, and now I see the daemon running in the process list, but no kernel modules are loaded.
I also can't find cpqasm.o and cpqevt.o in the RPM.
Where can I get them? As I understand it, I can't use the ASR functionality without them.

This is the ps -faxwu output:

root 7818 0.4 0.1 24856 1056 ? S 02:36 0:01 hpasmd
root 7823 0.0 0.1 24856 1056 ? S 02:36 0:00 \_ hpasmd
root 7824 0.0 0.1 24856 1056 ? S 02:36 0:00 \_ hpasmd
root 7874 0.0 0.1 24856 1056 ? S 02:36 0:00 \_ hpasmd
root 7876 0.0 0.1 24856 1056 ? S 02:36 0:00 \_ hpasmd
root 7881 0.0 0.1 24856 1056 ? S 02:36 0:00 \_ hpasmd
root 7882 0.0 0.1 24856 1056 ? S 02:36 0:00 \_ hpasmd
root 7883 0.0 0.1 24856 1056 ? S 02:36 0:00 \_ hpasmd
root 7884 0.0 0.1 24856 1056 ? S 02:36 0:00 \_ hpasmd
root 7885 0.0 0.1 24856 1056 ? S 02:36 0:00 \_ hpasmd
root 7886 0.0 0.1 24856 1056 ? S 02:36 0:00 \_ hpasmd
root 7887 0.0 0.1 24856 1056 ? S 02:36 0:00 \_ hpasmd
root 7888 0.0 0.1 24856 1056 ? S 02:36 0:00 \_ hpasmd
Ross Minkov
Esteemed Contributor

Re: HP Proliant DL360 G4 iLO feature


I always install the whole PSP (ProLiant Support Pack). All the versions of the agents & drivers in the PSP are known to work together.

For the delayed shutdown -- is that a Linux or Windows server?

-Ross
Ross Minkov
Esteemed Contributor

Re: HP Proliant DL360 G4 iLO feature


OK, I just saw it -- Linux server.

In this case for a delayed shutdown use:

/sbin/shutdown -h time

The time argument can have different formats. First, it can be an absolute time in the format hh:mm, in which hh is the hour (1 or 2 digits) and mm is the minute of the hour (in two digits). Second, it can be in the format +m, in which m is the number of minutes to wait. The word now is an alias for +0.
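
To make the three formats concrete, here is a rough Python sketch of how they translate into a delay in minutes. It is only an illustration of the formats described above, not the parser that shutdown itself uses; the function name delay_minutes is made up, and treating an hh:mm value that has already passed as "tomorrow" is my assumption.

```python
import re
from datetime import datetime, timedelta


def delay_minutes(time_arg, now):
    """Return whole minutes until shutdown for a shutdown-style time argument.

    Hypothetical helper for illustration; not part of any real tool.
    """
    if time_arg == "now":          # "now" is an alias for +0
        return 0

    relative = re.fullmatch(r"\+(\d+)", time_arg)
    if relative:                   # +m: wait m minutes
        return int(relative.group(1))

    absolute = re.fullmatch(r"(\d{1,2}):(\d{2})", time_arg)
    if absolute:                   # hh:mm: absolute time of day
        hour, minute = int(absolute.group(1)), int(absolute.group(2))
        if hour > 23 or minute > 59:
            raise ValueError("invalid time of day: " + time_arg)
        target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
        if target < now:
            # Assumption: a time that has already passed means tomorrow.
            target += timedelta(days=1)
        return int((target - now).total_seconds() // 60)

    raise ValueError("unrecognized time argument: " + time_arg)


if __name__ == "__main__":
    noon = datetime(2020, 1, 1, 12, 0)
    for arg in ("now", "+5", "12:30"):
        print(arg, "->", delay_minutes(arg, noon), "minutes")
```

So /sbin/shutdown -h +5 and /sbin/shutdown -h 12:05 issued at exactly 12:00 would both ask for a halt five minutes later.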


HTH,
Ross
Nickolay_1
Occasional Contributor

Re: HP Proliant DL360 G4 iLO feature

Hello Ross!

You misunderstood me! :)
I need a hardware feature for delayed shutdown (a watchdog timer).
Ross Minkov
Esteemed Contributor

Re: HP Proliant DL360 G4 iLO feature


Nickolay,

OK, how about this? ProLiant Automatic Server Recovery (ASR) is a feature that causes the server to restart when a catastrophic operating system error occurs, such as a Linux panic. A system fail-safe timer, the ASR timer, starts when hpasm, the hp Advanced System Management driver, is loaded.

The hp Server Management Drivers and Agents (hpasm) come as part of the ProLiant Support Pack (PSP) for Linux. After hpasm is loaded, it sets the ASR timer to the ASR timeout value. The default is 10 minutes, but it can be changed in the ProLiant ROM-Based Setup Utility (RBSU). If the timer is not reset within the specified time, it is presumed that an operating system fault has occurred. After the timer has expired, it triggers an interrupt, which initiates a system reset.
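
If it helps to picture the behavior, here is a small Python simulation of such a countdown timer. This is only a model of the idea, not the hpasm driver or any HP interface: the class and function names are invented, and the 60-second reload interval is an assumption. Only the 10-minute default timeout comes from the description above.

```python
class AsrTimer:
    """Toy model of a hardware countdown (watchdog) timer."""

    def __init__(self, timeout_s=600):     # 600 s = the 10-minute default
        self.timeout_s = timeout_s
        self.remaining_s = timeout_s
        self.expired = False

    def reload(self):
        """Reset the countdown, as a healthy driver would do periodically."""
        if not self.expired:
            self.remaining_s = self.timeout_s

    def tick(self, seconds=1):
        """Advance time; once the counter reaches zero the timer stays expired."""
        if self.expired:
            return
        self.remaining_s = max(0, self.remaining_s - seconds)
        if self.remaining_s == 0:
            self.expired = True


def simulate(total_s, hang_at_s, reload_every_s=60, timeout_s=600):
    """Return the second at which the reset would fire, or None if it never does.

    hang_at_s is the second at which the OS stops reloading the timer
    (None means the OS never hangs). reload_every_s is an assumed interval.
    """
    timer = AsrTimer(timeout_s)
    for t in range(1, total_s + 1):
        timer.tick()
        if timer.expired:
            return t
        os_alive = hang_at_s is None or t < hang_at_s
        if os_alive and t % reload_every_s == 0:
            timer.reload()
    return None


if __name__ == "__main__":
    print("healthy OS:", simulate(3600, hang_at_s=None))
    print("OS hangs at 300 s, reset at:", simulate(3600, hang_at_s=300))
```

In this model the reset fires one full timeout after the last successful reload, not after the moment the OS actually hung, so the real delay depends on how often the timer is reloaded.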


The following text was copied from the hpasm man page.

-----------------------------------------------

HP ProLiant Automatic Server Recovery (ASR) Feature
The Automatic Server Recovery is implemented using a "heartbeat" timer that continually counts down. The hpasm driver frequently reloads the counter to prevent it from counting down to zero. If the ASR timer counts down to 0, it is assumed that the operating system is locked up and the system automatically attempts to reboot. Events which may contribute to the operating system locking up include:

* A peripheral device (such as a PCI adapter) failing in such a way that numerous spurious interrupts are generated.

* A high priority software application consumes all the available CPU cycles and does not allow the operating system scheduler to run the ASR timer reset process.

* A software or kernel application consumes all available memory including the virtual memory space (i.e. swap). This may cause the operating system scheduler to cease functioning.

* A critical operating system component such as a file system fails and causes the operating system scheduler to cease functioning.

* There are certain Linux kernels which will lock up in the "wait_on_irq" function under heavy network activity. Additionally, earlier releases of the Linux EXT3 file systems were known to cause the Linux operating system to cease scheduling for extended periods of time. These types of issues will cause the Linux kernel to stop scheduling processes and effectively lock up the system. The Hewlett-Packard Company continues to work closely with our Linux operating system partners to quickly identify and resolve these types of issues.

* Any other event besides an ASR timeout which causes a Non-Maskable Interrupt (NMI) to be generated.

The ProLiant ASR feature is a hardware based timer. If a true hardware failure occurs, the ProLiant Advanced Server Management driver might not be called but the server will be reset as if the power switch was pressed. The ProLiant ROM code may log an event to the ProLiant Integrated Management Log (IML) when the server reboots.

The ProLiant Advanced Server Management driver is notified via a Non-Maskable Interrupt (NMI). If possible, the driver will attempt to perform the following actions:

* Displays a message on the console stating the problem

* Makes an entry in the ProLiant Integrated Management Log (IML).

* Attempts to gracefully shutdown the operating system to close the file systems.

There is not a guarantee that the operating system will gracefully shutdown. This depends on the type (software or hardware) and severity of the error condition. There is more information about the ProLiant Advanced Server Recovery (ASR) feature later on in this document.


Using the ASR timer as a Linux debug tool
The ASR timer will generate a Non-Maskable Interrupt (NMI) a few seconds before the ProLiant server is reset. The HP ASM driver will be called directly by the processor and will attempt to source the cause of the NMI.

The HP ProLiant Automatic Server Recovery (ASR) process will log a message that the ASR has been initiated, attempt to force normal Linux shutdown and if the Linux shutdown is successful, the HP ProLiant Systems Management driver will log a message to the IML indicating a good shutdown. The HP ProLiant ROM will check a status bit on the ASM hardware to see if an ASR event took place and will log a message to the IML as such.

The first message to be logged to the IML will be: "ASR Lockup Detected: (casm device driver alerted)". This message indicates that the NMI handler code of the hpasm driver was able to execute. If this message is not present but the "ASR Detected by System ROM" message IS present, this is an indication that the NMI handler code of the hpasm driver was not able to execute. The two primary events that prevent the hpasm NMI handler from executing are:

* An uncorrectable ("double-bit") ECC memory error has occurred in the memory area occupied by the hpasm driver. You can try moving the memory around to different slots to see if you can isolate the issue to a particular DIMM.

* A critical PCI or Processor error has occurred. This could stop either memory fetches or processor instructions from being executed.

In most cases when only the "ASR Detected by System ROM" message is logged to the IML, the problem is usually an uncorrectable ECC memory error. If both messages are logged to the IML, this is usually an indication of a software (e.g. Linux Kernel issue) lockup. The ASR event is always a reaction to another event that has caused the Linux scheduler to stop executing. Using tools such as "sar" in conjunction with enabling the CASM_NMI_DEBUG code can assist in making a determination of what may be creating the conditions to generate an ASR event.

-------------------------
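
The rule of thumb from the end of that excerpt (which of the two IML messages are present) can be written down as a short Python sketch. It assumes you already have the IML entries as plain text lines; how you export them is up to you. The function name and the result strings are made up for illustration, and only the two message texts come from the man page.

```python
# Message fragments quoted in the man page excerpt.
DRIVER_MSG = "ASR Lockup Detected"
ROM_MSG = "ASR Detected by System ROM"


def classify_asr_event(iml_messages):
    """Apply the man page's rule of thumb to a list of IML message strings.

    Hypothetical helper; the returned labels are illustrative only.
    """
    driver_seen = any(DRIVER_MSG in line for line in iml_messages)
    rom_seen = any(ROM_MSG in line for line in iml_messages)

    if driver_seen and rom_seen:
        # The NMI handler ran: usually a software (e.g. kernel) lockup.
        return "software lockup likely"
    if rom_seen:
        # The NMI handler could not run: usually an uncorrectable ECC error.
        return "uncorrectable ECC memory error likely"
    if driver_seen:
        # Not covered by the excerpt; flag it rather than guess.
        return "driver message only, not covered by the man page"
    return "no ASR event logged"


if __name__ == "__main__":
    sample = [
        "ASR Lockup Detected: (casm device driver alerted)",
        "ASR Detected by System ROM",
    ]
    print(classify_asr_event(sample))
```

Treat the result as a starting point only; the man page itself says "usually", and a critical PCI or processor error can also keep the NMI handler from running.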

HTH,
Pointless Ross :(