
ASR Detected by System ROM

 
Frederic Barrat
Occasional Visitor

We have a ProLiant ML370 G2 running Windows 2000 Server SP4 that rebooted unexpectedly a couple of times during 2006, for no apparent reason. Curiously, it always happens at around the same time, 10:04 PM (but not on the same day of the week, for example).

1 ASR Detected by System ROM 10/23/2006 10:06PM 10/24/2006 2:56PM 2
2 Blue Screen Trap (BugCheck, STOP: 0x00000050 (0xFFFFFFF5, 0x00000000, 0x80451758, 0x00000000)) 10/23/2006 10:04PM

Maybe someone has an idea of what I can do to fix this problem.

regards

Frédéric
4 REPLIES
Daniel Leblanc
Honored Contributor

Re: ASR Detected by System ROM

What is your BIOS version?
The current one is (2004.05.01):
http://h18023.www1.hp.com/support/files/server/us/locate/20_1341.html#2

I would try flashing it (even if you already have the current one), and if you have the management home page installed, check its logs.
Daniel Leblanc
Honored Contributor

Re: ASR Detected by System ROM

FYI:

The ProLiant Automatic Server Recovery (ASR) is a feature that causes the server to restart when a catastrophic operating system error occurs, such as a Windows 2000 Server SP4 panic (Stop error). A system fail-safe timer, the ASR timer, starts when hpasm, the hp Advanced System Management driver, is loaded.

The hp Server Management Drivers and Agents (hpasm) come as part of the ProLiant Support Pack (PSP) for Windows 2000 Server SP4. After hpasm is loaded, it sets the ASR timer to the ASR timeout value. The default is 10 minutes, but it can be changed in the ProLiant ROM-Based Setup Utility (RBSU). If the timer is not reset within the specified time, it is presumed that an operating system fault has occurred. After the timer has expired, it triggers an interrupt, which initiates a system reset.
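
To make the timer behaviour concrete, here is a minimal Python sketch of a countdown fail-safe timer. It is only a toy model, not HP's implementation: the names `WatchdogTimer` and `simulate` are made up for illustration, the 600-second timeout mirrors the 10-minute default mentioned above, and the 60-second reload interval is purely an assumption.

```python
class WatchdogTimer:
    """Toy model of a countdown fail-safe timer (not HP's actual code)."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.remaining = timeout_seconds
        self.reset_triggered = False

    def reload(self):
        """Called by the (hypothetical) driver while the OS is still healthy."""
        if not self.reset_triggered:
            self.remaining = self.timeout_seconds

    def tick(self, seconds=1):
        """Advance time; once the counter reaches zero, flag a system reset."""
        if self.reset_triggered:
            return
        self.remaining = max(0, self.remaining - seconds)
        if self.remaining == 0:
            self.reset_triggered = True


def simulate(timeout_seconds, reload_interval, hang_at, total_seconds):
    """Return the second at which the reset fires, or None if it never does.

    The driver reloads the timer every `reload_interval` seconds until the
    simulated operating system hangs at `hang_at` seconds.
    """
    timer = WatchdogTimer(timeout_seconds)
    for now in range(1, total_seconds + 1):
        timer.tick()
        if timer.reset_triggered:
            return now
        # The reload only happens while the OS is still scheduling the driver.
        if now < hang_at and now % reload_interval == 0:
            timer.reload()
    return None


# OS hangs 1000 s in; the last reload was at 960 s, so the reset fires 600 s later.
print(simulate(600, 60, hang_at=1000, total_seconds=3600))
```

The point of the sketch is that the reset does not fire when the hang starts, but one full timeout after the last successful reload.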


The following text was copied from the hpasm man page.

-----------------------------------------------

HP ProLiant Automatic Server Recovery (ASR) Feature
The Automatic Server Recovery is implemented using a "heartbeat" timer that continually counts down. The hpasm driver frequently reloads the counter to prevent it from counting down to zero. If the ASR timer counts down to 0, it is assumed that the operating system is locked up and the system automatically attempts to reboot. Events which may contribute to the operating system locking up include:

* A peripheral device (such as a PCI adapter) failing in such a way that numerous spurious interrupts are generated.

* A high priority software application consumes all the available CPU cycles and does not allow the operating system scheduler to run the ASR timer reset process.

* A software or kernel application consumes all available memory including the virtual memory space (i.e. swap). This may cause the operating system scheduler to cease functioning.

* A critical operating system component such as a file system fails and causes the operating system scheduler to cease functioning.

* There are certain Linux kernels which will lock up in the "wait_on_irq" function under heavy network activity. Additionally, earlier releases of the Linux EXT3 file systems were known to cause the Linux operating system to cease scheduling for extended periods of time. These types of issues will cause the Linux kernel to stop scheduling processes and effectively lock up the system. The Hewlett-Packard Company continues to work closely with our Linux operating system partners to quickly identify and resolve these types of issues.

* Any other event besides an ASR timeout which causes a Non-Maskable Interrupt (NMI) to be generated.

The ProLiant ASR feature is a hardware-based timer. If a true hardware failure occurs, the ProLiant Advanced Server Management driver might not be called but the server will be reset as if the power switch was pressed. The ProLiant ROM code may log an event to the ProLiant Integrated Management Log (IML) when the server reboots.

The ProLiant Advanced Server Management driver is notified via a Non-Maskable Interrupt (NMI). If possible, the driver will attempt to perform the following actions:

* Displays a message on the console stating the problem

* Makes an entry in the ProLiant Integrated Management Log (IML).

* Attempts to shut down the operating system gracefully to close the file systems.

There is no guarantee that the operating system will shut down gracefully. This depends on the type (software or hardware) and severity of the error condition. There is more information about the ProLiant Automatic Server Recovery (ASR) feature later on in this document.


Using the ASR timer as a Windows 2000 Server SP4 debug tool
The ASR timer will generate a Non-Maskable Interrupt (NMI) a few seconds before the ProLiant server is reset. The HP ASM driver will be called directly by the processor and will attempt to source the cause of the NMI.

The HP ProLiant Automatic Server Recovery (ASR) process will log a message that the ASR has been initiated and attempt to force a normal operating system shutdown. If the shutdown is successful, the HP ProLiant Systems Management driver will log a message to the IML indicating a good shutdown. The HP ProLiant ROM will check a status bit on the ASM hardware to see if an ASR event took place and will log a message to the IML as such.

The first message to be logged to the IML will be: "ASR Lockup Detected: (casm device driver alerted)". This message indicates that the NMI handler code of the hpasm driver was able to execute. If this message is not present but the "ASR Detected by System ROM" message IS present, this is an indication that the NMI handler code of the hpasm driver was not able to execute. The two primary events that prevent the hpasm NMI handler from executing are:

* An uncorrectable ("double-bit") ECC memory error has occurred in the memory area occupied by the hpasm driver. You can try moving the memory around to different slots to see if you can isolate the issue to a particular DIMM.

* A critical PCI or Processor error has occurred. This could stop either memory fetches or processor instructions from being executed.

In most cases when only the "ASR Detected by System ROM" message is logged to the IML, the problem is an uncorrectable ECC memory error. If both messages are logged to the IML, this is usually an indication of a software (e.g. Linux Kernel issue) lockup. The ASR event is always a reaction to another event that has caused the Linux scheduler to stop executing. Using tools such as "sar" in conjunction with enabling the CASM_NMI_DEBUG code can assist in making a determination of what may be creating the conditions to generate an ASR event.
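
The rule of thumb from the man page text above can be written down as a small Python sketch. The function name `classify_asr_event` and the wording of the returned strings are made up for illustration; the two message texts are the ones quoted above. It only encodes the man page's heuristic and is not a diagnostic tool.

```python
LOCKUP_MSG = "ASR Lockup Detected"
ROM_MSG = "ASR Detected by System ROM"


def classify_asr_event(iml_descriptions):
    """Apply the man page's rule of thumb to a list of IML description strings."""
    lowered = [d.lower() for d in iml_descriptions]
    has_lockup = any(LOCKUP_MSG.lower() in d for d in lowered)
    has_rom = any(ROM_MSG.lower() in d for d in lowered)

    if has_rom and has_lockup:
        # The driver's NMI handler ran, so memory and CPU were still usable.
        return "likely software lockup (driver NMI handler ran)"
    if has_rom:
        # Only the ROM noticed: the NMI handler could not execute.
        return "likely hardware fault, e.g. uncorrectable ECC memory error"
    if has_lockup:
        return "lockup message only (no ROM entry found)"
    return "no ASR event found"


# The two entries from the original post:
entries = [
    "ASR Detected by System ROM",
    "Blue Screen Trap (BugCheck, STOP: 0x00000050 "
    "(0xFFFFFFF5, 0x00000000, 0x80451758, 0x00000000))",
]
print(classify_asr_event(entries))
```

For the log in the original post, only the ROM message is present, so by this heuristic the memory would be the first thing to check.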
Daniel Leblanc
Honored Contributor

Re: ASR Detected by System ROM

Stop 0x00000050 or PAGE_FAULT_IN_NONPAGED_AREA


This Stop message, also known as Stop 0x50, occurs when requested data is not found in memory. The system generates a fault, which normally indicates that the system looks for data in the paging file. In this circumstance, however, the missing data is identified as being located within an area of memory that cannot be paged out to disk. The system faults but cannot find the data, and is unable to recover. Faulty hardware, a buggy system service, antivirus software, and a corrupted NTFS volume can all generate this type of error.

Interpreting the Message
The four parameters listed in the message are defined in order of appearance as follows:

1. Virtual address that caused the fault

2. Type of access (0 = read operation, 1 = write operation)

3. If not zero, the instruction address that referenced the address in parameter 1

4. Opaque information about the stop, interpreted by the kernel
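
As an illustration of the parameter list above, here is a minimal Python sketch that pulls the four parameters out of a Stop 0x50 line such as the one in the original post. The function name `decode_stop_0x50` and the dictionary keys are made up for illustration, and the regular expression assumes the "STOP: code (p1, p2, p3, p4)" layout seen in that log entry.

```python
import re

# Matches "STOP: 0x... (0x..., 0x..., 0x..., 0x...)" as it appears in the IML entry.
STOP_PATTERN = re.compile(
    r"STOP:\s*(0x[0-9A-Fa-f]+)\s*\(\s*(0x[0-9A-Fa-f]+)\s*,\s*(0x[0-9A-Fa-f]+)\s*,"
    r"\s*(0x[0-9A-Fa-f]+)\s*,\s*(0x[0-9A-Fa-f]+)\s*\)"
)


def decode_stop_0x50(text):
    """Label the four Stop 0x50 parameters according to the list above."""
    match = STOP_PATTERN.search(text)
    if match is None:
        raise ValueError("no STOP code found in text")
    code, p1, p2, p3, p4 = (int(group, 16) for group in match.groups())
    if code != 0x50:
        raise ValueError(f"not a Stop 0x50 message: 0x{code:08X}")
    access = {0: "read", 1: "write"}.get(p2, "unknown")
    return {
        "faulting_virtual_address": f"0x{p1:08X}",
        "access_type": access,
        # Parameter 3 is only meaningful when it is not zero.
        "instruction_address": f"0x{p3:08X}" if p3 != 0 else None,
        "opaque": f"0x{p4:08X}",
    }


line = ("Blue Screen Trap (BugCheck, STOP: 0x00000050 "
        "(0xFFFFFFF5, 0x00000000, 0x80451758, 0x00000000))")
print(decode_stop_0x50(line))
```

Applied to the entry from the original post, this reads as a read access to virtual address 0xFFFFFFF5, referenced by the instruction at 0x80451758.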


Resolving the Problem
Faulty hardware. Stop 0x50 usually occurs after the installation of faulty hardware or in the event of failure of installed hardware (usually related to defective RAM, be it main memory, L2 RAM cache, or video RAM). If hardware has been added to the system recently, remove it to see if the error recurs. If existing hardware has failed, remove or replace the faulty component. You need to run hardware diagnostics supplied by the system manufacturer. For details on these procedures, see the owner's manual for your computer.

Buggy system service. Often, the installation of a buggy system service is a culprit. Disable the service and confirm that this resolves the error. If so, contact the manufacturer of the system service about a possible update. If the error occurs during system startup, restart your computer, and press F8 at the character-mode screen that displays the prompt "For troubleshooting and advanced startup options for Windows 2000, press F8." On the resulting Windows 2000 Advanced Options menu, choose the Last Known Good Configuration option. This option is most effective when only one driver or service is added at a time.

Antivirus software. Antivirus software can also trigger this error. Disable the program and confirm that this resolves the error. If it does, contact the manufacturer of the program about a possible update.

Corrupted NTFS volume. A corrupted NTFS volume can also generate this error. Run Chkdsk /f /r to detect and repair disk errors. You must restart the system before the disk scan begins on a system partition. If you cannot start the system due to the error, use the Recovery Console and run Chkdsk /r. For more information about the Recovery Console, see "Troubleshooting Tools and Strategies" in this book. If the hard disk is a SCSI disk, check for problems between the SCSI controller and the disk.

Warning: If your system partition is formatted with the FAT16 file system, the long file names used by Windows 2000 can be damaged if Scandisk or another MS-DOS-based hard disk tool is used to verify the integrity of your hard disk from an MS-DOS prompt. (An MS-DOS prompt is typically derived from an MS-DOS startup disk or from starting MS-DOS on a multiboot system.) Always use the Windows 2000 version of Chkdsk on Windows 2000 disks.

Finally, check the System Log in Event Viewer for additional error messages that might help pinpoint the device or driver that is causing the error. Disabling memory caching of the BIOS might also resolve this error.

Microsoft periodically releases a package of product improvements and problem resolutions for Windows 2000 called a Service Pack. Because many problems are resolved by installing the latest Service Pack, it is recommended that all users install them as they become available. To check which Service Pack, if any, is installed on your system, click Start, click Run, type winver, and then press ENTER. The About Windows 2000 dialog box displays the Windows version number and the version number of the Service Pack, if one has been installed.

Occasionally, remedies to specific problems are developed after the release of a Service Pack. These remedies are called hotfixes. Microsoft does not recommend that you install a post-Service Pack hotfix unless the specific problem it addresses has been encountered. Service Packs include all of the hotfixes released since the release of the previous Service Pack. The status of hotfix installations is not indicated in the About Windows 2000 dialog box. For more information about Service Packs and hotfixes, see "Additional Resources" at the end of this chapter.

For more troubleshooting information about the 0x50 Stop message, refer to the Microsoft Knowledge Base link, using the keywords winnt and 0x00000050. For information about this resource, see "Additional Resources" at the end of this chapter.

http://www.microsoft.com/technet/prodtechnol/Windows2000Pro/reskit/part7/proch33.mspx?mfr=true
sandeep_raman
Honored Contributor

Re: ASR Detected by System ROM

Hello Frédéric,

Disable the ASR option in RBSU and see whether the server is stable.
We'll think of other options based on the above.

SRH