ProLiant Servers (ML,DL,SL)

Unexpected reboots of servers

 
jberner
New Member

Hi all,

I have a problem with some DL380 G4 servers running on Windows Server 2003 SP1+patches.

These servers reboot sometimes and I have no idea why. Below is the entry from the system event log. I searched the system for the mentioned MIB file but didn't find it. Can someone please explain what exactly this message means?

Event Type: Information
Event Source: Server Agents
Event Category: Events
Event ID: 1090
Date: 01.03.2006
Time: 00:28:41
User: N/A
Computer: SDxxxxxxx
Description:
System Information Agent: Health: The server is operational again. The server has previously been shutdown by the Automatic Server Recovery (ASR) feature and has just become operational again.
[SNMP TRAP: 6025 in CPQHLTH.MIB]

Bye

Jochen
4 REPLIES
Igor Karasik
Honored Contributor

Re: Unexpected reboots of servers

Jochen,
Are there any errors in the HP Integrated Management Log viewer?
Do you have the latest server firmware and the latest ProLiant Support Pack installed?
Try disabling ASR as well; with ASR disabled you may get a STOP error instead.
Mark Aspinall
New Member

Re: Unexpected reboots of servers


I have just had the same problem with an ML370 G3 running Windows Server 2003.
I have looked through numerous posts and comments regarding unexpected ASR, but have not found an appropriate answer.
Is it a problem with hardware, software, BIOS, or config? Where do I start to diagnose?
Does HP have an answer to this problem?
There is no info in the HP Management Log Viewer, just the ASR error.

Many Thanks
Mark
jberner
New Member

Re: Unexpected reboots of servers

Hi all,

Sorry for the delay. I've been a bit busy for the last few days.

@Igor: No errors; support pack and firmware are up to date. We can't disable ASR, because the servers are used in big retail stores that need them 24/7. If there had been a STOP error before the ASR restart, it would be mentioned in the event log.

But maybe you can point me to some documentation about what exactly ASR does and how it does it. The point is: most of the reboots seem to happen between 0:00 and 01:30 at night. During that period the server performs some heavy-duty work (DB dump, batches that eat up lots of CPU time and memory, etc.). My suspicion is that the reboots have to do with that, but I can't prove it due to a lack of knowledge about the way ASR works.

Then again the reboots do not happen on all servers and not every night, so maybe I'm wrong.
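
One way to test that suspicion is to collect the timestamps of all Event ID 1090 entries and count how many fall into the nightly window. A minimal Python sketch follows, assuming the date and time format shown in the event above. The function names are hypothetical, and only the first timestamp in the sample list comes from this thread; the other two are made-up placeholders. Note that event 1090 is logged when the server is operational again, so the hang itself would have started some time earlier.

```python
from datetime import datetime, time


def in_window(ts, start=time(0, 0), end=time(1, 30)):
    """True if the timestamp's time of day lies within [start, end]."""
    return start <= ts.time() <= end


def share_in_window(stamps, fmt="%d.%m.%Y %H:%M:%S"):
    """Return (events inside the window, total events) for a list of strings."""
    parsed = [datetime.strptime(s, fmt) for s in stamps]
    hits = [p for p in parsed if in_window(p)]
    return len(hits), len(parsed)


# First entry is from the event log above; the rest are hypothetical examples.
events = [
    "01.03.2006 00:28:41",
    "03.03.2006 01:10:05",
    "07.03.2006 14:42:17",
]

inside, total = share_in_window(events)
print(f"{inside} of {total} ASR recoveries fall between 00:00 and 01:30")
```

If nearly all recoveries across the affected servers land in that window, the batch load is a plausible trigger; if they are spread over the day, it probably is not.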

Bye and thanks for your help

JB
kris rombauts
Honored Contributor

Re: Unexpected reboots of servers

Hi Jochen,

Here is a description of ASR. It is written from the Linux point of view, but the mechanism is the same for any OS.

http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?lang=en&cc=us&objectID=c00060592&jumpid=reg_R1002_USEN

Basically there is an agent that periodically resets a counter circuit in hardware. If the OS hangs, this reset does not occur, the counter reaches zero after a while (e.g. 10 minutes), and the circuitry hard-resets the server via a hardware mechanism and logs an entry in the hardware log.
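
What is described here is essentially a hardware watchdog timer. The following Python sketch is a toy simulation of that idea only: the class and function names are hypothetical, the 10-minute default is just the example figure mentioned above, and the real ASR logic lives in the server hardware and HP's health driver, not in code like this.

```python
class WatchdogTimer:
    """Toy model of an ASR-style countdown timer (simulation only)."""

    def __init__(self, timeout_minutes=10):
        self.timeout_minutes = timeout_minutes
        self.remaining = timeout_minutes
        self.reset_fired = False

    def pet(self):
        """Called by the agent while the OS is healthy: reload the counter."""
        if not self.reset_fired:
            self.remaining = self.timeout_minutes

    def tick(self, minutes=1):
        """Advance simulated time; fire a reset when the counter hits zero."""
        if self.reset_fired:
            return
        self.remaining = max(0, self.remaining - minutes)
        if self.remaining == 0:
            self.reset_fired = True


def simulate(agent_alive_minutes, total_minutes, timeout_minutes=10):
    """The agent reloads the timer every minute until the OS 'hangs'.

    Returns the minute at which the simulated reset fires, or None.
    """
    wd = WatchdogTimer(timeout_minutes)
    for minute in range(1, total_minutes + 1):
        wd.tick(1)
        if wd.reset_fired:
            return minute
        if minute <= agent_alive_minutes:
            wd.pet()
    return None


print(simulate(agent_alive_minutes=5, total_minutes=30))   # hang after minute 5
print(simulate(agent_alive_minutes=30, total_minutes=30))  # agent never hangs
```

In the first call the last reload happens at minute 5, so the counter runs out one full timeout later; in the second call the counter is reloaded every minute and never reaches zero.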

Be aware that after 20 reboots due to ASR, the server will not boot into the OS again until it is manually reset. This limit cannot be changed; see TASKS, Autorecovery in the System Home Page and check the help page there for more detail. The count is only reset to zero when you manually reset the server (so not by a reset caused by an ASR event).


So from what I read above, it looks like the system is experiencing a hang and ASR kicks in. It would be good to check what value the timer is set to. In the help I see this:

"Timeout displays how many minutes ASR will wait before initiating a recovery process. ASR depends on the software support to routinely notify the ASR hardware that the server is operating properly.

To change the timeout setting, use the System Configuration Utility. The time you specify for this field should be a prudent period of time before resetting the system and activating the recovery process after a fault occurs. If the timeout period is set too low on a heavily utilized server, the timeout could occur before the software support has time to service the timer."


This timer setting can be changed in the F9 BIOS setup utility at power-up. If it is set too low, the health agent may not get enough priority to reset the hardware timer in time, so consider increasing the timer value and monitoring again.
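
To illustrate why a longer timeout can help on a heavily loaded server, here is a small Python sketch. It is purely hypothetical: the function name, the reload schedule and the 12-minute starvation gap are invented for illustration, and whether the health agent can really be starved that long on these servers is exactly the open question.

```python
def asr_would_fire(pet_times_min, timeout_min, end_min):
    """Return True if any gap between timer reloads reaches timeout_min.

    pet_times_min: sorted minutes (since boot) at which the agent reloaded
                   the timer.
    end_min:       end of the observation window, in minutes since boot.
    """
    previous = 0  # assumption: the timer starts counting at boot
    for t in list(pet_times_min) + [end_min]:
        if t - previous >= timeout_min:
            return True
        previous = t
    return False


# Hypothetical schedule: the agent reloads the timer every minute, but gets
# no CPU time between minute 30 and minute 42 while a heavy batch job runs.
pets = list(range(1, 31)) + list(range(42, 61))

print(asr_would_fire(pets, timeout_min=10, end_min=60))  # 12-minute gap, 10-minute timeout
print(asr_would_fire(pets, timeout_min=15, end_min=60))  # same gap, 15-minute timeout
```

With the shorter timeout the 12-minute gap is enough to trigger a reset, while the longer timeout rides it out. A genuine OS hang would still be caught, only later.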


HTH

Kris