Jim Serio
Occasional Contributor

HP Insight Management Agents Trap Alarm

I have the latest Management Agents installed on a RH 7.3 box. It has been running fine for the last 5 months. However, I came in tonight to find that the system had been rebooted (I noticed the "crash" entry next to my login from earlier in the day). Syslog showed no problems, other than that the system seemed to have been down for over 30 minutes. Odd.

Then I checked my email and found the following message:

-----
Subject: HP Insight Management Agents Trap Alarm

Trap-ID=6025

An 'ASR Recover Complete' trap signifies that the system has been shutdown by the ASR feature and has just become operational again.
-----

There was also an identical message sent at the same time with a different subject line: HP Agent Trap Alert

I'm unable to find ANY log on this machine that suggests what the problem was. Does anyone have an idea, or can anyone point me to somewhere I might find the answer?

Jim
6 REPLIES
Jared Middleton
Frequent Advisor

Re: HP Insight Management Agents Trap Alarm

Jim,
What kind of server do you have?
Have you changed/upgraded anything recently?

I have two HP/Compaq DL 580 G2 servers that used to be plagued with Automatic Server Reboot (ASR) at random times. There are probably many potential conflicts that could cause this symptom, though. It may take some serious diagnostic work to narrow down YOUR exact cause.

My particular problem had to do with an APIC issue - some conflict between the system BIOS and the kernel revision level - I believe due to HP updates lagging behind Red Hat patches. In my case, the solution was to add the "noapic" option to the kernel line in the GRUB boot loader config file, and then reboot. No more strange reboots in the last six months.
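
If you try the same workaround, it is worth confirming after the reboot that the kernel really picked up the option. Here is a minimal Python sketch of that check. It assumes the running kernel's boot parameters can be read from /proc/cmdline, as on standard Linux systems; the helper names has_boot_option and read_cmdline are made up for illustration and are not part of any HP or Red Hat tool.

```python
def has_boot_option(cmdline: str, option: str) -> bool:
    """Return True if `option` appears as a whole word on a kernel command line."""
    # Splitting on whitespace avoids false matches such as "nolapic" vs "noapic".
    return option in cmdline.split()


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the boot parameters of the running kernel (assumed path)."""
    with open(path) as f:
        return f.read()


if __name__ == "__main__":
    try:
        print("noapic active:", has_boot_option(read_cmdline(), "noapic"))
    except OSError as exc:
        print("could not read kernel command line:", exc)
```

If this reports False after the reboot, the option most likely went onto the wrong line of the boot loader config or into an entry that is not the default.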

It took me a few weeks to arrive at that solution because of several "wild goose chases" and the fact that it sometimes takes days to test whether a particular suggestion solved the problem.

Happy detective work!

Regards,
Jared
Fred G. Claypool, Jr.
Frequent Advisor

Re: HP Insight Management Agents Trap Alarm

Oh my goodness! We've searched high-and-low for anyone else who's experienced this problem, and this is the first time we've found anyone else.

We've had this problem for quite some time as well. We've been saving the generated emails during the last couple of spontaneous reboots: 8/27/2003, 9/14/2003, and 10/15/2003. There may be a correlation regarding the use of a certain tape drive, but that's only a theory right now.

This machine is an HP (Compaq) ML530 G2 running SuSE Linux Enterprise Server 7 for IA-32.

I'll update this message if I discover any additional clues. If either of you finds anything, please email me too. Thanks!
Experience gained while correcting a previous mistake is the best teacher imaginable!
Steven_94
New Member

Re: HP Insight Management Agents Trap Alarm

I have a DL380 G1 that has been running RH 7.2 and never had a problem until this morning, when a weird reboot took place and I found the 2 e-mails noting this exact error.

Will have to check for BIOS updates and see if that might be the issue. The only thing that changes on this server is the updates provided by RH.
Fred G. Claypool, Jr.
Frequent Advisor

Re: HP Insight Management Agents Trap Alarm

Everyone,

We have a workaround (or a fix?) to this problem. I've found two postings regarding this topic, so I'll post this response in both places.

Our system administrator remembers reading somewhere that Processor HyperThreading may cause issues with certain devices. We disabled Processor HyperThreading (via BIOS) on 10/29/2003. Prior to this, we had experienced random "spontaneous reboots" while backing up our Informix database -- in some cases, as many as three reboots during one backup attempt cycle. Since disabling the Processor HyperThreading, however, we've not experienced any other "spontaneous reboots" and all of our backups have performed just fine.

We're considering this a success right now, but if anything changes on our side, we'll update this note.
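
For anyone who wants to double-check from the OS side that a BIOS change like this took effect, here is a rough Python sketch. It relies on an assumption that may not hold everywhere: that /proc/cpuinfo exposes both a "siblings" and a "cpu cores" field, which older kernels may not do. When a field is missing, the function returns None rather than guessing. The function name hyperthreading_enabled is made up for this example.

```python
from typing import Optional


def hyperthreading_enabled(cpuinfo: str) -> Optional[bool]:
    """Guess whether HyperThreading is active from /proc/cpuinfo text.

    Heuristic: more logical siblings than physical cores per package
    suggests HyperThreading is on. Returns None if either field is absent.
    """
    siblings = cores = None
    for line in cpuinfo.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between processors
        key = key.strip()
        if key == "siblings":
            siblings = int(value)
        elif key == "cpu cores":
            cores = int(value)
    if siblings is None or cores is None:
        return None
    return siblings > cores


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print("HyperThreading looks enabled:", hyperthreading_enabled(f.read()))
    except OSError as exc:
        print("could not read /proc/cpuinfo:", exc)
```

A result of None only means the kernel does not report the two fields, so the BIOS setup screen remains the authoritative place to check.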

Thanks.
Olivier Drouin
Trusted Contributor

Re: HP Insight Management Agents Trap Alarm

Maybe temperature is the trouble? I know overheating is one of the conditions that makes ASR reboot a server.
Ross Minkov
Esteemed Contributor

Re: HP Insight Management Agents Trap Alarm

The Automatic Server Recovery is implemented using a "heartbeat" timer that continually counts down. The hpasm driver frequently reloads the counter to prevent it from counting down to zero. If the ASR timer counts down to 0, it is assumed that the operating system is locked up and the system automatically attempts to reboot. Events which may contribute to the operating system locking up include:

* A peripheral device (such as a PCI adapter) failing in such a way that numerous spurious interrupts are generated.

* A high priority software application consumes all the available CPU cycles and does not allow the operating system scheduler to run the ASR timer reset process.

* A software or kernel application consumes all available memory including the virtual memory space (i.e. swap). This may cause the operating system scheduler to cease functioning.

* A critical operating system component such as a file system fails and causes the operating system scheduler to cease functioning.

* There are certain Linux kernels which will lock up in the "wait_on_irq" function under heavy network activity. Additionally, earlier releases of the Linux EXT3 file systems were known to cause the Linux operating system to cease scheduling for extended periods of time. These types of issues will cause the Linux kernel to stop scheduling processes and effectively lock up the system. The Hewlett-Packard Company continues to work closely with our Linux operating system partners to quickly identify and resolve these types of issues.

* Any other event besides an ASR timeout which causes a Non-Maskable Interrupt (NMI) to be generated.
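
To make the heartbeat idea at the top of this post concrete, here is a toy Python model of a countdown timer that gets reloaded while the OS is healthy. This is only an illustration and not how hpasm is actually written: the class HeartbeatTimer, the function run, the tick counts and the reload interval are all invented for the example.

```python
from typing import Optional


class HeartbeatTimer:
    """Toy countdown watchdog (illustrative only, not the hpasm driver)."""

    def __init__(self, timeout_ticks: int) -> None:
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks

    def reload(self) -> None:
        """What a healthy OS does periodically: restart the countdown."""
        self.remaining = self.timeout_ticks

    def tick(self) -> bool:
        """Count down one tick; return True once the timer has reached zero."""
        if self.remaining > 0:
            self.remaining -= 1
        return self.remaining == 0


def run(timer: HeartbeatTimer, total_ticks: int,
        os_hangs_at: Optional[int] = None, reload_every: int = 2) -> Optional[int]:
    """Return the tick at which a reboot would be triggered, or None if never."""
    for t in range(1, total_ticks + 1):
        os_alive = os_hangs_at is None or t < os_hangs_at
        # A locked-up OS can no longer schedule the reload, so the timer runs out.
        if os_alive and t % reload_every == 0:
            timer.reload()
        if timer.tick():
            return t
    return None


if __name__ == "__main__":
    print("healthy OS:", run(HeartbeatTimer(5), 20))
    print("OS hangs at tick 6:", run(HeartbeatTimer(5), 20, os_hangs_at=6))
```

With these made-up numbers the healthy run never expires (None), while the run that hangs at tick 6 expires at tick 8, five ticks after the last reload at tick 4. Every cause in the list above does the same thing to the model: it stops the reload from happening.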


HTH,
Ross