
Unexplained Reboots - DL385 RHEL4 AS x86_64

 
Richard Jones_7
Occasional Visitor

We have two DL385s that appear to be rebooting randomly every few days. They are running the latest RH and HP updates, drivers and firmware. The following appears in the logs every time this happens.

May 2 06:43:39 kernel: Uhhuh. NMI received. Dazed and confused, but trying to continue
May 2 06:43:39 kernel: Uhhuh. NMI received. Dazed and confused, but trying to continue
May 2 06:43:39 kernel: You probably have a hardware problem with your RAM chips
May 2 06:43:39 kernel: Uhhuh. NMI received. Dazed and confused, but trying to continue
May 2 06:43:39 kernel: You probably have a hardware problem with your RAM chips
May 2 06:43:39 kernel: Uhhuh. NMI received. Dazed and confused, but trying to continue
May 2 06:43:39 kernel: You probably have a hardware problem with your RAM chips
May 2 06:43:39 kernel: You probably have a hardware problem with your RAM chips
May 2 06:43:39 hpasmd[3730]: WARNING: hpasmd: ASR Lockup Detected: (casm device driver alerted)
May 2 06:43:39 shutdown: shutting down for system reboot
May 2 06:43:40 init: Switching to runlevel: 6


The system then reboots cleanly. Hardware diagnostics and additional memory testing all pass OK. The reboots do not appear to be related to load as they have occurred when the systems have been idle.
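For anyone comparing several machines, a small script can tally how often this message pattern shows up in a syslog file. The following is only a sketch: the regular expressions are derived from the excerpt above, and the names `PATTERNS`, `count_events` and `SAMPLE` are made up for illustration. Message formats may differ between hpasm versions, so treat the patterns as assumptions.

```python
import re
from collections import Counter

# Patterns derived from the log excerpt in this post (an assumption:
# other driver or kernel versions may word these messages differently).
PATTERNS = {
    "nmi": re.compile(r"kernel: Uhhuh\. NMI received"),
    "asr": re.compile(r"hpasmd\[\d+\]: .*ASR Lockup Detected"),
    "reboot": re.compile(r"shutdown: shutting down for system reboot"),
}


def count_events(lines):
    """Count NMI, ASR and reboot messages in an iterable of syslog lines."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# A few lines copied from the excerpt above, used as a self-check.
SAMPLE = [
    "May 2 06:43:39 kernel: Uhhuh. NMI received. Dazed and confused, but trying to continue",
    "May 2 06:43:39 kernel: You probably have a hardware problem with your RAM chips",
    "May 2 06:43:39 hpasmd[3730]: WARNING: hpasmd: ASR Lockup Detected: (casm device driver alerted)",
    "May 2 06:43:39 shutdown: shutting down for system reboot",
]

if __name__ == "__main__":
    print(dict(count_events(SAMPLE)))
```

To run it against a real log, pass an open file instead of `SAMPLE`, for example `count_events(open("/var/log/messages"))`; that path is the usual RHEL default but is an assumption about your setup.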

28 REPLIES
Vipulinux
Respected Contributor

Re: Unexplained Reboots - DL385 RHEL4 AS x86_64

Hi
Looking at the logs, it seems to be a RAM issue. Do you use different brands of RAM in the server? In some cases it can also happen if you are using two different sizes of RAM.

Try swapping the RAM and see if that makes a difference. If you have just one module, try using another one.

Cheers
Vipul
Steven E. Protter
Exalted Contributor

Re: Unexplained Reboots - DL385 RHEL4 AS x86_64

Shalom,

I've never seen messages like this, but I don't use 64-bit Linux yet.

I'd bet on a memory issue or you may need to patch the system.

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
TJ Toedebusch
Occasional Advisor

Re: Unexplained Reboots - DL385 RHEL4 AS x86_64

We have seen this and HP support tells us that the NMI is almost always a memory issue. Reseat the memory and I would suggest running memtest and/or SmartStart for diagnostics.

We had HP come in and swap memory to fix it for us.

Re: Unexplained Reboots - DL385 RHEL4 AS x86_64

I had this error on a DL385 with RHEL4 AS x86_64. It rebooted with this same NMI error twice over the span of 4-5 days.

I ran memtest on it overnight and after 13+ passes, no errors were found. I reseated all DIMMs, rebooted into the OS and I'm waiting to see if it happens again.

I just had a second DL385 reboot this morning with this error. I've reseated the DIMMS and I'm running memtest now. I'm guessing no errors will be found.

These are brand new machines and luckily not in production yet, but I'm hoping I don't have a bad run of RAM chips on my hands, or another problem such as system board or CPU.
Steve Burt_1
Advisor

Re: Unexplained Reboots - DL385 RHEL4 AS x86_64

Hi There,

I am having the same issues with two brand-new DL385 64-bit servers. I will post anything interesting that arises from raising a call with HP and Red Hat. :-)

Re: Unexplained Reboots - DL385 RHEL4 AS x86_64

The second system to have this problem has been running Memtest for over 50 hours wall time, 30 passes and 0 errors.

I'm hoping that reseating the DIMMs was sufficient to correct the problem.
Matthew J Warrick
Frequent Advisor

Re: Unexplained Reboots - DL385 RHEL4 AS x86_64

Does the IML list a correctable memory error threshold reached or exceeded?

Probably just some bad RAM... we just deployed about 150 dl385s across several customer sites and haven't seen any pervasive memory issues so far.
"Did you get that memo?"

Re: Unexplained Reboots - DL385 RHEL4 AS x86_64

The only entry in the IML log is:

ASR Lockup Detected: (casm device driver alerted)

No specific reference to a memory problem is made by the IML, only the NMI error reported by the kernel.

Algimantas
Advisor

Re: Unexplained Reboots - DL385 RHEL4 AS x86_64

Could you tell me what version of "HP System Health Application and Insight Management Agents for Red Hat Enterprise Linux 4" you are using?
Algimantas
Advisor

Re: Unexplained Reboots - DL385 RHEL4 AS x86_64

By the way, if it makes you feel better, you are not alone; we have exactly the same situation with SLES9 (kernel 2.6.5-7.244) on two DL385s (2 dual-core CPUs 2.4/1MB, redundant power/fan, two FCA2214, 32GB RAM).
Richard Jones_7
Occasional Visitor

Re: Unexplained Reboots - DL385 RHEL4 AS x86_64

We are using HP System Health Application and Insight Management Agents for Red Hat Enterprise Linux 4 version 7.5.0-184.RHEL4 from Support Pack 7.50

Will try Support Pack 7.52, which includes an updated System Health Application and Insight Management Agents (7.5.1-8.RHEL4), although I don't see any mention of the problems we are seeing in the release notes.
Algimantas
Advisor

Re: Unexplained Reboots - DL385 RHEL4 AS x86_64

One more thing: we tried to reproduce the situation on another DL385 server with one dual-core CPU and 16GB of memory, and that server seems to be running without problems so far.
Steve Burt_1
Advisor

Re: Unexplained Reboots - DL385 RHEL4 AS x86_64

I have just got off the phone with HP, and they have informed me that this is a known problem with the Insight Manager Health Driver, which reboots the server and never captures any logs that are of help.

The current workaround is to shut the server down and turn off ASR in the BIOS. Meanwhile, HP are looking to produce a fix for this.


Re: Unexplained Reboots - DL385 RHEL4 AS x86_64

I had installed the Proliant Support Pack 7.51 when I set these up.

All eight systems are DL385s with dual-core Opteron 270 CPUs. Some have a single DC CPU and some have dual DC CPUs. The servers that have had this error have dual DC CPUs.

I just discovered a third machine had this issue.

I didn't know a new version of the Proliant Support Pack was out. I will try updating that.

Re: Unexplained Reboots - DL385 RHEL4 AS x86_64

A colleague has suggested that the 7.40 version of the ProLiant Support Pack has been more reliable.

I'm either going to downrev to that or remove the PSP altogether.
Algimantas
Advisor

Re: Unexplained Reboots - DL385 RHEL4 AS x86_64

We started with PSP 7.40 :(
Algimantas
Advisor

Re: Unexplained Reboots - DL385 RHEL4 AS x86_64

Has anyone had any success with Support Pack 7.52?
David Kennamer
Occasional Visitor

Re: Unexplained Reboots - DL385 RHEL4 AS x86_64

We had the same issue with DL385's and RHEL AS x86_64. The only workaround we found was to disable ASR. Hopefully HP will fix this soon.

Re: Unexplained Reboots - DL385 RHEL4 AS x86_64

Same problem on our DL385 with RHEL 4 x86_64. Will keep you informed about any fixes or (additional) helping hints, provided by HP.
Mark Addinall
Occasional Visitor

Re: Unexplained Reboots - DL385 RHEL4 AS x86_64


We have the same issue on two of my new DL385s. AMD64 Opteron, RedHat ES 4.2 and HP Toolset 7.4.

I'll follow this thread.

Ta,
Mark.
Algimantas
Advisor

Re: Unexplained Reboots - DL385 RHEL4 AS x86_64

Hello again,

I have received the following suggestion from HP:
"1. In the /etc/sysconfig/powersave/common file, replace this line
POWERSAVE_CPUFREQD_MODULE=""
with
POWERSAVE_CPUFREQD_MODULE="off"
2. Reboot the system"

Since then the systems have been up and running (uptime 11 days).

Maybe it helps.
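If several servers need the change HP suggested, the edit can be scripted. This is a minimal sketch, not something HP supplied: the function name `disable_cpufreqd` is hypothetical, and it assumes the setting appears on a line of its own in the file quoted above. That file path comes from a SLES9 setup, so it may not exist on RHEL.

```python
import re


def disable_cpufreqd(text):
    """Return the config text with POWERSAVE_CPUFREQD_MODULE set to "off".

    Raises ValueError if the setting is not present, so a missing line
    is noticed rather than silently ignored.
    """
    # Match the whole assignment line, keeping any leading spaces or tabs.
    pattern = re.compile(r'^([ \t]*POWERSAVE_CPUFREQD_MODULE=).*$', re.MULTILINE)
    new_text, replaced = pattern.subn(r'\1"off"', text)
    if replaced == 0:
        raise ValueError("POWERSAVE_CPUFREQD_MODULE line not found")
    return new_text


if __name__ == "__main__":
    # Demonstration on an in-memory string; nothing on disk is touched.
    before = 'POWERSAVE_CPUFREQD_MODULE=""\n'
    print(disable_cpufreqd(before), end="")
```

To apply it for real, read `/etc/sysconfig/powersave/common`, keep a backup copy, write the returned text back, and then reboot as step 2 of the suggestion says.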

Re: Unexplained Reboots - DL385 RHEL4 AS x86_64

Hello,

After deactivating ASR, the system has been up and running for 8 days without any errors.

The solution provided by our HP support contact is to apply the latest ProLiant Support Pack, 7.52, in which some related issues were solved.

- Andreas -

Re: Unexplained Reboots - DL385 RHEL4 AS x86_64

I don't think PSP 7.52 is going to be any better. PSP 7.52 contains the same version of the hpasm driver as the 7.51 version, and that seems to be the problematic driver.

Disabling the ASR Lockup Detection in the BIOS does prevent the errors and the reboot from occurring. I'll stick with the workaround for now.
Walt McDaniel
Occasional Contributor

Re: Unexplained Reboots - DL385 RHEL4 AS x86_64

I'd be really interested in knowing if anyone is seeing the same issue on AS 3.0 on the dual-core Opterons. We are seeing numerous ASR reboots across our 100+ dual-core Opterons, and the only thing in the hplog is "ASR Detected by System Rom". There is nothing in the system log to indicate there is a system problem.