Operating System - Linux

Richard Jones_7
New Member

Unexplained Reboots - DL385 RHEL4 AS x86_64

We have two DL385s that appear to be rebooting randomly every few days. They are running the latest RH and HP updates, drivers and firmware. The following appears in the logs every time this happens.

May 2 06:43:39 kernel: Uhhuh. NMI received. Dazed and confused, but trying to continue
May 2 06:43:39 kernel: Uhhuh. NMI received. Dazed and confused, but trying to continue
May 2 06:43:39 kernel: You probably have a hardware problem with your RAM chips
May 2 06:43:39 kernel: Uhhuh. NMI received. Dazed and confused, but trying to continue
May 2 06:43:39 kernel: You probably have a hardware problem with your RAM chips
May 2 06:43:39 kernel: Uhhuh. NMI received. Dazed and confused, but trying to continue
May 2 06:43:39 kernel: You probably have a hardware problem with your RAM chips
May 2 06:43:39 kernel: You probably have a hardware problem with your RAM chips
May 2 06:43:39 hpasmd[3730]: WARNING: hpasmd: ASR Lockup Detected: (casm device driver alerted)
May 2 06:43:39 shutdown: shutting down for system reboot
May 2 06:43:40 init: Switching to runlevel: 6


The system then reboots cleanly. Hardware diagnostics and additional memory testing all pass OK. The reboots do not appear to be related to load as they have occurred when the systems have been idle.
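For anyone comparing several machines, a minimal Python sketch along these lines could tally the relevant messages per timestamp, to see whether the NMI lines always coincide with the ASR lockup and the reboot. The patterns are copied from the excerpt above; the function name `summarize` and the idea of feeding it lines from /var/log/messages are assumptions for illustration, not something confirmed in this thread.

```python
import re
from collections import Counter

# Message patterns taken from the log excerpt above.
# Other kernel or hpasm versions may word these differently.
PATTERNS = {
    "nmi": re.compile(r"kernel: Uhhuh\. NMI received"),
    "ram_hint": re.compile(r"kernel: You probably have a hardware problem with your RAM chips"),
    "asr_lockup": re.compile(r"hpasmd\[\d+\]: .*ASR Lockup Detected"),
    "reboot": re.compile(r"shutdown: shutting down for system reboot"),
}

# Syslog-style timestamp at the start of a line, e.g. "May 2 06:43:39".
TIMESTAMP = re.compile(r"^(\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s")


def summarize(lines):
    """Return {timestamp: Counter(event name -> count)} for matching lines."""
    events = {}
    for line in lines:
        m = TIMESTAMP.match(line)
        if not m:
            continue
        # Normalize padded days ("May  2") to single spaces.
        stamp = " ".join(m.group(1).split())
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                events.setdefault(stamp, Counter())[name] += 1
    return events


# A few lines from the excerpt above, used as sample input.
sample = [
    "May 2 06:43:39 kernel: Uhhuh. NMI received. Dazed and confused, but trying to continue",
    "May 2 06:43:39 kernel: You probably have a hardware problem with your RAM chips",
    "May 2 06:43:39 hpasmd[3730]: WARNING: hpasmd: ASR Lockup Detected: (casm device driver alerted)",
    "May 2 06:43:39 shutdown: shutting down for system reboot",
    "May 2 06:43:40 init: Switching to runlevel: 6",
]

for stamp, counts in summarize(sample).items():
    print(stamp, dict(counts))
```

To run it against a real log, the lines of the file could be passed in place of `sample`, for example `summarize(open("/var/log/messages"))`, assuming that is where syslog writes on the system in question.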

28 REPLIES
Vipulinux
Respected Contributor

Re: Unexplained Reboots - DL385 RHEL4 AS x86_64

Hi
Looking at the logs, it seems to be a RAM issue. Do you use different brands of RAM in the server? In some cases it can also happen if you are using two different sizes of RAM.

Try swapping the RAM and see if that makes a difference. If you have only one DIMM, try using another one.

Cheers
Vipul
Steven E. Protter
Exalted Contributor

Re: Unexplained Reboots - DL385 RHEL4 AS x86_64

Shalom,

I've never seen messages like this, but I don't use 64-bit Linux yet.

I'd bet on a memory issue or you may need to patch the system.

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
TJ Toedebusch
Occasional Advisor

Re: Unexplained Reboots - DL385 RHEL4 AS x86_64

We have seen this and HP support tells us that the NMI is almost always a memory issue. Reseat the memory and I would suggest running memtest and/or SmartStart for diagnostics.

We had HP come in and swap memory to fix it for us.

Re: Unexplained Reboots - DL385 RHEL4 AS x86_64

I had this error on a DL385 with RHEL4 AS x86_64. It rebooted with this same NMI error twice over the span of 4-5 days.

I ran memtest on it overnight and after 13+ passes, no errors were found. I reseated all DIMMs, rebooted into the OS and I'm waiting to see if it happens again.

I just had a second DL385 reboot this morning with this error. I've reseated the DIMMs and I'm running memtest now. I'm guessing no errors will be found.

These are brand new machines and luckily not in production yet, but I'm hoping I don't have a bad run of RAM chips on my hands, or another problem such as system board or CPU.
Steve Burt_1
Advisor

Re: Unexplained Reboots - DL385 RHEL4 AS x86_64

Hi There,

I am having the same issues with two brand new DL385 64-bit servers. I will post anything interesting that arises from raising a call with HP and Red Hat. :-)

Re: Unexplained Reboots - DL385 RHEL4 AS x86_64

The second system to have this problem has been running Memtest for over 50 hours wall time, 30 passes and 0 errors.

I'm hoping that reseating the DIMMs was sufficient to correct the problem.
Matthew J Warrick
Frequent Advisor

Re: Unexplained Reboots - DL385 RHEL4 AS x86_64

Does the IML list a correctable memory error threshold reached or exceeded?

Probably just some bad RAM... we just deployed about 150 DL385s across several customer sites and haven't seen any pervasive memory issues so far.
"Did you get that memo?"

Re: Unexplained Reboots - DL385 RHEL4 AS x86_64

The only entry in the IML log is:

ASR Lockup Detected: (casm device driver alerted)

No specific reference to a memory problem is made by the IML, only the NMI error reported by the kernel.

Algimantas
Advisor

Re: Unexplained Reboots - DL385 RHEL4 AS x86_64

Could you tell me what version of "HP System Health Application and Insight Management Agents for Red Hat Enterprise Linux 4" you are using?