Operating System - Linux

vertisunix
Occasional Advisor

Machine Check Exception, DL385, Dual Core, RHEL4 x86_64

Two dual-core 2.6 GHz DL385 servers running RHEL4 U4/x86_64 are crashing continuously.

The recurring kernel messages are:
--
CPU 2: Machine Check Exception: 4 Bank 4: f64a20024e080813
TSC 19e06d8282d ADDR 1f26b8750
Kernel panic - not syncing: Machine check
--
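For anyone reading the panic line above: the 64-bit value after "Bank 4:" is the machine-check status word for that bank. Below is a minimal sketch of how its flag bits can be pulled apart. It assumes only the generic x86 layout (flags in bits 63 to 57, MCA error code in bits 15 to 0, and the "0000 1PPT RRRR IILL" pattern for bus/interconnect errors). The function name `decode_status` is made up for this illustration, and the vendor-specific bits in between are deliberately left uninterpreted.

```python
# Generic flag bits of an x86 machine-check status word (highest bit first).
FLAGS = [
    (63, "VAL"),    # register holds a valid error
    (62, "OVER"),   # an earlier error was overwritten
    (61, "UC"),     # error was not corrected by hardware
    (60, "EN"),     # reporting for this error was enabled
    (59, "MISCV"),  # MISC register is valid
    (58, "ADDRV"),  # ADDR register is valid
    (57, "PCC"),    # processor context may be corrupt
]

# Field values for the bus/interconnect error code pattern 0000 1PPT RRRR IILL.
REQUEST = {0: "GEN", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
           5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
TARGET = {0: "MEM", 1: "RESERVED", 2: "IO", 3: "GEN"}


def decode_status(status):
    """Hypothetical helper: split a status word into its generic fields."""
    code = status & 0xFFFF
    result = {
        "flags": [name for bit, name in FLAGS if (status >> bit) & 1],
        "mca_error_code": code,
        "bus_error": None,
    }
    # Bits 15..11 equal to 00001 mark a bus/interconnect error code.
    if code & 0xF800 == 0x0800:
        result["bus_error"] = {
            "request": REQUEST.get((code >> 4) & 0xF, "UNKNOWN"),
            "target": TARGET[(code >> 2) & 0x3],
        }
    return result


if __name__ == "__main__":
    # Status word copied from the panic message in this post.
    print(decode_status(0xF64A20024E080813))
```

For the value in this post the sketch reports VAL, OVER, UC, EN, ADDRV and PCC set, with a bus error code for a memory read. That is an uncorrected error with a valid address (the ADDR value on the second log line), which at least points toward the memory path rather than software.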
Already tried the following things:
- Firmware update (HP FW7.70)
- Replaced memory
- Tried a non-SMP kernel
- Tried non-SMP/SMP kernel on 1 CPU
- Ran with and without PSP 7.70
- Swapped CPUs with a similar box
- Memtest86+
- HP Diagnostics

None of them fixed my problems.
Any ideas?
4 REPLIES
Justin_99
Valued Contributor

Re: Machine Check Exception, DL385, Dual Core, RHEL4 x86_64

Sounds like you have covered almost everything. We see these errors quite frequently on our systems when they have bad memory or CPU(s). In the cases where replacing those doesn't resolve the problem, we mark the system and RMA the whole thing back to the vendor for a new one. The marks are there to prevent us from getting the same parts back, which has happened.
Tobu
Occasional Advisor

Re: Machine Check Exception, DL385, Dual Core, RHEL4 x86_64

Looks like a hardware problem with the MCH/north bridge. Replacing the motherboard is the easiest option.
vertisunix
Occasional Advisor

Re: Machine Check Exception, DL385, Dual Core, RHEL4 x86_64

FYI: I have been experimenting a lot with memory DIMMs and memory stress testing.

hpasm doesn't catch the ECC errors, so it never writes them to the IML. HP Diagnostics hung before reporting anything.

I found a couple of broken DIMMs and replaced them. The problem seems to be solved and the systems are running stable now.

I'm going to report this weakness of hpasm to HP as well. Thanks for all the replies.
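
Since hpasm missed the ECC errors here, one other place to look is the kernel's own EDAC counters in sysfs. The sketch below assumes a kernel with an EDAC driver loaded for the memory controller, which exposes `ce_count` and `ue_count` files under `/sys/devices/system/edac/mc/mcN/`. Whether RHEL4 U4 provides such a driver on this platform is not confirmed anywhere in this thread, and the function name `read_edac_counts` is made up for illustration.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Hypothetical helper: return corrected/uncorrected ECC error
    counts per memory controller, e.g. {"mc0": {"ce_count": 3, ...}}."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            counter = mc / name
            if counter.is_file():
                entry[name] = int(counter.read_text().strip())
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    # An empty result means no EDAC memory controller is registered.
    print(read_edac_counts())
```

A rising `ce_count` during a memory stress run would be an early hint of a failing DIMM, before it turns into an uncorrectable error and a panic.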