HPE EVA Storage

2012FC - Critical Error: Fault Type: NMI

 
Rob_69_1
Frequent Advisor


Hi all, I recently had a problem with my 2012FC SAN storage, and it apparently crashed the 2 virtualization servers that rely on it (all the VM guests running on them crashed too, of course).

Can anybody help me find the root cause of the problem?

I'm pasting a few lines from the SAN's event log below.

Thank you.

C/W Date/Time EC ESN Message
Warning 2010-07-22 22:00:51 112 A19680 Host link down Chan0
Warning 2010-07-22 22:00:48 112 A19676 Host link down Chan1
Warning 2010-07-22 22:00:48 112 A19674 Host link down Chan0
Warning 2010-07-22 22:00:42 112 B12936 Host link down Chan1
Warning 2010-07-22 22:00:42 112 B12935 Host link down Chan0
Critical 2010-07-22 21:58:31 107 A19660 Critical Error: Fault Type: NMI p1: 01AD3B3 p2:01AB9D3 p3: 01ADE74 p4:01AA740 CThr: Cache
Warning 2010-07-22 21:57:49 112 B12914 Host link down Chan1
Warning 2010-07-22 21:57:49 84 B12910 Killed partner controller; reason=3 (Heartbeat lost)
Warning 2010-07-22 21:57:49 112 B12913 Host link down Chan0
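For anyone reading an excerpt like this, a small script can split each line into its fields and print the events oldest first, since the array lists them newest first. This is only a sketch: the field layout is assumed from the header row above (severity, date/time, event code, event serial number, message), and `parse_event` is a made-up helper name, not part of any HP tool.

```python
import re
from datetime import datetime

# Layout assumed from the header row: C/W, Date/Time, EC, ESN, Message.
LINE_RE = re.compile(
    r"^(?P<sev>Info|Warning|Critical)\s+"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<code>\d+)\s+"
    r"(?P<esn>[AB]\d+)\s+"
    r"(?P<msg>.*)$"
)

def parse_event(line):
    """Return a dict for one log line, or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    ev = m.groupdict()
    ev["ts"] = datetime.strptime(ev["ts"], "%Y-%m-%d %H:%M:%S")
    ev["code"] = int(ev["code"])
    return ev

# Three lines copied from the excerpt above.
LOG = """\
Warning 2010-07-22 22:00:51 112 A19680 Host link down Chan0
Critical 2010-07-22 21:58:31 107 A19660 Critical Error: Fault Type: NMI p1: 01AD3B3 p2:01AB9D3 p3: 01ADE74 p4:01AA740 CThr: Cache
Warning 2010-07-22 21:57:49 84 B12910 Killed partner controller; reason=3 (Heartbeat lost)
"""

events = [e for e in map(parse_event, LOG.splitlines()) if e is not None]
events.sort(key=lambda e: e["ts"])  # flip newest-first into chronological order

for e in events:
    print(e["ts"], e["esn"], e["sev"], e["msg"])
```

Lines that do not fit the assumed layout (such as the header row) are skipped rather than raising an error.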
Rob_69_1
Frequent Advisor

Re: 2012FC - Critical Error: Fault Type: NMI

Additional information:

the array has 2 controllers

firmware version is J200P39

I saw similar issues on the forum, but the "CThr:" part is different in my case (everybody else had a single controller, and CThr was "host").

By the way, what does CThr mean?

Any help is much appreciated.
Rob_69_1
Frequent Advisor

Re: 2012FC - Critical Error: Fault Type: NMI

I spoke to HP. They told me that the cause of the failure is unknown, and that they would NOT escalate the case because my firmware was not at the latest version when the issue happened.

So I have now upgraded the firmware on both controllers. Honestly, I hope the issue does not show up again, even though I would now be eligible to have the case escalated (?)

The systems were down for a few minutes and then recovered "by themselves". Nevertheless, it's a bit annoying not to know what happened, apart from the fact that both controllers went down.

I'm pasting a wider excerpt from the event log below, still hoping some of the experts in the forum will come up with an insight.

BTW, the critical event is near the bottom:
"Critical 2010-07-22 21:58:31 107 A19660 Critical Error: Fault Type: NMI p1: 01AD3B3 p2:01AB9D3 p3: 01ADE74 p4:01AA740 CThr: Cache "

Thank you...

----------------
Info 2010-07-22 22:00:52 310 B12943 Discovery and initialization of enclosure data has completed following a rescan.
Warning 2010-07-22 22:00:51 112 A19680 Host link down Chan0
Info 2010-07-22 22:00:50 111 B12942 Host link up Chan1: 2 Loop IDs, Fabric
Info 2010-07-22 22:00:50 111 B12941 Host link up Chan0: 2 Loop IDs, Fabric
Info 2010-07-22 22:00:50 111 A19679 Host link up Chan0: 1 Loop ID
Info 2010-07-22 22:00:50 111 A19678 Host link up Chan1: 2 Loop IDs, Fabric
Info 2010-07-22 22:00:48 19 B12940 Rescan bus done. Reason Code: 24. Found 9 drives, 1 Drive Enclosure
Info 2010-07-22 22:00:48 19 A19677 Rescan bus done. Reason Code: 24. Found 9 drives, 1 Drive Enclosure
Warning 2010-07-22 22:00:48 112 A19676 Host link down Chan1
Info 2010-07-22 22:00:48 19 A19675 Rescan bus done. Reason Code: 24. Found 9 drives, 1 Drive Enclosure
Warning 2010-07-22 22:00:48 112 A19674 Host link down Chan0
..
Info 2010-07-22 22:00:47 111 A19673 Host link up Chan1: 1 Loop ID
Info 2010-07-22 22:00:47 111 A19672 Host link up Chan0: 1 Loop ID
...
Info 2010-07-22 22:00:44 72 A19668 Recovery initiated for Controller A
Info 2010-07-22 22:00:43 73 A19667 Heartbeat detected from the other RAID controller
Warning 2010-07-22 22:00:42 112 B12936 Host link down Chan1
Warning 2010-07-22 22:00:42 112 B12935 Host link down Chan0
Info 2010-07-22 22:00:42 72 B12934 Recovery initiated for Controller A
Info 2010-07-22 22:00:42 73 B12933 Heartbeat detected from the other RAID controller
Info 2010-07-22 22:00:42 195 B12932 Auto-writethrough trigger event: partner processor recovered
Info 2010-07-22 22:00:42 195 A19666 Auto-writethrough trigger event: partner processor recovered
Info 2010-07-22 22:00:41 33 B12931 Time/date has been changed to 2010-07-22 22:00:42
Info 2010-07-22 22:00:41 112 A19665 Host link down Chan1
Info 2010-07-22 22:00:41 112 A19664 Host link down Chan0
...
Info 2010-07-22 22:00:21 56 A19661 Storage Controller booted. SC code version: J200P39
Info 2010-07-22 22:00:04 81 B12925 Kill line released automatically (allow other RAID controller to boot)
Critical 2010-07-22 21:58:31 107 A19660 Critical Error: Fault Type: NMI p1: 01AD3B3 p2:01AB9D3 p3: 01ADE74 p4:01AA740 CThr: Cache
Info 2010-07-22 21:58:16 206 B12924 Scrub Vdisk started (Vdisk: GlistenDisk02, SN: 00c0ffd7af590000d7707c4a00000000)
Info 2010-07-22 21:58:08 310 B12923 Discovery and initialization of enclosure data has completed following a rescan.
Info 2010-07-22 21:58:04 111 B12922 Host link up Chan1: 3 Loop IDs, Fabric
Info 2010-07-22 21:58:04 111 B12921 Host link up Chan0: 3 Loop IDs, Fabric
Info 2010-07-22 21:58:04 71 B12920 Failover completed, failover set A
...
Info 2010-07-22 21:57:50 114 B12915 Drive link down Chan0
Warning 2010-07-22 21:57:49 112 B12914 Host link down Chan1
Warning 2010-07-22 21:57:49 112 B12913 Host link down Chan0
Info 2010-07-22 21:57:49 71 B12912 Failover initiated, failover set A
Info 2010-07-22 21:57:49 194 B12911 Auto-writethrough trigger event: partner processor down
Warning 2010-07-22 21:57:49 84 B12910 Killed partner controller; reason=3 (Heartbeat lost)
----------------
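To make the sequence in this excerpt easier to follow, the sketch below sorts a handful of the lines chronologically and measures how long controller B went without its partner, from "Killed partner controller" (code 84) to "Heartbeat detected" (code 73). It assumes that the leading A or B of the event serial number identifies the controller that logged the event, and the helper names `parse` and `first` are hypothetical. Ordering across the two controllers also relies on their clocks agreeing, which the log does not guarantee (note the "Time/date has been changed" entry).

```python
from datetime import datetime

# A few lines copied from the excerpt above (newest first, as the array prints them).
LOG = """\
Info 2010-07-22 22:00:42 73 B12933 Heartbeat detected from the other RAID controller
Info 2010-07-22 22:00:21 56 A19661 Storage Controller booted. SC code version: J200P39
Info 2010-07-22 22:00:04 81 B12925 Kill line released automatically (allow other RAID controller to boot)
Critical 2010-07-22 21:58:31 107 A19660 Critical Error: Fault Type: NMI p1: 01AD3B3 p2:01AB9D3 p3: 01ADE74 p4:01AA740 CThr: Cache
Info 2010-07-22 21:58:04 71 B12920 Failover completed, failover set A
Warning 2010-07-22 21:57:49 84 B12910 Killed partner controller; reason=3 (Heartbeat lost)
"""

def parse(line):
    # Fields: severity, date, time, event code, serial number, free-text message.
    sev, date, time, code, esn, msg = line.split(None, 5)
    ts = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")
    # Assumption: the first character of the serial number is the logging controller.
    return {"ts": ts, "sev": sev, "code": int(code), "ctrl": esn[0], "esn": esn, "msg": msg}

events = sorted((parse(l) for l in LOG.splitlines() if l.strip()), key=lambda e: e["ts"])

def first(code, ctrl):
    """First event (in time order) with this code logged by this controller, or None."""
    return next((e for e in events if e["code"] == code and e["ctrl"] == ctrl), None)

killed = first(84, "B")     # B kills its partner after losing the heartbeat
heartbeat = first(73, "B")  # B sees the partner's heartbeat again
gap = heartbeat["ts"] - killed["ts"]

for e in events:
    print(e["ts"].time(), e["ctrl"], e["code"], e["msg"])
print("B ran without its partner for", gap)
```

This only reorders and measures what is already in the log; it says nothing about why the heartbeat was lost in the first place.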
Johnny Fish
New Member

Re: 2012FC - Critical Error: Fault Type: NMI

I know this is an old thread, but this just happened to me too, and we're running the latest firmware. HP says "it happens".

A979 2011-03-24 16:37:13 84 W A Killed partner controller. (reason: Heartbeat lost [failover reason code: 3])

It has not happened in the last year except for last night, so if it's a very rare event I guess I can live with it.
predrag81
Valued Contributor

Re: 2012FC - Critical Error: Fault Type: NMI

Hi Rob,

Could you please give me some information about the problem you had with the MSA?
I'm dealing with exactly the same problem these days.

Did you solve the problem, and if so, how?

KR,
Predrag