
MSA 2012i RAID controller failed

 
Bart_Rajchel
Advisor

MSA 2012i RAID controller failed

Hi, Everyone!
Brand-new MSA 2012i, installed two weeks ago. We updated the firmware on both controllers and configured a simple RAID 5 with 14 drives. Everything worked fine until we added a few more drives and started expanding the virtual disk. Within the first few hours of the expansion, RAID controller B failed. After a restart it failed again within the next few hours. We got a brand-new one from HP. To make sure this was not a one-time event, we waited over the weekend to see whether it would happen again. On Monday morning I noticed that controller A had failed with exactly the same errors.
Here are all the critical errors we got from the event log:

C 05-30 14:07:11 314 B403 FRU type: RAID IOM A, problem: encl 0. Product ID: AJ748A, S/N: 3CL811R021 rev: 27. Related event ID: 402, type: 313
C 05-30 14:07:11 313 B402 RAID controller A failed, reason PCIE link recovery failed. Product ID , S/N
C 05-30 14:06:14 107 A414 Critical Error: Fault Type: Divide By 0 p1: 02FC16D p2:0000000 p3: 0000000 p4:0000000 CThr: CT_Morph
W 05-30 14:06:06 84 B390 Killed partner controller; reason=29 (PCIE link recovery failed)
W 05-27 13:58:13 1 A372 Vdisk critical: Storage01, SN: 00c0ffd53c0a0048876f2d4800000000
C 05-21 01:46:06 314 A303 FRU type: RAID IOM B, problem: encl 0. Product ID: AJ748A, S/N: 3CL811R060 rev: 27. Related event ID: 302, type: 313
C 05-21 01:46:06 313 A302 RAID controller B failed, reason PCIE link recovery failed. Product ID , S/N
C 05-21 01:45:07 107 B268 Critical Error: Fault Type: Divide By 0 p1: 02FC16D p2:0000000 p3: 0000000 p4:0000000 CThr: CT_Morph
W 05-21 01:44:59 84 A290 Killed partner controller; reason=29 (PCIE link recovery failed)
C 05-20 09:28:25 314 A282 FRU type: A/C PSU, Right, problem: encl 1. Product ID: 481320-001, S/N: 3CL804P022 rev: A. Related event ID: 281, type: 168
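
For anyone comparing their own log against the one above, here is a minimal Python sketch that splits these lines into fields and keeps only the critical ones. The field names are my assumption, inferred from the log's own cross-references ("Related event ID: 402, type: 313" points at the line beginning "313 B402"); the letter in front of the event ID appears to be the controller that logged the event. `parse_line` and `critical_events` are hypothetical helper names, not part of any HP tool.

```python
import re

# Assumed layout: [severity] MM-DD HH:MM:SS <event type> <A|B><event id> <message>
# The severity letter is optional because informational lines carry none.
LINE_RE = re.compile(
    r"^(?:(?P<severity>[A-Z])\s+)?"
    r"(?P<date>\d{2}-\d{2})\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<event_type>\d+)\s+"
    r"(?P<event_id>[AB]\d+)\s+"
    r"(?P<message>.*)$"
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def critical_events(lines):
    """Keep only the entries whose severity letter is 'C'."""
    parsed = (parse_line(line) for line in lines)
    return [event for event in parsed if event and event["severity"] == "C"]


# Sample lines copied from this thread.
SAMPLE = [
    "C 05-30 14:07:11 313 B402 RAID controller A failed, reason PCIE link recovery failed. Product ID , S/N",
    "W 05-30 14:06:06 84 B390 Killed partner controller; reason=29 (PCIE link recovery failed)",
    "01-13 10:33:08 71 A377 Failover completed, failover set B",
]

if __name__ == "__main__":
    for event in critical_events(SAMPLE):
        print(event["date"], event["time"], event["event_id"], event["message"])
```

Grouping the output by the `event_id` prefix makes it easier to see which controller reported the partner as failed.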

We have the newest firmware on both controllers.
Any help will be appreciated.
John Kufrovich
Honored Contributor

Re: MSA 2012i RAID controller failed

Hello Bart,

We have seen this problem with single controller configurations but not with dual controller configs.

If you have an open case with us (HP), please ask them to elevate it. I'll take a look at the logs.

Regards,
jk
Bart_Rajchel
Advisor

Re: MSA 2012i RAID controller failed

I was about to call HP, but other projects take priority for now. I am a little concerned about making it available to users. I have never seen a device with so many issues. What is the purpose of dual controllers if both can fail within a few hours?
Bart_Rajchel
Advisor

Re: MSA 2012i RAID controller failed

Thanks!
Calling HP.
I will post the results.
Case # 3602029050
John Kufrovich
Honored Contributor

Re: MSA 2012i RAID controller failed

Hey Bart,

Can you put Controller A p0 and Controller B p0 on the same subnet, and p1 from each controller on a different subnet?

jk
Bart_Rajchel
Advisor

Re: MSA 2012i RAID controller failed

I wasn't using those ports at all but had them on the same subnet.
Bryan Mroch
New Member

Re: MSA 2012i RAID controller failed

Hey Bart, what exactly was the fix for your issue?
I think I am having the same or a similar issue on an MSA 2012i with dual controllers.
This morning controller B failed/locked up, and I had to pull it from the chassis and re-insert it to get the unit responding.
I have not updated the firmware yet.

C 01-13 10:34:08 314 A383 FRU type: RAID IOM B, problem: encl 0. Product ID: AJ748A, S/N: ********* rev: C. Related event ID: 382, type: 313
C 01-13 10:34:08 313 A382 RAID controller B failed, reason PCIE link recovery failed. Product ID , S/N
01-13 10:33:18 310 A381 Discovery and initialization of enclosure data has completed following a rescan.
01-13 10:33:14 111 A380 Host link up Chan1: 0 Loop IDs
01-13 10:33:14 111 A379 Host link up Chan0: 0 Loop IDs
01-13 10:33:13 19 A378 Rescan bus done. Reason Code: 24. Found 12 drives, 1 Drive Enclosure
01-13 10:33:08 71 A377 Failover completed, failover set B
01-13 10:33:08 77 A376 Cache initialized for RAID controller B. WB data found
01-13 10:33:08 19 A375 Rescan bus done. Reason Code: 2. Found 12 drives, 1 Drive Enclosure
C 01-13 10:33:07 107 B655 Critical Error: OSMEnterDebugger p1: 02074A1 p2:02073D0 p3: 0113B24 p4:0113907 CThr: Serial_1, DbgRegNum=255
Bryan Mroch
New Member

Re: MSA 2012i RAID controller failed

Edit/addition:
Controller A p0 and Controller B p0 are already on the same subnet, and p1 on each controller is on a different subnet.
Bart_Rajchel
Advisor

Re: MSA 2012i RAID controller failed

Bryan,
I went through a few things, including firmware updates and Windows 2003 updates; the iSCSI HBA was also updated with new firmware and drivers. I am almost sure that separating the subnets between the ports did the job. I am sure you know what you are doing, otherwise you wouldn't be playing with iSCSI :)
But this is what I have; sometimes it's good to see a picture.

Controller A
Management Port 10.1.16.100/255.255.0.0 (production network)

p0 192.168.0.50/255.255.255.0
p1 192.168.10.50/255.255.255.0

Controller B
Management Port 10.1.16.101/255.255.0.0 (production network)
p0 192.168.0.51/255.255.255.0
p1 192.168.10.51/255.255.255.0
So essentially we have two storage subnets and one production subnet, which keeps the storage traffic unaffected by our regular network traffic. It has been months since I made those changes, and it has been running like a champ ever since. Email me with your phone number if you have any questions and we can talk.
brajchel@kmrrec.org
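
If you want to sanity-check your own addressing against the layout above, here is a minimal Python sketch using the standard `ipaddress` module. It only encodes the rules discussed in this thread (p0 on both controllers in one subnet, p1 on both in another, storage ports kept off the management subnet); it is not an official HP requirement check. The `PORTS` dictionary and `check_layout` are hypothetical names for illustration.

```python
import ipaddress

# Addresses copied from the post above; the dictionary layout is illustrative.
PORTS = {
    ("A", "mgmt"): "10.1.16.100/255.255.0.0",
    ("A", "p0"): "192.168.0.50/255.255.255.0",
    ("A", "p1"): "192.168.10.50/255.255.255.0",
    ("B", "mgmt"): "10.1.16.101/255.255.0.0",
    ("B", "p0"): "192.168.0.51/255.255.255.0",
    ("B", "p1"): "192.168.10.51/255.255.255.0",
}


def check_layout(ports):
    """Return a list of problems; an empty list means the layout matches."""
    nets = {key: ipaddress.ip_interface(addr).network for key, addr in ports.items()}
    problems = []

    # Same-numbered host ports on both controllers should share one subnet.
    for port in ("p0", "p1"):
        if nets[("A", port)] != nets[("B", port)]:
            problems.append(f"{port}: controllers A and B are on different subnets")

    for ctrl in ("A", "B"):
        # p0 and p1 on one controller should be on separate subnets.
        if nets[(ctrl, "p0")].overlaps(nets[(ctrl, "p1")]):
            problems.append(f"controller {ctrl}: p0 and p1 share a subnet")
        # Storage ports should stay off the management (production) subnet.
        for port in ("p0", "p1"):
            if nets[(ctrl, port)].overlaps(nets[(ctrl, "mgmt")]):
                problems.append(f"controller {ctrl}: {port} overlaps the management subnet")

    return problems


if __name__ == "__main__":
    issues = check_layout(PORTS)
    print("layout OK" if not issues else "\n".join(issues))
```

Putting all four host ports on one subnet, which is how I originally had them, is reported as p0 and p1 sharing a subnet on each controller.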
DeepScamp
Occasional Visitor

Re: MSA 2012i RAID controller failed

Hello there.

In 2014 we have the same problem.

Only one of the two controllers works at a time. When I turn the MSA 2012i on, for example, controller A works and controller B fails with the reason "PCIE link recovery failed". When I reboot both controllers from the GUI, controller B comes up and controller A fails with the same reason...

What is it?

I tried firmware versions J212R10-03, J212P01-01, J210P23-02...


Sometimes, instead of "PCIE link recovery failed", I see "boot handshake timeout".

Individually these controllers work fine, but together... :(

Port 0 on both controllers is in the same subnet; port 1 on both is in another.