MSA2012i Connection Issue with failed Drive

 
inti
Advisor

Hi,

we had an HDD failure a few days ago on an MSA2012i at one of our customers.
Basically everything worked as it should: the error was reported, we got a mail notification, and the (global) spare kicked in.

The only thing that bothers me is that right before the failure there seemed to be a connection loss of about 2-3 minutes (I got iSCSIPrt errors on an attached host).

In this particular installation we use a dual-controller model and have one vDisk assigned to each controller, with different servers attached. However, the failed disk was a member of the vDisk on the second controller, and the connection issue occurred on both controllers. Is this normal behaviour?

Regards,

IntI
Wickedsunny
Valued Contributor

Re: MSA2012i Connection Issue with failed Drive

This is certainly not normal. A failed HDD should not affect host connectivity unless a vdisk fails.

In your case the vdisk will go to a critical state and then use the spare to rebuild.

Please post the logs.

Regards,
Sunny
inti
Advisor

Re: MSA2012i Connection Issue with failed Drive

It looks like this in the Event-Log:


B396 04-20 18:04:35 33 I B Time/date has been changed to 2009-04-20 18:04:34
B397 06-05 13:21:29 59 I B Disk channel error (Channel:0 ID:10 SN:3QQ0L0ZX000099200V9K Encl:0 Slot:10): I/O Timeout cdb:Rd 0463860f 0008
B398 06-05 13:21:43 59 I B Disk channel error (Channel:0 ID:10 SN:3QQ0L0ZX000099200V9K Encl:0 Slot:10): I/O Timeout cdb:Rd 11d9e280 0080
B399 06-05 13:21:43 59 I B Disk channel error (Channel:0 ID:138 SN:3QQ0L0ZX000099200V9K Encl:0 Slot:10): I/O Timeout cdb:Rd 11d9e280 0080
B400 06-05 13:22:43 59 I B Disk channel error (Channel:0 ID:138 SN:3QQ0L0ZX000099200V9K Encl:0 Slot:10): I/O Timeout cdb:7 additional
B401 06-05 13:22:43 59 I B Disk channel error (Channel:0 ID:10 SN:3QQ0L0ZX000099200V9K Encl:0 Slot:10): I/O Timeout cdb:Rd 04638600 000f
B402 06-05 13:23:07 59 I B Disk channel error (Channel:0 ID:10 SN:3QQ0L0ZX000099200V9K Encl:0 Slot:10): I/O Timeout cdb:5 additional
B403 06-05 13:23:07 8 W B Vdisk GWT drive down (Channel:0 ID:10 SN:3QQ0L0ZX000099200V9K Encl:0 Slot:10)
B404 06-05 13:23:07 314 C B FRU type: drive, problem: encl 0 deviceID 10. Vendor: SEAGAT Product ID: ST3450856SS , S/N: 3QQ0L0ZX000099200V9K rev: 0004. Related event ID: 403, type: 8
B405 06-05 13:23:07 1 W B Vdisk critical: GWT, SN: 00c0ffd73e0a0000b9bcc74900000000
B406 06-05 13:23:07 59 I B Disk channel error (Channel:0 ID:10 SN:3QQ0L0ZX000099200V9K Encl:0 Slot:10): I/O Timeout cdb:Rd 11d9e280 0080
B407 06-05 13:23:07 59 I B Disk channel error (Channel:0 ID:138 SN:3QQ0L0ZX000099200V9K Encl:0 Slot:10): I/O Timeout cdb:Rd 04638600 000f
A509 06-05 13:23:18 59 I A Disk channel error (Channel:0 ID:10 SN:3QQ0L0ZX000099200V9K Encl:0 Slot:10): I/O Timeout cdb:000000000000
B408 06-05 13:23:20 59 I B Disk channel error (Channel:0 ID:138 SN:3QQ0L0ZX000099200V9K Encl:0 Slot:10): I/O Timeout cdb:7 additional
B409 06-05 13:23:20 59 I B Disk channel error (Channel:0 ID:138 SN:3QQ0L0ZX000099200V9K Encl:0 Slot:10): Protocol Error cdb:Rd 04638600 000f
B410 06-05 13:23:30 59 I B Disk channel error (Channel:0 ID:138 SN:3QQ0L0ZX000099200V9K Encl:0 Slot:10): Protocol Error cdb:5 additional
B411 06-05 13:23:30 59 I B Disk channel error (Channel:0 ID:138 SN:3QQ0L0ZX000099200V9K Encl:0 Slot:10): I/O Timeout cdb:Rd 002f802f 0008
A510 06-05 13:23:36 59 I A Disk channel error (Channel:0 ID:10 SN:3QQ0L0ZX000099200V9K Encl:0 Slot:10): I/O Timeout cdb:3 additional
A511 06-05 13:23:36 59 I A Disk channel error (Channel:0 ID:138 SN:3QQ0L0ZX000099200V9K Encl:0 Slot:10): I/O Timeout cdb:000000000000
B412 06-05 13:23:58 59 I B Disk channel error (Channel:0 ID:138 SN:3QQ0L0ZX000099200V9K Encl:0 Slot:10): I/O Timeout cdb:7 additional
B413 06-05 13:23:58 59 I B Disk channel error (Channel:0 ID:10 SN:3QQ0L0ZX000099200V9K Encl:0 Slot:10): I/O Timeout cdb:Wr 00000001 0001
B414 06-05 13:24:12 59 I B Disk channel error (Channel:0 ID:138 SN:3QQ0L0ZX000099200V9K Encl:0 Slot:10): I/O Timeout cdb:Wr 00000001 0001
A512 06-05 13:24:50 59 I A Disk channel error (Channel:0 ID:138 SN:3QQ0L0ZX000099200V9K Encl:0 Slot:10): I/O Timeout cdb:3 additional
A513 06-05 13:24:50 59 I A Disk channel error (Channel:0 ID:10 SN:3QQ0L0ZX000099200V9K Encl:0 Slot:10): I/O Timeout cdb:1b0000000100
B415 06-05 13:24:54 59 I B Disk channel error (Channel:0 ID:138 SN:3QQ0L0ZX000099200V9K Encl:0 Slot:10): I/O Timeout cdb:3 additional
B416 06-05 13:24:54 9 I B Spare kicked in (Channel:0 ID:8, SN:3QQ0MKD5000099213ED4 Encl:0 Slot:8) for critical Vdisk (Vdisk: GWT, SN: 00c0ffd73e0a0000b9bcc74900000000)
B417 06-05 13:24:54 37 I B Vdisk reconstruct started (Vdisk: GWT, SN: 00c0ffd73e0a0000b9bcc74900000000) drive: Channel:0 ID:8 SN:3QQ0MKD5000099213ED4 Encl:0 Slot:8
A514 06-05 13:26:58 59 I A Disk channel error (Channel:0 ID:10 SN:3QQ0L0ZX000099200V9K Encl:0 Slot:10): I/O Timeout cdb:3 additional
A515 06-05 13:26:58 59 I A Disk channel error (Channel:0 ID:138 SN:3QQ0L0ZX000099200V9K Encl:0 Slot:10): I/O Timeout cdb:1b0000000100
A516 06-05 13:28:03 59 I A Disk channel error (Channel:0 ID:138 SN:3QQ0L0ZX000099200V9K Encl:0 Slot:10): I/O Timeout cdb:3 additional
A517 06-05 13:28:03 19 I A Rescan bus done. Reason Code: 4. Found 7 drives, 1 Drive Enclosure
B418 06-05 14:07:01 59 I B Disk channel error (Channel:0 ID:10 SN:3QQ0L0ZX000099200V9K Encl:0 Slot:10): I/O Timeout cdb:030000001200
B419 06-05 14:07:39 59 I B Disk channel error (Channel:0 ID:10 SN:3QQ0L0ZX000099200V9K Encl:0 Slot:10): I/O Timeout cdb:3 additional
B420 06-05 14:07:39 59 I B Disk channel error (Channel:0 ID:138 SN:3QQ0L0ZX000099200V9K Encl:0 Slot:10): I/O Timeout cdb:030000001200
A518 06-05 14:33:18 33 I A Time/date has been changed to 2009-06-05 15:32:23
B421 06-05 14:33:18 59 I B Disk channel error (Channel:0 ID:138 SN:3QQ0L0ZX000099200V9K Encl:0 Slot:10): I/O Timeout cdb:3 additional
B422 06-05 14:33:18 33 I B Time/date has been changed to 2009-06-05 15:32:23
B424 06-05 15:07:04 59 I B Disk channel error (Channel:0 ID:10 SN:3QQ0L0ZX000099200V9K Encl:0 Slot:10): I/O Timeout cdb:030000001200
B425 06-05 15:07:42 59 I B Disk channel error (Channel:0 ID:10 SN:3QQ0L0ZX000099200V9K Encl:0 Slot:10): I/O Timeout cdb:3 additional
B426 06-05 15:07:42 59 I B Disk channel error (Channel:0 ID:138 SN:3QQ0L0ZX000099200V9K Encl:0 Slot:10): I/O Timeout cdb:030000001200
A519 06-05 15:42:52 33 I A Time/date has been changed to 2009-06-05 14:43:48
B423 06-05 15:42:52 33 I B Time/date has been changed to 2009-06-05 14:43:48
B427 06-05 16:07:06 59 I B Disk channel error (Channel:0 ID:138 SN:3QQ0L0ZX000099200V9K Encl:0 Slot:10): I/O Timeout cdb:3 additional
B428 06-05 16:07:06 59 I B Disk channel error (Channel:0 ID:10 SN:3QQ0L0ZX000099200V9K Encl:0 Slot:10): I/O Timeout cdb:030000001200
B429 06-05 16:07:45 59 I B Disk channel error (Channel:0 ID:10 SN:3QQ0L0ZX000099200V9K Encl:0 Slot:10): I/O Timeout cdb:3 additional
B430 06-05 16:07:45 59 I B Disk channel error (Channel:0 ID:138 SN:3QQ0L0ZX000099200V9K Encl:0 Slot:10): I/O Timeout cdb:030000001200
B431 06-05 16:49:41 59 I B Disk channel error (Channel:0 ID:138 SN:3QQ0L0ZX000099200V9K Encl:0 Slot:10): I/O Timeout cdb:3 additional
B432 06-05 16:49:41 18 I B Vdisk reconstruct completed successfully (Vdisk: GWT, SN: 00c0ffd73e0a0000b9bcc74900000000)
B433 06-05 17:07:08 59 I B Disk channel error (Channel:0 ID:10 SN:3QQ0L0ZX000099200V9K Encl:0 Slot:10): I/O Timeout cdb:030000001200
B434 06-05 17:07:47 59 I B Disk channel error (Channel:0 ID:10 SN:3QQ0L0ZX000099200V9K Encl:0 Slot:10): I/O Timeout cdb:3 additional
B435 06-05 17:07:47 59 I B Disk channel error (Channel:0 ID:138 SN:3QQ0L0ZX000099200V9K Encl:0 Slot:10): I/O Timeout cdb:030000001200
B436 06-05 18:07:10 59 I B Disk channel error (Channel:0 ID:138 SN:3QQ0L0ZX000099200V9K Encl:0 Slot:10): I/O Timeout cdb:3 additional
B437 06-05 18:07:10 59 I B Disk channel error (Channel:0 ID:10 SN:3QQ0L0ZX000099200V9K Encl:0 Slot:10): I/O Timeout cdb:030000001200
B438 06-05 18:07:49 59 I B Disk channel error (Channel:0 ID:10 SN:3QQ0L0ZX000099200V9K Encl:0 Slot:10): I/O Timeout cdb:3 additional
B439 06-05 18:07:49 59 I B Disk channel error (Channel:0 ID:138 SN:3QQ0L0ZX000099200V9K Encl:0 Slot:10): I/O Timeout cdb:030000001200
B440 06-05 19:07:12 59 I B Disk channel error (Channel:0 ID:138 SN:3QQ0L0ZX000099200V9K Encl:0 Slot:10): I/O Timeout cdb:3 additional
B441 06-05 19:07:12 59 I B Disk channel error (Channel:0 ID:10 SN:3QQ0L0ZX000099200V9K Encl:0 Slot:10): I/O Timeout cdb:030000001200

... and so on. The channel errors continue to occur every hour. The disk is no longer visible in the enclosure view. The replacement disk has already arrived, but we have not inserted it yet.
If you need more logs (complete support logs?), just drop me a line.
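
A side note on the raw hex values in the later log entries: if the strings after "cdb:" in events such as A509, A513 and B418 are raw 6-byte SCSI command descriptor blocks (an assumption, the log format is not documented in this thread), they can be read with the standard SCSI opcode table. A minimal Python sketch, where decode_cdb and SCSI_OPCODES are just illustrative names:

```python
# Standard SCSI operation codes (first CDB byte) for the raw CDBs seen in the log.
# Assumption: the hex string after "cdb:" is the raw 6-byte CDB.
SCSI_OPCODES = {
    0x00: "TEST UNIT READY",
    0x03: "REQUEST SENSE",
    0x1B: "START STOP UNIT",
}


def decode_cdb(hex_cdb: str) -> str:
    """Return a short description of a raw 6-byte CDB given as a hex string."""
    cdb = bytes.fromhex(hex_cdb)
    opcode = cdb[0]
    name = SCSI_OPCODES.get(opcode, f"unknown opcode 0x{opcode:02x}")
    if opcode == 0x03:
        # Byte 4 of REQUEST SENSE is the allocation length.
        return f"{name} (allocation length {cdb[4]})"
    if opcode == 0x1B:
        # Bit 0 of byte 4 of START STOP UNIT is the START bit.
        action = "start" if cdb[4] & 0x01 else "stop"
        return f"{name} ({action})"
    return name


for raw in ("000000000000", "1b0000000100", "030000001200"):
    print(raw, "->", decode_cdb(raw))
```

Read this way, the controllers seem to have been polling the drive (TEST UNIT READY), trying to spin it up again (START STOP UNIT with the start bit set) and then requesting sense data once an hour. That would fit a disk that has not fully died, but it is an interpretation, not something the log states.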
Wickedsunny
Valued Contributor

Re: MSA2012i Connection Issue with failed Drive

No, not required. These are good enough.

What firmware are the controllers running?
inti
Advisor

Re: MSA2012i Connection Issue with failed Drive

Hi, they are running the latest version I know of:

Controller A Versions
---------------------
Storage Controller CPU Type : Celeron 566MHz
Storage Controller Firmware : J210P12
Storage Controller Memory : F300R22
Storage Controller Loader : 15.010
Management Controller Firmware: W421R20
Management Controller Loader : 12.013
Expander Controller Firmware : 3022
CPLD Revision : 27
Hardware Revision : LCA 56
Host Interface Module : 50
Host Interface Module Model : 2

Controller B Versions
---------------------
Storage Controller CPU Type : Celeron 566MHz
Storage Controller Firmware : J210P12
Storage Controller Memory : F300R22
Storage Controller Loader : 15.010
Management Controller Firmware: W421R20
Management Controller Loader : 12.013
Expander Controller Firmware : 3022
CPLD Revision : 27
Hardware Revision : LCA 56
Host Interface Module : 50
Host Interface Module Model : 2
Wickedsunny
Valued Contributor

Re: MSA2012i Connection Issue with failed Drive

Hi,

The firmware is indeed the latest. The only other possibility I can think of is a delay in the spare kicking in to reconstruct the vdisk.

Since you mentioned that this happened on two vdisks because the failed HDD was part of both, I would suggest checking the firmware version of the spare to see whether it is the same as that of the failed disk. Basically, according to the logs the spare took around 2 minutes to kick in, which caused the I/O to drop. This latency can occur if the disk has not completely died and the controller is trying to bring it back up. I don't think this is something to worry about.
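
To put numbers on that delay, here is a small Python sketch that computes the gaps between the milestones in the event log posted earlier. The timestamps are copied from events B397, B403 and B416. The year 2009 is taken from the "Time/date has been changed" entries, and the helper names are made up for illustration.

```python
from datetime import datetime

# Milestones copied from the controller B entries of the event log.
EVENTS = {
    "first disk channel error (B397)": "06-05 13:21:29",
    "drive down, vdisk critical (B403)": "06-05 13:23:07",
    "spare kicked in (B416)": "06-05 13:24:54",
}


def parse(ts: str) -> datetime:
    # The log lines carry no year; 2009 is assumed from the time-change events.
    return datetime.strptime("2009-" + ts, "%Y-%m-%d %H:%M:%S")


def gaps(events):
    """Return (from, to, seconds) for each pair of consecutive milestones."""
    items = [(name, parse(ts)) for name, ts in events.items()]
    return [
        (a, b, int((tb - ta).total_seconds()))
        for (a, ta), (b, tb) in zip(items, items[1:])
    ]


for start, end, seconds in gaps(EVENTS):
    print(f"{start} -> {end}: {seconds} s")
```

With these timestamps the sketch reports 98 s from the first timeout until the drive was marked down and another 107 s until the spare kicked in. That is about 3.4 minutes in total, roughly in line with the 2-3 minutes of connection loss seen on the host.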

I have a few old logs that were collected when a spare kicked in. I will go through them and see whether I find a similar latency.

Cheers
Sunny


inti
Advisor

Re: MSA2012i Connection Issue with failed Drive

What do you mean by the disk being part of both vdisks?

We have a support log from when we took the storage live about 3 months ago. I cut out the disk list from before and after the failure. We did not change anything in the meantime. In fact, it was very stable.

What is interesting to me is that the hot spare also had an owner, and in our case it was Ctlr A, while the vDisk on Ctlr B had the failing disk. Could the "handover" of the spare disk be the reason? (Maybe you have encountered a similar situation in one of your old logs.)


Before:

ID  Serial#               Vendor   Rev.  State      Type  Size(GB)  Rate(Gb/s)  SP
------------------------------------------------------------------------------
0   3QQ0KKWA00009920GG2J  SEAGATE  0004  VDISK      SAS   450       3.0
1   3QQ0FXWF00009916MD6M  SEAGATE  0004  VDISK      SAS   450       3.0
2   3QQ0KKA600009920HYR5  SEAGATE  0004  VDISK      SAS   450       3.0
3   3QQ0KJ9D00009920ZS1A  SEAGATE  0004  VDISK      SAS   450       3.0
8   3QQ0MKD5000099213ED4  SEAGATE  0004  GLOBAL SP  SAS   450       3.0
9   3QQ0FDNJ000099144N27  SEAGATE  0004  VDISK      SAS   450       3.0
10  3QQ0L0ZX000099200V9K  SEAGATE  0004  VDISK      SAS   450       3.0
11  3QQ0KKVW00009920GG7A  SEAGATE  0004  VDISK      SAS   450       3.0
------------------------------------------------------------------------------

-------------------------------------------------------------------------------------
Physical Drive to Virtual Disk Mapping
Encl  Slot  Serial Number               Current Owner
-------------------------------------------------------------------------------------
0     0     0x00C0FFD73E370000A75F9549  Ctlr A
0     1     0x00C0FFD73E370000A75F9549  Ctlr A
0     2     0x00C0FFD73E370000A75F9549  Ctlr A
0     3     0x00C0FFD73E370000A75F9549  Ctlr A
0     8     0x000000000000000000000000  Ctlr A
0     9     0x00C0FFD73E0A0000B9BCC749  Ctlr B
0     10    0x00C0FFD73E0A0000B9BCC749  Ctlr B
0     11    0x00C0FFD73E0A0000B9BCC749  Ctlr B

After:

ID  Serial#               Vendor   Rev.  State      Type  Size(GB)  Rate(Gb/s)  SP
------------------------------------------------------------------------------
0   3QQ0KKWA00009920GG2J  SEAGATE  0004  VDISK      SAS   450       3.0
1   3QQ0FXWF00009916MD6M  SEAGATE  0004  VDISK      SAS   450       3.0
2   3QQ0KKA600009920HYR5  SEAGATE  0004  VDISK      SAS   450       3.0
3   3QQ0KJ9D00009920ZS1A  SEAGATE  0004  VDISK      SAS   450       3.0
8   3QQ0MKD5000099213ED4  SEAGATE  0004  VDISK      SAS   450       3.0
9   3QQ0FDNJ000099144N27  SEAGATE  0004  VDISK      SAS   450       3.0
11  3QQ0KKVW00009920GG7A  SEAGATE  0004  VDISK      SAS   450       3.0
------------------------------------------------------------------------------

-------------------------------------------------------------------------------------
Physical Drive to Virtual Disk Mapping
Encl  Slot  Serial Number               Current Owner
-------------------------------------------------------------------------------------
0     0     0x00C0FFD73E370000A75F9549  Ctlr A
0     1     0x00C0FFD73E370000A75F9549  Ctlr A
0     2     0x00C0FFD73E370000A75F9549  Ctlr A
0     3     0x00C0FFD73E370000A75F9549  Ctlr A
0     8     0x00C0FFD73E0A0000B9BCC749  Ctlr B
0     9     0x00C0FFD73E0A0000B9BCC749  Ctlr B
0     11    0x00C0FFD73E0A0000B9BCC749  Ctlr B
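
To make the difference between the two mappings easier to see, here is a minimal Python sketch that diffs them. The slot, serial and owner values are copied from the two "Physical Drive to Virtual Disk Mapping" tables above; the constant and function names are only illustrative.

```python
# Vdisk serial numbers as they appear in the mapping tables.
VDISK_A = "0x00C0FFD73E370000A75F9549"
VDISK_B = "0x00C0FFD73E0A0000B9BCC749"
UNASSIGNED = "0x000000000000000000000000"

# Slot -> (vdisk serial, current owner), enclosure 0 only.
BEFORE = {
    0: (VDISK_A, "Ctlr A"), 1: (VDISK_A, "Ctlr A"),
    2: (VDISK_A, "Ctlr A"), 3: (VDISK_A, "Ctlr A"),
    8: (UNASSIGNED, "Ctlr A"),
    9: (VDISK_B, "Ctlr B"), 10: (VDISK_B, "Ctlr B"), 11: (VDISK_B, "Ctlr B"),
}
AFTER = {
    0: (VDISK_A, "Ctlr A"), 1: (VDISK_A, "Ctlr A"),
    2: (VDISK_A, "Ctlr A"), 3: (VDISK_A, "Ctlr A"),
    8: (VDISK_B, "Ctlr B"),
    9: (VDISK_B, "Ctlr B"), 11: (VDISK_B, "Ctlr B"),
}


def diff_mapping(before, after):
    """Return {slot: (old, new)} for changed slots; None means not listed."""
    changes = {}
    for slot in sorted(before.keys() | after.keys()):
        old, new = before.get(slot), after.get(slot)
        if old != new:
            changes[slot] = (old, new)
    return changes


for slot, (old, new) in diff_mapping(BEFORE, AFTER).items():
    print(f"slot {slot}: {old} -> {new}")
```

Only two slots change: slot 8 (the former global spare) moves from Ctlr A with no vdisk to Ctlr B as a member of the affected vdisk, and slot 10 (the failed disk) is no longer listed. So the spare did change owner during the failure, which is the "handover" I am asking about.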
Vahur Aasala
New Member

Re: MSA2012i Connection Issue with failed Drive

Hi

just the other day I had the exact same situation.
Our customer has an MSA2012FC with one RAID controller.

A couple of days ago, the following events were logged in the MSA event log:
---
10-28 16:25:18 59 A680 Disk channel error (Channel:0 ID:0 SN:3LN6GLNN00009908B43J Encl:0 Slot:0): Abort Timeout cdb:Rd 0edc5b20 0010
10-28 16:25:18 114 A679 Drive link down Chan0
10-28 16:25:18 59 A678 Disk channel error (Channel:0 ID:5 SN:3LN6GKB70000990764F2 Encl:0 Slot:5): I/O Timeout cdb:Rd 0ede8150 0010
10-28 16:25:11 59 A677 Disk channel error (Channel:0 ID:5 SN:3LN6GKB70000990764F2 Encl:0 Slot:5): I/O Timeout cdb:Rd 092cc700 0080
10-28 16:25:11 59 A676 Disk channel error (Channel:0 ID:4 SN:3LN6F2QD00009908X3BF Encl:0 Slot:4): I/O Timeout cdb:2 additional
10-28 16:25:11 59 A675 Disk channel error (Channel:0 ID:4 SN:3LN6F2QD00009908X3BF Encl:0 Slot:4): I/O Timeout cdb:Rd 092cc700 0080
10-28 16:25:11 59 A674 Disk channel error (Channel:0 ID:0 SN:3LN6GLNN00009908B43J Encl:0 Slot:0): I/O Timeout cdb:6 additional
10-28 16:25:11 59 A673 Disk channel error (Channel:0 ID:0 SN:3LN6GLNN00009908B43J Encl:0 Slot:0): I/O Timeout cdb:Rd 092cc700 0080

---
And then the events stopped. No other events regarding this were reported.
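
For reference, a short Python sketch that counts the "Disk channel error" entries per slot in the excerpt above. The regular expression is written against the line format shown here and has not been checked against other firmware versions; the names are illustrative.

```python
import re
from collections import Counter

# Event log excerpt copied from the post above.
LOG = """\
10-28 16:25:18 59 A680 Disk channel error (Channel:0 ID:0 SN:3LN6GLNN00009908B43J Encl:0 Slot:0): Abort Timeout cdb:Rd 0edc5b20 0010
10-28 16:25:18 114 A679 Drive link down Chan0
10-28 16:25:18 59 A678 Disk channel error (Channel:0 ID:5 SN:3LN6GKB70000990764F2 Encl:0 Slot:5): I/O Timeout cdb:Rd 0ede8150 0010
10-28 16:25:11 59 A677 Disk channel error (Channel:0 ID:5 SN:3LN6GKB70000990764F2 Encl:0 Slot:5): I/O Timeout cdb:Rd 092cc700 0080
10-28 16:25:11 59 A676 Disk channel error (Channel:0 ID:4 SN:3LN6F2QD00009908X3BF Encl:0 Slot:4): I/O Timeout cdb:2 additional
10-28 16:25:11 59 A675 Disk channel error (Channel:0 ID:4 SN:3LN6F2QD00009908X3BF Encl:0 Slot:4): I/O Timeout cdb:Rd 092cc700 0080
10-28 16:25:11 59 A674 Disk channel error (Channel:0 ID:0 SN:3LN6GLNN00009908B43J Encl:0 Slot:0): I/O Timeout cdb:6 additional
10-28 16:25:11 59 A673 Disk channel error (Channel:0 ID:0 SN:3LN6GLNN00009908B43J Encl:0 Slot:0): I/O Timeout cdb:Rd 092cc700 0080
"""

# Matches the disk identification part of a "Disk channel error" line.
PATTERN = re.compile(
    r"Disk channel error \(Channel:\d+ ID:\d+ SN:(\S+) Encl:(\d+) Slot:(\d+)\)"
)


def errors_per_slot(log: str) -> dict:
    """Count logged disk channel error lines per slot number."""
    counts = Counter()
    for line in log.splitlines():
        match = PATTERN.search(line)
        if match:
            counts[int(match.group(3))] += 1
    return dict(counts)


print(errors_per_slot(LOG))
```

For this excerpt the count is 3 entries for slot 0 and 2 each for slots 4 and 5, so three different disks reported timeouts within about 7 seconds, alongside the "Drive link down Chan0" event. The "N additional" entries are counted as single lines here; whether they stand for N suppressed events is an assumption I have not verified.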

This MSA is connected to a Windows 2008 server with the Hyper-V role.
The problem was that for about 1 minute and 30 seconds the MSA volume was "lost" to Win2008. Of course this means that the Hyper-V virtual servers went down (and one of them got corrupted).

The next morning the MSA box was restarted, and after that all drives and volumes report OK. There seem to be no other problems at the moment.

I'm wondering what could have happened, and whether we can expect these surprises in the future.

I suspect that the current firmware is not the latest, so I'm planning to upgrade it.
inti
Advisor

Re: MSA2012i Connection Issue with failed Drive

Hm, but in your case no disk was replaced by a spare?

I never found out what the issue was. From then until today the storage has been stable. I think I did a firmware upgrade some time after the issue, but (luckily) we had no further disk faults to check the behaviour.

My guess is still that either the disk faulted in a very "bad" way or the spare migration from Controller A to Controller B took too long.
Vahur Aasala
New Member

Re: MSA2012i Connection Issue with failed Drive

Hi,

our customer's MSA has 2 vDisks, one with SATA drives and the other with SAS drives, and there is no spare drive for the SAS vDisk. Of course (per Murphy), the "disk channel error" was reported only for the drives in the SAS vDisk. But since the problem was reported for 3 different physical SAS disks, I don't know whether a spare would have made any difference anyway.

Anyway, all the volumes on the SAS vDisk are fine after the problems (it was a RAID-6 vDisk). Weird stuff.