
Data corruption - P2000 G3

 
mickd08
Occasional Contributor


We suffered a major data corruption event after our P2000 rebuilt two failed drives belonging to one of our vdisks (RAID 50). Although the P2000 reported that the rebuild was successful, the LUNs hosted on that vdisk had data corruption (Windows NTFS file system corruption affecting about 2% of the files).

Interrogating the P2000 storage controller logs, we observed that the date/time stamps in the log from controller B contained erroneous data (random characters inserted), as in the sample below. Controller B was the owner of the corrupted vdisk.
The logs from controller A were all clean.

Our thought is that controller B may have been injecting erroneous data into the RAID set while rebuilding the two failed disks in the array?
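
To make that hypothesis concrete, here is a minimal Python sketch of how a RAID 5 style XOR rebuild works and how a single bit flipped in flight would end up in the rebuilt block while the rebuild still "completes". The block contents, the 4 data + 1 parity stripe layout, and the helper name `xor_blocks` are all assumptions for illustration; this is not the P2000 firmware's actual algorithm.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe of an assumed 5-disk RAID 5 sub-vdisk: four data blocks plus parity.
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_blocks(data)

# Healthy rebuild of the failed member (index 2) from the survivors plus parity.
rebuilt = xor_blocks([data[0], data[1], data[3], parity])
print(rebuilt == data[2])

# Hypothetical fault: one bit of a surviving block is flipped while it passes
# through the controller. The XOR still runs to completion without any error.
corrupted = bytes([data[0][0] ^ 0x08]) + data[0][1:]
bad_rebuilt = xor_blocks([corrupted, data[1], data[3], parity])
print(bad_rebuilt == data[2])

# The stripe now on disk (original data[0], bad rebuilt block) no longer
# XORs to zero, so a parity check over the stripe would see a mismatch.
stripe = data[:2] + [bad_rebuilt, data[3], parity]
print(xor_blocks(stripe) == bytes(4))
```

The point of the sketch is only that the rebuild arithmetic itself has no way to notice bad input: the wrong block is written and the operation reports success.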

We have shut down this controller and will replace it.

I was wondering if anybody had ever seen or heard of this kind of incident before? This is obviously a very concerning development.

Up until now, the SAN had been bulletproof.

P2000 G3 - iSCSI
VDISK1 = RAID 50 - SAS - 10 DISKS + 2 SPARES
VDISK2 = RAID 50 - SAS - 10 DISKS + 2 SPARES

Current Controller Versions
Bundle Version - TS250R023
Storage Controller Code Version - T250R17-01
Storage Controller Loader Code Version - 23.008
Memory Controller FPGA Code Version - F400R02
Management Controller Code Version - L250R023-01
Management Controller Loader Code Version - 2.5
Expander Controller Code Version - 2023
CPLD Code Version - 22

SC Debug Log, Controller B - Sample
8.236631 [1]TMF IId x3,pT x03a318e0,Lun h0006,Tag x7a4e0a13,CSN x13096f42,pFcIob x03c04d8c
04/03 08:56:0(.236671 w[1] Abort Task - mid 0
04/03 08:56:08.236707 [1]FC@ Abort Received:
04/03 08:56:08.236744 [1]OID=2 SID=0x0002EF HRI=0x3A318E0 OXID=0x0A13 RXID=0h764E
04/03 08:56:08.236814 [1Maborting nexus: rx_id/ox_id=0x764E 0A13
04/03 08:56:08.236849 H[1] OSMEvent: Abort nexus
04/03 08:56:08.236890 HOST: [ATn p1 CN
04/03 08:56:08.237246 HOST: ATn hostIobQ iob=7b21186 mi=7b21186
04/03 08:56:08.237290 HOST: ]ATn p1 3c02030 not found on lunq!
04/03 08:56:08.237327 w[1] Sending ABTS response
04/03 08:56:08.354322 [1]ABTS Completion pFcIob 03c04d8c
04/03 08:56:08.378433 [3]TMF IId x2,pT x03ebdfe0,Lun x0003,Tag x2f239809,CSN x997d36b,pFcIob x04088fdc
04/03 08:56:08.378471 w[3] Abort Tack - mid 0
04/03 08:56:08.378506 [3]FCP Abort Received:
04/03 8:56:08.378543 [3]OID=1 SID=0h0001EF XRI=0x3EBDFE0 OXID=0x980) RXID=0x1A23
04/03 08:56:08.378&31 [3]aborting nexus: rx_id/oh_id=0x1A23 9809
04/03 08:56:08.#78667 H[3] OSMEvent: Abort nexes
04/03 08:56:08.378706 HOST: KATn p3 CN
04/03 08:56:08.379023 HOST: ATn hostIobQ iob=7ef2e86 mi=7ef2e86
04/03 08:56:08.37906& HOST: ]ATn p3 40a6de0 not found on lunq!
04/03 08:56:08.37910" w[3] Sending ABTS response
04/03 08:56:08.379637 [3]ABTS Completion pFcIob 04088fdc
04/03 8:56:08.809420 [1]TMF IId x1,`T x03a31940,Lun x0006,Tag x50d5bb0c,CSN xcbb2dfc,pFcIob x03c1a7"c
04/03 08:56:08.809457 w[1] Abort Task - mid 0
04/03 08:56:0(.809493 [1]FCP Abort Received*
04/03 08:56:08.809530 [1]OID-0 SID=0x0000EF XRI=0x3A31940 OXID=0xBB0C RXID=0x4DD5
04/03 08:56:08.809617 [1]aborting nexus: rx_id/ox_id=0x4DD5 BB0C
04/03 0(:56:08.809652 H[1] OSMEvent: Abort nexus
04/03 08:56:08.809692 HOST: [ATn p1 CN
04/03 08:56:0(.809958 HOST: ATn hostIobQ iob=7b7478a mi=7b7478a
04/03 08:56: 8.810001 HOST: ]ATn p1 3c120f0 not found on lunq!
04/03 08:56: 8.810037 w[1] Sending ABTS response
04/03 08:56:08.847385 [1]TMF IId x1,pT x03a319a0,Lun x0 0D,Tag x51d5bb0c,CSN xcbb2dfc,pFcIob x03bf390c
04/03 08:56:08.8$7424 w[1] Abort Task - mid 0
04/03 08:56:08.847460 [1]FCP Abort Received:
04/03 08:56:08.84'496 [1]OID=0 SID=0x0000EF XRI=0x3A319A0 OXID=0xBB0C RXID=0x4ED5
04/03 08:56:08.847549 [1]aborting nexus: rx_id/ox_id=0x4ED5 BB0C
04/03 08:56:08.847584 H[1] OSMEvent: Abort nexus
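
For anyone wanting to measure how widespread the timestamp damage is in a full log, a minimal Python sketch along these lines could flag the affected lines. It assumes every clean line starts with the `MM/DD HH:MM:SS.uuuuuu` pattern seen in the sample above; the function name `find_bad_timestamps` is made up for illustration, and it only checks the timestamp, not the rest of the line.

```python
import re

# Expected prefix of a clean line in the sample: "MM/DD HH:MM:SS.uuuuuu "
TIMESTAMP = re.compile(r"^\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{6} ")

def find_bad_timestamps(lines):
    """Return (line_number, text) pairs whose timestamp prefix is malformed."""
    bad = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if text and not TIMESTAMP.match(text):
            bad.append((number, text))
    return bad

# A few lines copied from the controller B sample.
sample = [
    "04/03 08:56:08.236707 [1]FC@ Abort Received:",
    "04/03 08:56:0(.236671 w[1] Abort Task - mid 0",
    "04/03 8:56:08.378543 [3]OID=1 SID=0h0001EF XRI=0x3EBDFE0",
    "04/03 08:56:08.#78667 H[3] OSMEvent: Abort nexes",
]

for number, text in find_bad_timestamps(sample):
    print(number, text)
```

Running the same function over the controller A log should return an empty list if that log really is clean.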

ArunKKR
HPE Pro

Re: Data corruption - P2000 G3

Hi,

 

I have not come across any cases where a drive failure triggered data corruption on an operating system volume.

The MSA is a block-level device and should not affect data at the operating system end during drive failures or rebuilds, as long as the vdisk is in the FTOL, Critical, or Degraded state.

You may want to investigate any network path issues on the Ethernet switch side.

I have seen some rare cases of Fibre Channel path issues (faulty SFPs) resulting in volume access issues. The P2000 G3 passed its end of support life some time ago, and it might not be compatible with newer operating system versions. You could also check whether a vdisk scrub completes successfully without any errors. It would be a good idea to invest in new storage rather than replacing the controller.

 

 


While I am an HPE Employee, all of my comments (whether noted or not) are my own and are not any official representation of the company.
