StoreVirtual Storage

Strange issues, need your advice :)

 


Hello,

We have a strange problem. First of all let me explain our setup:

We have 2x StoreVirtual 4330 units, each equipped with 4x 3.8TB Micron 5100 Pro SSDs in a RAID 5 configuration. We run LeftHand OS 12.5, and the two units are in Network RAID 10 (each one is synced to the other). There is also a FOM available to determine which storage unit is offline. They have been running since last October.
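For reference, the usable capacity of this layout can be worked out as follows. This is only a rough sketch: it assumes the nominal 3.8 TB per drive, ignores formatting and LeftHand OS overhead, and the function names are made up for illustration.

```python
def raid5_usable_tb(drives: int, drive_tb: float) -> float:
    """RAID 5 spends one drive's worth of space on parity: usable = (n - 1) * size."""
    if drives < 3:
        raise ValueError("RAID 5 needs at least 3 drives")
    return (drives - 1) * drive_tb


def two_node_mirror_usable_tb(node_a_tb: float, node_b_tb: float) -> float:
    """With two nodes mirroring each other, every block exists on both,
    so the usable space is limited by the smaller node."""
    return min(node_a_tb, node_b_tb)


# 4 drives of nominally 3.8 TB per node, as described above.
per_node = raid5_usable_tb(drives=4, drive_tb=3.8)
cluster = two_node_mirror_usable_tb(per_node, per_node)
print(f"per node: {per_node:.1f} TB, mirrored cluster: {cluster:.1f} TB")
```

Each node can lose one drive without losing its array, and the mirror means the cluster keeps serving data while one whole node is offline.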

Last Thursday one of the drives on SAN2 (bay 3) failed. We got an alert from iLO that it had been disconnected. We tried removing it and reinserting it in its slot in case it was a glitch; iLO then showed "ready for rebuild", but the rebuild never actually started.

We rebooted the unit and got the following message during POST:

post.png

For some reason it said "Port 1I Box 1 Bays 0,3" although the bay with the problem was bay 3 (if we count bays 1-4) or bay 2 (if we count bays 0-3).

We pressed F2 and the unit booted up and started rebuilding the array, until after 1.5 hours it became sluggish and the VMs started showing high latency. We shut down the server, removed that drive, booted it again, and everything went back to normal.

We replaced it with a new drive. The rebuild started and reached 57%; however, after 15 minutes we got the same sluggish behavior and high latency in the VMs, so we started thinking that something else must be wrong. I should mention that without that drive, with only the other 3 remaining, everything worked fine without any hint of another problem.

We rebooted the server again and did not let it boot into LeftHand OS, so that the array could rebuild without affecting the other storage unit in the cluster.

The rebuild completed after several hours.

Error code 1716, as shown in the POST screen above, made us think that there was another defective drive which prevented the storage from rebuilding the array. However, the last line of the error message says "Backup and Restore recommended", with not a single word about replacing any drive. I found http://h17007.www1.hpe.com/docs/iss/shared/gen9/error/Advanced/Content/242703.htm and https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-c02785607, which recommend fsck or reconfiguring the array and restoring from backup. Nothing about replacing disks there either.

Now the strange things start...

-I booted from the SPP and ran Insight Diagnostics. Under "Status", each drive showed "This drive IS functioning within the proper operating specifications and should NOT be replaced".

 

drivenotreplaced.png

-Under "Test" I checked the 4 SSDs. Drives 1 and 3 failed the "Random Read Test" and the "Scattered Read Test", but all drives passed the "SMART Error Test". Drive 3 here is the *replacement* drive.

 

scattered2.png

 

-Running smartctl -t long on the drives returns "Completed without error" for the extended offline test.

 

smarttest1.png


-Running smartctl shows the last 5 errors for each drive (except for the replacement, which has 0 errors logged), and ALL of them say: "occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)". Since every logged error occurred at 0 hours, they should have nothing to do with the drives' operation in the storage.

 

smarttest2.png

 

-We also took the SAN1 storage offline and ran the same procedures. 2 of its SSDs failed the tests as well, although that unit had been working fine with nothing reported to us.
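The smartctl checks above can be scripted so that the same comparison is repeated on every drive. The following is a minimal sketch, assuming the text layout that smartctl prints for ATA drives with `-l selftest` and `-l error`. The sample text and the 4000-hour figure are invented placeholders, not output from our drives.

```python
import re
import subprocess

# One row of the ATA self-test log printed by `smartctl -l selftest`.
SELFTEST_ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+(?P<kind>.+?)\s{2,}(?P<status>.+?)\s{2,}"
    r"(?P<remaining>\d+)%\s+(?P<hours>\d+)\s+(?P<lba>\S+)\s*$"
)
# Header line of one entry in the ATA error log printed by `smartctl -l error`.
ERROR_HEADER = re.compile(
    r"^Error\s+(\d+)(?:\s+\[\d+\])?\s+occurred at disk power-on lifetime:\s+(\d+)\s+hours"
)


def run_smartctl(*args):
    """Run smartctl and return its text output.

    smartctl's exit status is a bit mask, so a non-zero code is not
    treated as a failure here.
    """
    result = subprocess.run(["smartctl", *args], capture_output=True, text=True)
    return result.stdout


def parse_selftests(text):
    """Return one dict per self-test row found in the log text."""
    rows = []
    for line in text.splitlines():
        match = SELFTEST_ROW.match(line.strip())
        if match:
            rows.append({
                "kind": match["kind"],
                "status": match["status"],
                "hours": int(match["hours"]),
                "lba": match["lba"],
            })
    return rows


def parse_error_hours(text):
    """Return the power-on hour recorded for each logged error."""
    hours = []
    for line in text.splitlines():
        match = ERROR_HEADER.match(line.strip())
        if match:
            hours.append(int(match.group(2)))
    return hours


# Placeholder text in smartctl's layout (values are invented).
SAMPLE_SELFTEST = """\
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      4000         -
"""
SAMPLE_ERRORS = """\
Error 5 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
Error 4 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
"""

tests = parse_selftests(SAMPLE_SELFTEST)
error_hours = parse_error_hours(SAMPLE_ERRORS)
latest_test = max(t["hours"] for t in tests)
for hours in error_hours:
    print(f"error at {hours} h, {latest_test - hours} h before the latest self-test")
```

On a live system the text would come from something like `run_smartctl("-l", "selftest", "/dev/sda")`. The device path is an assumption, and drives behind a Smart Array controller usually need an extra `-d cciss,N` option.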

1) Is it possible that the drives failed HP's Random Read and Scattered Read tests because they are 3rd party drives?
2) Should I use the disks again, given that even HP's own software recommends not replacing them? Also, smartctl shows no errors on the extended offline test.
3) Any idea why error code 1716 showed up in POST and kept showing up until I deleted the array?
4) Any idea why the RAID 5 rebuild caused that sluggish behaviour, resulting in high latency on the VMs, which were on the other mirror storage?

I should mention that these drives are the ones that HPE also brands.

 

drives.png

 

Sincerely,
George Vardikos
3 REPLIES
Mohamed_I
Frequent Advisor

Re: Strange issues, need your advice :)

Hi George,

Thank you for contacting HPE Community. I checked the disk numbering on the StoreVirtual 4330, and it starts from 1.

Error 1716 appears if any drive has an Unrecoverable Medium Error. Please refer to the following article:

https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-c01702138

HPE does not recommend using third party hard drives as they can exhibit unexpected behavior. Any HPE Authorized Drive will have an Option Part Number and can be bought from a reseller.

Could you please check the current status of the drives in the CMC? Do they show as healthy?

The CMC also gives the option to run Diagnostics on the hardware components.

Could you please let us know if you have any further queries?

 

I am an HPE Employee

Re: Strange issues, need your advice :)

Hi Mohamed,

CMC shows the drives in OK status; the only tool that shows a problem is Insight Diagnostics (even smartctl returned a normal long test result).

The drives may be 3rd party, but they are from the same vendor and have the exact same model number, as you can see in the last picture.

The 1716 error indicates that there was an Unrecoverable Medium Error; however, its troubleshooting steps, even in the link you provided, do not say "replace drives", only "Sequential write operations to the affected blocks should resolve the media errors.". Is it possible that the 1716 error is at the logical volume level (a problem caused by the controller in the RAID 5 configuration) and not a hardware error?

Sincerely,
George Vardikos
Mohamed_I
Frequent Advisor

Re: Strange issues, need your advice :)

Hi George,

Thank you for your update. Even though the third party hard drives are of the same model and same capacity and vendor, they will not be running the Custom Firmware meant for HPE Drives. This causes unexpected behavior and hence we don't recommend using third-party hard drives. 
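One way to see whether drives with the same model number run different firmware is to compare the identity fields that `smartctl -i` reports for ATA drives. The following is only a rough sketch: the helper names are made up, and the sample values are placeholders rather than real model or firmware strings.

```python
def parse_identity(text):
    """Pick the model and firmware fields out of `smartctl -i` text output."""
    wanted = ("Device Model", "Firmware Version")
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in wanted:
            fields[key.strip()] = value.strip()
    return fields


def firmware_by_model(identities):
    """Group firmware versions by model so that mismatches stand out."""
    groups = {}
    for ident in identities:
        model = ident.get("Device Model", "unknown")
        groups.setdefault(model, set()).add(ident.get("Firmware Version", "unknown"))
    return groups


# Placeholder output for two drives (values are invented).
DRIVE_A = """\
Device Model:     EXAMPLE_MODEL_X
Serial Number:    EXAMPLE0001
Firmware Version: FW_VENDOR_1
"""
DRIVE_B = """\
Device Model:     EXAMPLE_MODEL_X
Serial Number:    EXAMPLE0002
Firmware Version: FW_OEM_1
"""

groups = firmware_by_model([parse_identity(DRIVE_A), parse_identity(DRIVE_B)])
for model, versions in groups.items():
    note = "firmware differs" if len(versions) > 1 else "firmware matches"
    print(f"{model}: {sorted(versions)} ({note})")
```

A model that lists more than one firmware string is running a mix of firmware builds, which a matching model number alone would not reveal.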

Error 1716 can result from medium errors on the hard drives, or from hard drive firmware or controller firmware issues; this has to be verified from the ADU Report and Slot Logs.

Please let me know if you have any further queries.

 

 

 

I am an HPE Employee