06-07-2019 03:49 AM - edited 06-07-2019 04:49 AM
Strange issues, need your advice :)
Hello,
We have a strange problem. First of all let me explain our setup:
We have 2x StoreVirtual 4330 units, each equipped with 4x 3.8TB Micron 5100 Pro SSDs in a RAID 5 configuration. We run LeftHand OS 12.5, and the two units are in Network RAID 10 (each one is synced to the other). There is also a FOM available to determine which storage is offline. They have been running since last October.
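For readers sizing a similar setup, here is a rough sketch of the capacity arithmetic behind this layout. It assumes plain RAID 5 (one drive's worth of parity per node) and a two-way mirror between the nodes, and it ignores formatting and LeftHand OS overhead, so real usable space will be lower. The function names are made up for illustration.

```python
def raid5_usable_tb(drive_count: int, drive_tb: float) -> float:
    """Raw RAID 5 capacity: one drive's worth of space goes to parity."""
    if drive_count < 3:
        raise ValueError("RAID 5 needs at least 3 drives")
    return (drive_count - 1) * drive_tb


def mirrored_pair_usable_tb(node_a_tb: float, node_b_tb: float) -> float:
    """Two nodes holding full copies of each other: the smaller node is the limit."""
    return min(node_a_tb, node_b_tb)


# Values from the post: 4x 3.8TB SSDs per node, two nodes mirrored.
per_node = raid5_usable_tb(4, 3.8)
cluster = mirrored_pair_usable_tb(per_node, per_node)
print(f"per node: {per_node:.1f} TB, cluster: {cluster:.1f} TB")
```

Under these assumptions each node contributes about 11.4 TB, and the mirrored pair exposes roughly the same 11.4 TB, since every block is stored on both units.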
Last Thursday one of the drives on SAN2 (bay 3) failed. We got an alert from iLO that it had been disconnected. We tried removing it and reinserting it in its slot in case it was a glitch, and in iLO we saw "ready for rebuild", but the rebuild never actually started.
We rebooted the unit and got the following message during POST:
For some reason it said "Port 1I Box 1 Bays 0,3" although the bay with the problem was bay 3 (if we count bays 1-4) or bay 2 (if we count bays 0-3).
We pressed F2 and the unit booted up and started rebuilding the array, until after 1.5 hours it became sluggish and the VMs started showing high latency. We shut down the server, removed that drive, booted up again, and everything went back to normal.
We replaced it with a new drive. The rebuild started and reached 57%, but after 15 minutes we got the same sluggish behavior and high latency in the VMs, so we started thinking that something else must be wrong. I should mention that without that drive, with the other 3 remaining, everything worked fine without any hint of another problem.
We rebooted the server again and didn't let it boot into LeftHand, so that the array could rebuild itself without affecting the other storage in the cluster.
It completed after several hours.
Error code 1716, as shown in the earlier POST screen, made us think that there was another defective drive which prevented the storage from rebuilding the array. However, the last line of the error message says "Backup and Restore recommended". Not a single word about replacing any drive. I found http://h17007.www1.hpe.com/docs/iss/shared/gen9/error/Advanced/Content/242703.htm and https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-c02785607 which recommend fsck, or reconfiguring the array and restoring from backup. Nothing about replacing disks there either.
Now the strange things start...
- I booted up with the SPP and ran Insight Diagnostics. In "Status" I saw for each drive: "This drive IS functioning within the proper operating specifications and should NOT be replaced"
- In "Test" I checked the 4 SSDs. Drives 1 and 3 failed the "Random Read Test" and "Scattered Read Test", but all drives passed the "SMART Error Test". Drive 3 here is the *replacement* drive.
- Running smartctl -t long on the drives returns "Completed without error" for the extended offline test.
- Running smartctl shows the last 5 errors for each drive (except the replacement, which has 0 errors logged), and ALL of them have this: "occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)". They all state that the last errors happened at 0 hours, so they should have nothing to do with the storage operation.
- We brought the SAN1 storage offline and ran the same procedures, and also got failures on 2 SSDs, although it had been working fine without anything being reported to us.
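The check on the smartctl error log above can be automated across many drives. The following is a minimal sketch that only assumes the log contains lines with the wording quoted in this post ("occurred at disk power-on lifetime: N hours"); the function names and the sample text are made up for illustration and are not output from the drives discussed here.

```python
import re

# Matches the phrase quoted in the post and captures the hour count.
LIFETIME_RE = re.compile(r"occurred at disk power-on lifetime: (\d+) hours")


def logged_error_hours(smartctl_output: str) -> list[int]:
    """Return the power-on hour stamp of every logged error, in log order."""
    return [int(hours) for hours in LIFETIME_RE.findall(smartctl_output)]


def all_errors_at_zero_hours(smartctl_output: str) -> bool:
    """True only if at least one error is logged and every one is stamped 0 hours."""
    hours = logged_error_hours(smartctl_output)
    return bool(hours) and all(h == 0 for h in hours)


# Hypothetical sample text, for illustration only.
sample = (
    "Error 2 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)\n"
    "Error 1 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)\n"
)
print(logged_error_hours(sample), all_errors_at_zero_hours(sample))
```

In practice the text would come from saving each drive's smartctl error log to a file and passing the file contents to these functions. A result of True only says the logged errors carry a 0-hour stamp; it does not by itself prove the drive is healthy.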
1) Is it possible that the drives failed HP's Random Read and Scattered Read tests because they are 3rd party drives?
2) Should I use the disks again, since even the recommendation from HP's own software is not to replace them? Also, smartctl shows no errors on the extended offline test.
3) Any idea why error code 1716 showed up in POST and kept showing up until I deleted the array?
4) Any idea why the RAID 5 rebuild caused that sluggish behaviour, resulting in high latency on the VMs that were on the other mirrored storage?
I should mention that these drives are the ones that HPE also brands.
George Vardikos
06-11-2019 03:49 AM - edited 06-11-2019 03:51 AM
Re: Strange issues, need your advice :)
Hi George,
Thank you for contacting HPE Community. I checked the disk numbering in the StoreVirtual 4330 and it starts from 1.
The 1716 error appears when a drive has an Unrecoverable Medium Error. Please refer to the following article.
https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-c01702138
HPE does not recommend using third-party hard drives, as they can exhibit unexpected behavior. Any HPE Authorized Drive will have an Option Part Number and can be bought from a reseller.
Could you please check the current status of the drives from the CMC? Are they showing fine?
The CMC also gives the option to run Diagnostics on the hardware components.
Could you please let us know if you have any further queries?
[Any personal opinions expressed are mine, and not official statements on behalf of Hewlett Packard Enterprise]
06-11-2019 11:11 AM
Re: Strange issues, need your advice :)
Hi Mohamed,
CMC shows the drives in OK status. The only thing that shows a problem is Insight Diagnostics (even smartctl returned a normal long test result).
The drives may be 3rd party, but they are from the same vendor and have the exact same model number, as you can see in the last picture.
The 1716 error does indicate that there was an Unrecoverable Medium Error. However, its troubleshooting steps, even in the link you provided, don't say "replace drives", only "Sequential write operations to the affected blocks should resolve the media errors.". Is it possible that the 1716 error is at the logical volume level (a problem caused by the controller in the RAID 5 configuration) and not a hardware error?
George Vardikos
06-12-2019 05:55 AM - edited 06-12-2019 05:56 AM
Re: Strange issues, need your advice :)
Hi George,
Thank you for your update. Even though the third-party hard drives are of the same model, capacity, and vendor, they will not be running the custom firmware meant for HPE drives. This causes unexpected behavior, and hence we don't recommend using third-party hard drives.
Error 1716 can refer to medium errors caused by the hard drives, the hard drive firmware, or the controller firmware, and the cause has to be verified from the ADU Report and Slot Logs.
Please let me know if you have any further queries.
[Any personal opinions expressed are mine, and not official statements on behalf of Hewlett Packard Enterprise]