SimpliVity 380 3.7.8 with GPU instability

FredZone · ‎05-31-2019

Hello,

I have installed two different environments, each with 2x SimpliVity 380 (one in direct connect), without any specific problems.

Recently I had to install another environment: 2x SimpliVity 380 with NVIDIA GPUs in 10Gb direct connect.
The installation went fine. After a few days, the first node crashed completely for an unknown reason. Because the environment was not in production, I completely reinstalled both nodes (factory reset).
Everything went fine for 2-3 weeks, until some VMs showed the warning sign "Lost Sync" and the first ESXi host appeared as "disconnected". I could still ping the ESXi server, the OmniStack, and the VMs running on this host, but it was impossible to connect to the ESXi console (web, SSH or via iLO) or to the OmniStack. There was no option other than rebooting the server.
After the reboot, the OmniStack couldn't be synchronised. Errors started to appear on the VMs running on the second node/OmniStack (errors on start, errors reaching some files, ...).

I opened a ticket with HPE and the marathon started:
The engineer analyzed some logs and took some support bundles ... but no reason was found for the crash.
He tried to reinitialize the failing OmniStack for two days (launching scripts, editing the VC MOB, editing the disks seen in the OmniStack with a hexadecimal editor, ...) without any success. He said afterwards that there was no option other than redeploying the node completely from the deployment manager.
The node was redeployed and the OmniStack resynced everything. HPE support said that everything was fine and closed the case.
Unfortunately, a few hours later 7 VMs were lost ... they disappeared from the storage. We restored them from backup.
After that, we tried to re-create some VDI pools, but they couldn't access the datastores, so we deleted and re-created the datastores from the SimpliVity interface. It worked, but now we get related errors after a recompose on all VMs: File system specific implementation of loct(file) failed ... The customer has lost confidence in SimpliVity.
I have the feeling that the storage is corrupted and that we will have storage issues all the time ...
I will open a new ticket with HPE to see what they say, but for the moment I'm disappointed by their support.

Has anybody experienced something similar? Or does anyone have an idea?

Best regards,
FredZone

DeclanOR · ‎05-31-2019

Hi @FredZone

Thanks for using the forum.

I am curious about the exact case here. Don't get me wrong... you obviously had an experience which wasn't to your liking, and for that I apologise... but having been a senior support engineer on the SimpliVity platform for a number of years now, some of your statements seem a bit odd, i.e., the sequence of events doesn't make sense.

For example, we don't just "lose" VMs. If you did indeed lose VMs, then you must have had a serious hardware issue or something like that, and also a single-node environment. Any cluster of two or more nodes provides automatic protection from such potential issues by providing SimpliVity storage HA and subsequent failover to secondary data copies.

I don't think it will be possible to provide a solution or ideas to your query without understanding the exact issue correctly, with all the facts.

If you have not already created a new support ticket, please do so. If you want to post the case ID here, I can follow up as well and check what's happening with it.

Thanks,

DeclanOR

I am an HPE Employee

FredZone · ‎05-31-2019

Hi DeclanOR,

Thanks for your quick answer.
I also thought about hardware issues, but no errors have been seen.

Support said that "something" changed in the environment. We inspected all the layers and everything seemed normal.

Here are the two ticket numbers:
5338444463 ==> should be closed
5339021888 ==> just opened now

Thanks in advance if you can do something,

FredZone

DeclanOR · ‎05-31-2019

Hi @FredZone

Thanks for providing the info.

I have looked at both cases, and both are open and pending an update from you.

I recommend that you respond to the requests on both cases and allow our engineers to complete any work needed.

The original case is very well documented and I can see the sequence of events. I will reach out to the owner of the original case and confirm a couple of things, but please continue to work on and respond to the open cases and allow our engineers to complete all necessary work.

Thanks again,

DeclanOR
