HPE SimpliVity

A faulty power module led to one SimpliVity host going down, leaving some datastores inaccessible

 
bhwong
Occasional Contributor


Because of upcoming data center maintenance in which alternate power sources will be shut down overnight, we scheduled a test to power off one source and confirm that all our equipment fails over properly.

Unfortunately, one of our SimpliVity hosts has a faulty power module: when source A went down, the host lost power from source B as well, even though source B was still available. When we switched all the power back on, ESXi froze during boot for hours.

In a federation of 3 SimpliVity hosts, even with one host down, the other 2 hosts should be able to power up the VMs from the affected host. But we were caught by surprise: all of these VMs showed as disconnected. Even some of the VMs on the other 2 hosts suddenly powered off as well. And the datastore folders of these affected VMs were all empty!

One of the operating hosts has this error: SimpliVity Datastore Access Impaired Warning

dhooley
HPE Pro

Re: A faulty power module led to one SimpliVity host going down, leaving some datastores inaccessible

Hello @bhwong ,

 

Your understanding is correct, assuming that all VMs were in an HA state when the node went offline. Any VMs that were on that node when ESXi crashed would have gone offline, but they should have been restarted on the other 2 nodes via vSphere HA (if enabled). Any VMs already running on the other two nodes should have stayed running, assuming all configurations are correct.
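
As a rough illustration of that expectation, here is a toy model in Python. It is only a sketch and not SimpliVity's actual placement logic: it assumes each VM keeps two data copies on two different nodes, and the node names, VM names, and helper functions are all hypothetical. Under that assumption, losing any single node out of three should still leave one copy of every VM reachable.

```python
from itertools import combinations


def surviving_copies(placement, failed_node):
    """Map each VM to the nodes that still hold a copy after one node fails."""
    return {
        vm: [node for node in nodes if node != failed_node]
        for vm, nodes in placement.items()
    }


def vms_without_copy(placement, failed_node):
    """List the VMs that have no reachable copy once failed_node is offline."""
    remaining = surviving_copies(placement, failed_node)
    return sorted(vm for vm, nodes in remaining.items() if not nodes)


# Hypothetical 3-node federation; each VM is placed on a distinct pair of nodes
# (assumption: two copies per VM, never both on the same node).
nodes = ["node1", "node2", "node3"]
placement = {
    f"vm{i}": list(pair)
    for i, pair in enumerate(combinations(nodes, 2), start=1)
}

for failed in nodes:
    # With two copies per VM, no single node failure should strand a VM.
    print(failed, "down -> VMs with no copy left:", vms_without_copy(placement, failed))
```

If a VM in this model ever ends up with both copies unavailable, or with only one copy to begin with, a single node failure is enough to make it unreachable, which is why the state of each VM before the node went offline matters.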

I would strongly advise getting a case opened with our support teams as there is quite a bit to go through here and a deep dive into logs will be necessary. Hope this helps!


I work for HPE
BoonHong
Occasional Contributor

Re: A faulty power module led to one SimpliVity host going down, leaving some datastores inaccessible

We have raised 3 support cases:

1. 5355613778 - ESXi Stuck at booting after power failure

HPE support seemed at a loss as to what to do, and after 4 hours on the phone they passed the issue to VMware support. VMware support suggested reimaging ESXi, as each reboot took more than 2 hours and often ended with a 503 Service Unavailable error.

Fortunately, it recovered by itself after the 3rd reboot. We are concerned that the next reboot may end up with the same issue.

2. 5355617308 - SimpliVity Datastore Access Impaired Warning

Our email server was also down, yet HPE support kept trying to contact us at our primary email address, even though we had added an alternative email contact and mentioned that our primary email was down. We are unable to amend the primary email in the support case submission.

With all affected VMs showing as disconnected, there was no way HA could restart them on alternative nodes. The options to power them on or migrate them were not available either. Fortunately, we found a workaround: we removed the node from vCenter and added these VMs back onto the other two nodes manually.

However, we then got an error that no network was assigned to these virtual machines, and all their details were missing, so we were unable to power them up. After a while, the details started to appear, along with a SimpliVity data sync warning, followed by this error: Storage HA protection lost for VM_Name on datastore Storage_Name
 
But what is most worrying is that 4 of our largest VMs had all their files disappear from their SimpliVity datastore folders. Moreover, these VMs were hosted on the other two nodes. They just powered off suddenly. This should not happen, and it never happened in our previous iSCSI shared storage environment.
 
It appears that only after the VMs were added back did SimpliVity start to sync the data between the two hosts. This should happen automatically and should not need to be triggered by adding the VMs back into vCenter. It also means there was no way for us to add the 4 largest VMs back, since their folders contained no files and no vmx file to register.
 
It was only when the affected node from case 1 came back online that all the data returned to these empty folders. If this node had not been able to come back online, could we have lost these 4 VMs for good!?
 
The support response made us doubt SimpliVity's reliability even more: generally if there is any abrupt shutdown of the OVC's we have risk that the VM's ownership will not be claimed by the existing OVC's due to tree versions insufficient.

 

3. 5355618011 - power module faulty

HPE support told me that replacing one power module involves no downtime. They appeared not to be paying attention to my case details, so I had to remind them that this server has a fault: it cannot stay powered on with just one power module. We do not want to take this server down and face a repeat of losing SimpliVity NFS storage access, with the host again unable to boot into the ESXi environment.