HPE SimpliVity

SimpliVity host taking a long time to re-join cluster following a reboot

 
jlangmead
Regular Advisor

Hi

I've just completed a new deployment of a SimpliVity 4.1.3 two-node cluster, which has been upgraded to 4.1.3.95 on the OVCs and ESXi 7.0.U3k on the nodes. The VCSA is running the latest supported build as detailed in the interop guide.

The initial deployment completed successfully and the cluster works as expected in every sense, with one exception: one of the hosts is slow to shut down (it takes about 10 minutes), and after the node boots back to the login screen it takes about another 10 minutes before the VCSA shows the host as back in the cluster. During that time the node is pingable but shows as 'not responding' in vCenter. Any attempt to log into the host Web Client shows a '503 Service Unavailable' error, similar to this technote: Accessing the ESXi host through the Host Client UI fails with error: "503 Service Unavailable" (2144962) (vmware.com)

After about 10 minutes the VCSA shows the node as back in the cluster and the Web Client becomes accessible for root logins. The other node doesn't exhibit this behavior: it reboots without any issues, and as soon as it completes the boot the VCSA shows it as back in the cluster and the Web Client is immediately accessible.
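To put a number on the delay for each node, something like the following could be run from a workstation right after the host reaches the login screen. This is only a rough sketch, not an HPE or VMware tool: it assumes the Host Client URL answers with HTTP 200 once the management services are up, that 503 or a refused connection means "not ready yet", and that the host uses a self-signed certificate. The function names and the host name in the usage note are made up for illustration.

```python
import ssl
import time
import urllib.error
import urllib.request
from typing import Callable, Optional


def make_https_probe(url: str) -> Callable[[], int]:
    """Return a function that fetches `url` once and reports the HTTP status code."""
    ctx = ssl.create_default_context()
    # Assumption: the host presents a self-signed certificate, so skip verification.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE

    def probe() -> int:
        try:
            with urllib.request.urlopen(url, timeout=5, context=ctx) as resp:
                return resp.status
        except urllib.error.HTTPError as err:
            return err.code  # e.g. 503 while the management services are still starting

    return probe


def wait_until_ready(probe: Callable[[], int],
                     timeout_s: float = 1200.0,
                     interval_s: float = 10.0,
                     clock: Callable[[], float] = time.monotonic,
                     sleep: Callable[[float], None] = time.sleep) -> Optional[float]:
    """Poll `probe` until it returns 200. Return elapsed seconds, or None on timeout."""
    start = clock()
    while True:
        elapsed = clock() - start
        try:
            status = probe()
        except OSError:
            status = 0  # connection refused or timed out: treat as not ready
        if status == 200:
            return elapsed
        if elapsed >= timeout_s:
            return None
        sleep(interval_s)
```

For example, `wait_until_ready(make_https_probe("https://esxi-node2.example.local/ui/"))` would return the number of seconds until the first 200 response (the host name and path here are placeholders). Running it against both nodes after a reboot gives a like-for-like comparison instead of "about 10 minutes".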

I'm looking for troubleshooting options to prove whether this is hardware or software, and whether the best remedial option is perhaps to re-image this node and redeploy it. I'd rather not take that route if it might be a case of fixing it some other way. There are some VMware technotes that talk about editing the /etc/vmware/rhttpproxy/endpoints.conf file and removing some lines - but these lines don't exist anyway, and the file looks to be identical to the file on the host that reboots fine.
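Rather than comparing the two endpoints.conf files by eye, a line-by-line diff of copies pulled from each host makes "looks identical" checkable. The sketch below uses Python's standard difflib; the function name and the local file names in the usage note are assumptions for illustration, and it ignores trailing whitespace only.

```python
import difflib
from typing import List


def diff_configs(good_text: str, slow_text: str,
                 good_name: str = "good-host",
                 slow_name: str = "slow-host") -> List[str]:
    """Return unified-diff lines between two config files; an empty list means no difference."""
    # Strip trailing whitespace so stray spaces or line-ending differences do not show up.
    good_lines = [line.rstrip() for line in good_text.splitlines()]
    slow_lines = [line.rstrip() for line in slow_text.splitlines()]
    return list(difflib.unified_diff(good_lines, slow_lines,
                                     fromfile=good_name, tofile=slow_name,
                                     lineterm=""))
```

With the two files copied locally as, say, `good-endpoints.conf` and `slow-endpoints.conf`, calling `diff_configs(open("good-endpoints.conf").read(), open("slow-endpoints.conf").read())` and printing the result shows any line that exists on only one host. The same approach works for any other config file suspected of differing between the nodes.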

Anyone seen this before or have any thoughts??

Many thanks in advance

3 REPLIES
Mahesh202
HPE Pro

Re: SimpliVity host taking a long time to re-join cluster following a reboot

Hi jlangmead

The slow shutdown and delayed response from the host in vCenter after the reboot could be indicative of an issue with the ESXi host configuration or a potential hardware problem. Here are some troubleshooting steps to help diagnose and resolve the issue:

  1. Ensure that the hardware components (server, storage, network adapters) are compatible with the version of ESXi you are running. Check the VMware Compatibility Guide to verify compatibility.
  2. Double-check the network configuration on the problematic ESXi host. Ensure that the host has a valid IP address, subnet mask, gateway, and DNS settings. Confirm that it can communicate with other hosts, the VCSA, and the SimpliVity cluster.
  3. Examine the ESXi logs, specifically the vmkernel.log, hostd.log, and vpxa.log, for any error or warning messages related to the delayed shutdown and unresponsiveness. Look for any indications of network issues, storage problems, or misconfigurations.
  4. Confirm that the SimpliVity integration components, such as the SimpliVity OVC (OmniStack Virtual Controller) and the SimpliVity vSphere Web Client Plugin, are properly installed and configured on the ESXi host. Check for any specific recommendations or requirements in the SimpliVity documentation.
  5. Ensure that the ESXi host has sufficient resources (CPU, memory, storage) allocated to it. Check the host's resource utilization and compare it to the other host in the cluster.
  6. Review the vCenter Server configuration, including networking, licensing, and permissions, to ensure there are no issues impacting the problematic host. Verify that the vCenter Server is running the latest supported build.
  7. Search the VMware Knowledge Base and forums for any known issues or similar cases related to the symptoms you are experiencing. VMware's support team can also help in diagnosing and troubleshooting the issue.
  8. If all else fails and you've exhausted all troubleshooting options, re-imaging the problematic ESXi host and redeploying it may be a valid solution. However, it is recommended to involve HPE SimpliVity Tech support before taking this step, as they can provide guidance based on their expertise.
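For step 3, a small filter can help narrow the logs down before reading them in full. The following is a minimal sketch, assuming copies of vmkernel.log, hostd.log, or vpxa.log have been pulled from the host and that interesting lines contain one of a few plain keywords; the keyword list and function name are illustrative choices, not an official log schema.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Assumption: these keywords are a reasonable first-pass filter, nothing more.
PATTERN = re.compile(r"\b(error|warning|failed|timed out)\b", re.IGNORECASE)


def scan_log(lines: Iterable[str]) -> Tuple[Counter, List[str]]:
    """Return (count per first matching keyword, list of matching lines)."""
    counts: Counter = Counter()
    hits: List[str] = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group(1).lower()] += 1
            hits.append(line.rstrip("\n"))
    return counts, hits
```

Running `scan_log(open("hostd.log", errors="replace"))` on the same log from both hosts and comparing the counts is a quick way to see whether the slow node logs noticeably more warnings or failures around boot and shutdown.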

Remember to perform thorough backups and document any configuration changes before proceeding with any major changes or re-imaging.

Hope this helps!

Regards
Mahesh.

 

 

 



I work at HPE
HPE Support Center offers support for your HPE services and products when and how you need it. Get started with HPE Support Center today.
[Any personal opinions expressed are mine, and not official statements on behalf of Hewlett Packard Enterprise]
jlangmead
Regular Advisor

Re: SimpliVity host taking a long time to re-join cluster following a reboot

@Mahesh202 

So we checked everything and all settings checked out OK. In the end we went for the factory reset and re-image, which has worked, and now everything seems OK.

However, one further issue remains. When we use the plugin to shut down the OVC, the task to wait for HA compliance starts and shows either 50% or 100% complete but never actually finishes. The OVC itself is shutting down in the background and will eventually power off after a 15-minute timeout. This is a two-node cluster and we see the same behaviour whichever OVC we shut down. If we migrate a VM between hosts, this completes OK: we get the yellow warning whilst the VM syncs following the vMotion, but this clears after a few seconds as expected. When we run svt-vm-show, all the VMs are compliant.

Why are we getting this error and why only when we issue an OVC shutdown?

When we restart the OVC, the 'waiting for storage HA compliance' task then completes.

Many thanks

 

Mahesh202
HPE Pro

Re: SimpliVity host taking a long time to re-join cluster following a reboot

Hi jlangmead

The behavior you described, where the task to wait for HA compliance hangs at either 50% or 100% when shutting down an OVC through the plugin, can be indicative of a delay or issue with the synchronization process between the OVCs in the cluster. Here are a few steps you can take to further investigate and potentially resolve the issue:

  1. Verify that the network connectivity between the OVCs is stable and reliable. Ensure that there are no network misconfigurations, firewall rules, or other network-related issues that could cause delays or interruptions in communication between the OVCs.
  2. Check the status of the storage synchronization process between the OVCs. Ensure that the storage network is functioning correctly and that there are no errors or issues reported in the SimpliVity management interface or logs related to storage synchronization.
  3. Verify the health and availability of the underlying storage infrastructure. Ensure that all storage devices, storage controllers, and storage networks are functioning properly and are not experiencing any performance or connectivity issues.
  4. Ensure that both OVCs in the cluster are running the same version of the SimpliVity software. A version mismatch between the OVCs could potentially cause synchronization issues and lead to the behavior you're experiencing.
  5. It's worth noting that the behavior you observed might not necessarily indicate a critical issue if the OVC eventually shuts down and restarts successfully. However, to ensure the proper functioning and synchronization of the OVCs, I would suggest you consult HPE SimpliVity Tech support to investigate and address the root cause of the delayed HA compliance task.
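For step 4, once the version string reported by each OVC has been noted down, a trivial check like the one below avoids misreading a long build number. It is only a sketch: the node names are placeholders, and the versions must be collected by hand or by whatever means you already use, since nothing here queries SimpliVity.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '4.1.3.95' into a tuple of integers."""
    return tuple(int(part) for part in version.strip().split("."))


def group_by_version(versions: Dict[str, str]) -> Dict[Tuple[int, ...], List[str]]:
    """Group node names by parsed version; more than one group means a mismatch."""
    groups: Dict[Tuple[int, ...], List[str]] = defaultdict(list)
    for node, version in versions.items():
        groups[parse_version(version)].append(node)
    return dict(groups)
```

For a two-node cluster, `group_by_version({"ovc-a": "4.1.3.95", "ovc-b": "4.1.3.95"})` returns a single group when both controllers match; two groups would point to the version mismatch described in step 4.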

I hope these suggestions help you in resolving the issue with the HA compliance task during the OVC shutdown.

Regards
Mahesh.
