Re: VMware ESXi w/ iSCSI boot - controller failover behavior

 
dmweimer52
New Member
Solution

I have the exact same issue with my iSCSI boot volumes and I found a solution/workaround.

If you lose connectivity to the NIC that runs the boot LUN (switch reboot, cable disconnect, controller reboot/failover, etc.), you will see the following error: "Lost connectivity to the device backing the boot filesystem. As a result, host configuration changes will not be saved to persistent storage."

This error is displayed because iSCSI boot does not support multipathing. If connectivity is lost between the controller on the Nimble and the NIC on the host, the host can no longer access its boot LUN and cannot write logs, etc.

The good news is that the whole ESXi OS is loaded into memory, so there is no outage for the VMs or the hosts. Once connectivity is restored the host can access the storage again. The bad news is that the error does not clear automatically.

I can neither confirm nor deny that the host does in fact reestablish connectivity automatically after the failover and is able to write logs again even while it is still displaying the error message. I suspect that this is the case, but perhaps someone with a deeper understanding can speak to that.

The easiest way to fix this error/warning is to put the host into maintenance mode and reboot it. Unfortunately, this takes time and requires lots of vMotion activity.

The other way to resolve this (and it can be done without a reboot) is to restart the management agents on the host. This can be done in two ways:

1) Use the remote KVM of each host, log into the ESXi console and follow the menu options to restart the management agents.

2) SSH into each host and run the commands to restart the management agents.

I'm including a link here to a blog post that outlines these processes well.

https://fvandonk.wordpress.com/2014/01/08/iscsi-boot-disk-disconnect-fix/

gary_martin
Occasional Advisor

Hi,

I just picked up this thread because I am about to start moving our ESX boot volumes from NetApp to Nimble. It's quite timely too, as I recently had an issue where I took down a NetApp (one node in a cluster), but because it was running in single image mode the boot configuration had locked in the boot path (as found above, there is no multipathing for the boot volume). This killed off a few hosts and did some things I didn't like. I was hopeful that moving to Nimble might free me of this configuration, but it looks like it might actually be ever so slightly worse (I was dumb to take down my NetApp node, as I had disabled clustering).

So, it looks like I will need to rebuild my ESX hosts with a slightly bigger datastore (they currently boot from a 1GB LUN, with no local datastore and a remote datastore for swap/logs), booting from Nimble. I'm almost tempted to add local disks to my servers, but that seems like a waste of UCS (I'm trying to keep the hosts stateless).

I'll keep the information here in mind. I found a post on the VMware Communities site about using PowerShell and PowerCLI to restart the management agents (which might be quicker than enabling SSH or opening a KVM console session on each box).

PowerCLI command to restart management agents o... | VMware Communities

The script is:

Get-VMHostService -VMHost MyEsx | where {$_.Key -eq "vpxa"} | Restart-VMHostService -Confirm:$false -ErrorAction SilentlyContinue


Could probably get hosts in a cluster and pipe it into that command to restart on each host, maybe introduce a sleep between each one so they don't all stop responding at the same time.
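
Something along these lines might do it. This is only an untested sketch of the idea above: it assumes you are already connected to vCenter with Connect-VIServer, and both the cluster name (MyCluster) and the 60 second pause are placeholders I made up, so adjust them for your environment. Like the one-liner, it only restarts vpxa.

```powershell
# Hypothetical cluster name; replace with your own.
$clusterName  = 'MyCluster'
# Assumed pause between hosts, in seconds; tune for your environment.
$delaySeconds = 60

# Assumes an existing PowerCLI session (Connect-VIServer has already been run).
$vmHosts = Get-Cluster -Name $clusterName | Get-VMHost | Sort-Object -Property Name

foreach ($vmHost in $vmHosts) {
    Write-Host "Restarting vpxa on $($vmHost.Name)"

    # Same pipeline as the one-liner above, applied to one host at a time.
    Get-VMHostService -VMHost $vmHost |
        Where-Object { $_.Key -eq 'vpxa' } |
        Restart-VMHostService -Confirm:$false -ErrorAction SilentlyContinue

    # Stagger the restarts so the hosts do not all stop responding at once.
    Start-Sleep -Seconds $delaySeconds
}
```

Note that -ErrorAction SilentlyContinue hides failures, so it may be worth dropping it on a first run against a single host to see what actually happens.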


Ideally VMware would allow the configuration location (/etc) to be moved to a datastore (where MPIO would be available), but then there is really an overlap between that and PXE booting. I'd love to do PXE boot, but we don't have the kind of money required for Enterprise Plus licensing. Failing that, even just a way to adjust that disk timeout, other than the one already tried, would help.