StoreVirtual Storage

Virtual Managers/Failover not working, why?

 
Paul Hutchings
Super Advisor

Virtual Managers/Failover not working, why?

OK so I've downloaded the ESX VSA demo and setup two nodes.

I have a single site, and a single cluster containing both nodes.

The cluster has a virtual IP.

I have a couple of volumes, each set to 2-way replication.

I can continuously ping the virtual IP as long as the node acting as the virtual manager is up and running. If I power that node off (either via the CMC, or to simulate the failure of a link or node), I lose the ability to ping the virtual IP, and of course my test server loses access to its iSCSI volumes.

I believe I need to move the virtual manager, but the CMC doesn't seem to give any option to do this when it thinks the virtual manager is running on the failed/down node - it just says the manager is offline. So I go around in a circle: I can't make the other node the virtual manager until I stop the existing virtual manager, which of course I can't do, as that node could be down/on fire - which is the whole point :-)

I guess I'm doing something wrong, but as much as I read the manual I can't work out what.
8 REPLIES 8
René Loser
Frequent Advisor

Re: Virtual Managers/Failover not working, why?

Hi Paul,

The Virtual Manager concept is a manual failover process: you add it to a node only when a failure occurs.

I recommend using the FOM (Failover Manager) running on another host (it could be a virtual server, or run under VMware Workstation or VMware Player). The FOM runs all the time, in place of the Virtual Manager; it is the decision maker in case of a failure and works automatically.

You should find the FOM Image as well on the CD.

Best regards,
reNe

HP Presales Storage
Mike Povall
Trusted Contributor

Re: Virtual Managers/Failover not working, why?

Hi Paul,

Under normal operating circumstances you should not have a virtual manager running - it should be started manually on the surviving node when a failure occurs, thus restoring quorum and access to the volumes.

Using a Failover Manager is definitely the best solution for your environment as it will run all of the time and will maintain quorum and access to the volumes during the periods when one of your nodes is offline for whatever reason.

Regards, Mike.
Paul Hutchings
Super Advisor

Re: Virtual Managers/Failover not working, why?

Thanks guys, a little after posting I worked out that a FOM is what I needed, so I installed one and it works seamlessly (well, there's a few seconds' delay while it figures out what's happening, but as near as dammit seamless).
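If anyone wants to quantify that "few seconds" gap, one rough approach (a sketch, assuming you log timestamped ping results against the virtual IP yourself - this is not CMC functionality) is to compute the longest continuous stretch of failed pings:

```python
def longest_outage(samples):
    """samples: list of (timestamp_seconds, reachable_bool), in time order.
    Returns the longest continuous stretch of failed pings, in seconds."""
    worst = 0.0
    down_since = None
    for t, ok in samples:
        if not ok and down_since is None:
            down_since = t                       # outage starts
        elif ok and down_since is not None:
            worst = max(worst, t - down_since)   # outage ends
            down_since = None
    return worst

# e.g. one ping per second; the VIP is unreachable from t=10 to t=13
samples = [(t, not (10 <= t < 14)) for t in range(20)]
print(longest_outage(samples))  # prints 4
```

Feed it the results of a one-per-second ping loop during a failover test and it gives you the actual VIP dead time.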

Paul
teledata
Respected Contributor

Re: Virtual Managers/Failover not working, why?

The Virtual Manager must be configured in advance.

You may then manually start the virtual manager (on the remaining node) to re-establish quorum.

Using a Virtual Manager does NOT provide high availability, as you WILL lose quorum and have to start the virtual manager manually.

The better design is to create a Failover Manager to provide that automated 3rd manager that will maintain quorum for you...
http://www.tdonline.com
Paul Hutchings
Super Advisor

Re: Virtual Managers/Failover not working, why?

OK next question :-)

I guess this question applies to any sort of redundant storage that is seen as a single addressable "cluster" by ESX:

Suppose you have two sites, A and B.
Each contains some nodes making up a storage cluster.
Each contains some servers, ESX most likely.
A and B are linked by a fast LAN link.

Let's say you lose the link between A and B, but the kit in each location is up and running.

You now have "highly available" storage that is still available in both locations.

You have ESX servers that can each still see the storage local to them, but can't see the other servers in the ESX cluster as the link has gone.

So don't you end up with the same VMs running in both locations, as each ESX box's HA kicks in, and each ESX box can still access its shared storage because your SAN is resilient?

I'm sure I'm overlooking something obvious here, as right now I can only think about it vs. actually doing it.
teledata
Respected Contributor

Re: Virtual Managers/Failover not working, why?

A split-brain scenario is prevented by requiring a majority of storage managers to maintain quorum.

In a perfect world you would have a 3rd site (that is connected to the first 2 sites) that hosts your failover manager.
If that configuration is not possible, you run your failover manager at the primary site - the one you want to stay up if the site link goes down.

If you have 4 nodes (2 at each site) and run the failover manager at one site, only the site that has access to 3 managers (its 2 local nodes, plus the failover manager) will have quorum (and thus access to storage) in the event of a link failure.
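The majority rule described above is just arithmetic, and it's easy to sketch (a hypothetical illustration, not LeftHand/StoreVirtual code):

```python
def has_quorum(reachable_managers: int, total_managers: int) -> bool:
    """A site keeps quorum only if it can reach a strict majority of managers."""
    return reachable_managers > total_managers // 2

# 4 storage nodes + 1 FOM = 5 managers total.
# After the inter-site link fails:
site_a = has_quorum(3, 5)  # 2 local nodes + FOM reachable -> True, volumes stay online
site_b = has_quorum(2, 5)  # 2 local nodes only            -> False, volumes go offline
```

This also shows why an odd manager count matters: with only the 4 node managers (2 per site), neither side can reach a strict majority after a link failure, so both sites lose quorum.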
http://www.tdonline.com
Gauche
Trusted Contributor

Re: Virtual Managers/Failover not working, why?

Paul Hutchings
Super Advisor

Re: Virtual Managers/Failover not working, why?

Thanks, I'd actually tested failover using the FOM in this exact scenario, and I clearly wasn't thinking when I posted the question - only the site that has quorum will have the cluster IP, so you can't actually end up with two sites in "split brain".