StoreVirtual Storage

VSA Multisite and VMWare FT

 
SOLVED
dajma00
Occasional Advisor

VSA Multisite and VMWare FT

Is it possible to have the following scenario:

 

Two-site VSA.  Let's say site1 and site2.

Site1 is primary and has the FOM running.

Running a VM in Fault Tolerant (FT) mode.

 

If we lose site1, will the FT machine continue to run from site2?  My understanding is that it will not, because the FOM is lost along with site1, so site2 loses quorum as well and the FT secondary that was running on site2 cannot keep running.  Is that correct?

10 REPLIES
5y53ng
Regular Advisor

Re: VSA Multisite and VMWare FT

Correct. There would be a loss of quorum in this scenario, so the ESX(i) hosts would be disconnected from the SAN. This condition would persist until you brought site 1 back online.

oikjn
Honored Contributor

Re: VSA Multisite and VMWare FT

Yeah, you would need manual intervention there to deal with quorum unless you have the FOM hosted at a third site.

 

I haven't priced it out, but it shouldn't be too expensive to have some hosting provider run a FOM for you since the storage/CPU/network requirements are all very minimal... if you need dual live sites you can probably shell out the coin for a hosted 3rd location.  Our company couldn't, so we went with two separate management groups using async replication instead.

dajma00
Occasional Advisor

Re: VSA Multisite and VMWare FT

Ok, let's say we have the FOM running on a third site. What happens if the FOM dies? Will the primary site continue?

Also, if only the connectivity between FOM and primary site is lost but FOM to secondary site connection is still up, does the secondary site now become active?
oikjn
Honored Contributor

Re: VSA Multisite and VMWare FT

The third site for the FOM is ideal.  The FOM is just another vote.  You need to have a majority of managers in contact in order to have quorum.

 

If you have two nodes each at two sites, that is four votes, but you need a total of three for quorum, so what happens when one site with two nodes goes down?  You lose quorum.  The fix is to add a FOM, but if you put that FOM at the same site as two of the other nodes, you end up with only two of five votes left when the site with the FOM goes down, so you lose quorum anyway.
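To make the vote counting above concrete, here is a minimal sketch. It assumes a plain "strict majority of all configured managers" rule, which is how the posts describe quorum; the function names and site labels are made up for illustration and are not part of any HP tool.

```python
def quorum_needed(total_managers: int) -> int:
    """Smallest number of managers that forms a strict majority."""
    return total_managers // 2 + 1


def survives_site_loss(votes_per_site: dict, failed_site: str) -> bool:
    """True if the remaining sites still hold a majority of all votes."""
    total = sum(votes_per_site.values())
    remaining = total - votes_per_site[failed_site]
    return remaining >= quorum_needed(total)


# Two nodes per site, no FOM: 4 votes, 3 needed, only 2 left after a site loss.
no_fom = {"site1": 2, "site2": 2}

# FOM co-located with site1: site1 holds 3 of 5 votes.
fom_at_site1 = {"site1": 3, "site2": 2}

# FOM at a third site: any single site can fail and a majority remains.
fom_at_site3 = {"site1": 2, "site2": 2, "site3": 1}

for name, layout in [("no FOM", no_fom),
                     ("FOM at site1", fom_at_site1),
                     ("FOM at site3", fom_at_site3)]:
    for site in layout:
        print(name, "lose", site, "->", survives_site_loss(layout, site))
```

Only the layout with the FOM at a third site survives the loss of any one site.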

 

Trying to do automated failover with TWO sites is very tricky.  It seems like it should be simple, but the issue of dealing with split-brain is not really possible without human intervention.

 

To answer your final question, I don't know 100% and would have to test, but I would imagine that as long as your primary and backup sites are still in communication with each other, it would not be a problem if your FOM loses contact with one of your sites.  As long as the two sites can talk to each other they can provide quorum on their own without the FOM, and I would bet the SAN/IQ software has some logic for dealing with a situation like the one you asked about.  On the flip side...  what would happen if you don't have a FOM at a 3rd site and the two sites lose communication?
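Here is a rough way to reason about the "FOM can only see one site" case. This is only a generic majority model, not SAN/IQ's actual logic: it assumes managers that can reach each other directly or through another manager count as one group, and that a group has quorum when it holds more than half of all managers. The manager names are hypothetical.

```python
from itertools import combinations


def reachable(start, links):
    """All managers reachable from `start` over undirected links."""
    seen = {start}
    stack = [start]
    while stack:
        current = stack.pop()
        for a, b in links:
            if a == current:
                neighbour = b
            elif b == current:
                neighbour = a
            else:
                continue
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen


def group_has_quorum(start, managers, links):
    """True if the group containing `start` is a strict majority."""
    return len(reachable(start, links)) > len(managers) // 2


managers = ["s1-a", "s1-b", "s2-a", "s2-b", "fom"]
storage_nodes = managers[:4]

# Both sites still see each other; the FOM only reaches site 2.
fom_sees_site2_only = set(combinations(storage_nodes, 2))
fom_sees_site2_only |= {("s2-a", "fom"), ("s2-b", "fom")}

# The FOM is completely cut off; the two sites still see each other.
fom_isolated = set(combinations(storage_nodes, 2))

print(group_has_quorum("s1-a", managers, fom_sees_site2_only))
print(group_has_quorum("s1-a", managers, fom_isolated))
print(group_has_quorum("fom", managers, fom_isolated))
```

Under these assumptions the four storage nodes keep quorum in both cases, which matches the guess above, but it is still something to confirm in a lab.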

dajma00
Occasional Advisor

Re: VSA Multisite and VMWare FT

Yes, split brain becomes a problem.

 

So we need to test what happens if the FOM link goes down to the primary, the secondary, or both.  I guess HP should have tested these scenarios somewhere... is there any document?  Or do I have to test it myself?

dajma00
Occasional Advisor

Re: VSA Multisite and VMWare FT

btw, on EMC VPLEX, once a split brain occurs, you can configure it so that the sites do not reconnect automatically. The primary continues to work and the secondary continues to work as well, but of course the secondary can be manually configured for no access from outside.
oikjn
Honored Contributor

Re: VSA Multisite and VMWare FT

The manual probably talks all about this.

 

How can VPLEX really deal with split brain?  It is impossible for a system to know if it has gone split-brain or if one complete site just failed.

 

If you have site 1 w/ two nodes and the FOM, it has three votes.  If your second site has the other two nodes (and two votes), what is the difference between losing communication between the two sites and one site suddenly going up in smoke?  The answer is NOTHING: the site with three votes will stay live while the site with two nodes will go offline because it has no quorum majority.  This is exactly what you want to happen because it ensures that you only have one live copy of data.  IF the site w/ 2 nodes realizes that it is split and still serves data, now you have two sites with two different copies of your information.
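The reason the majority rule protects you here can be checked with a few lines. This is just a sketch of the arithmetic, assuming the same strict-majority rule; the side labels "A" and "B" are placeholders.

```python
def sides_with_quorum(votes_a: int, votes_b: int) -> list:
    """Which side(s) of a two-way split hold a strict majority of all votes."""
    total = votes_a + votes_b
    return [name
            for name, votes in (("A", votes_a), ("B", votes_b))
            if votes > total // 2]


# Site 1 (two nodes + FOM) vs site 2 (two nodes): only site 1 stays live.
print(sides_with_quorum(3, 2))

# Two nodes per site and no FOM: neither side can stay live.
print(sides_with_quorum(2, 2))

# However the votes are split, at most one side can ever hold a majority,
# so two live copies of the data cannot exist at the same time.
never_both = all(len(sides_with_quorum(a, total - a)) <= 1
                 for total in range(1, 12)
                 for a in range(total + 1))
print(never_both)
```

Each side only counts the votes it can reach, so it cannot tell a dead site from a cut link, and it does not need to.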

 

 

There is just no way to have two sites live running as a single cluster and avoid every possible downtime situation.

 

As you are figuring out, there are tons of complications when trying to deal with a multi-site sync cluster.  99% of the time, it makes more sense to just use snapshot replication... just because the software has the multi-site cluster capabilities doesn't mean it makes the most sense for you to use that feature.  Oftentimes the costs associated with doing multi-site sync are just way too much...  for example, you need a connection with the uptime/bandwidth/latency to match your local bandwidth usage, which is typically $$$$$$$$.  If you don't have that connection, your local data is going to have to slow down because you have to wait for the data to get sent to the remote site and for confirmation of the write before your servers get confirmation of their write.  At that point, what is ~$50/month in hosting on Windows Azure or Amazon cloud or whatever other cloud service you choose for your third virtual site to actually set up the multi-site cluster according to HP best practices?
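A back-of-the-envelope sketch of that write penalty, for what it's worth. It assumes the local and remote copies are written in parallel and the server is acknowledged only when both are done; the numbers are purely illustrative, not measurements from any real link or array.

```python
def sync_write_latency_ms(local_commit_ms: float,
                          link_rtt_ms: float,
                          remote_commit_ms: float) -> float:
    """Latency the server sees when a write must land on both sites.

    Assumes both copies are written in parallel, so the slower path wins.
    """
    remote_path_ms = link_rtt_ms + remote_commit_ms
    return max(local_commit_ms, remote_path_ms)


# Illustrative only: a LAN-like link vs a slower inter-site link.
print(sync_write_latency_ms(1.0, 0.5, 1.0))   # fast link between sites
print(sync_write_latency_ms(1.0, 10.0, 1.0))  # slow link between sites
```

Every synchronous write pays the inter-site round trip, which is why the link quality ends up driving the cost.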

 

dajma00
Occasional Advisor
Solution

Re: VSA Multisite and VMWare FT

Right.  I understand the split brain issue.

 

But here is how VPLEX handles it:  Once the sites are split, they are independent and do not connect back without manual intervention.

 

Now this may not be acceptable for certain situations, but in our case, I guess this feature is extremely useful.

Assume that we have a database running on the primary site.  The secondary site has been configured for no client access from outside.

Once the split occurs, the primary continues without any problems.  The secondary site brings up the database in its own right, thinking it is primary.  But since clients do not have access to it and it is not talking to the main site, everything is fine.

 

Now, I agree that this scenario may not be that simple in practice, but I think this is one acceptable and workable solution in a split-brain scenario -- both sites become primary and the application configuration decides which one continues to operate.

 

Anyway, we will be testing the third site configuration for FOM and see how it goes.

 

Thanks,

Amjad.

oikjn
Honored Contributor

Re: VSA Multisite and VMWare FT

I'm not sure exactly what the practical difference is between a LUN that is down and a LUN that an initiator has no access to... either way the server connected to the LUN is gonna have a heart attack.  If that downtime is acceptable, then you should be able to get away with a virtual manager instead of a FOM and just follow the procedures to force quorum manually any time there is a failover event.  Doesn't sound ideal to me, but if you are happy with the required manual intervention on VPLEX, then manual intervention on SAN/IQ should be just as easy.  Or you could keep a standby clone of the FOM at the 2nd site and then simply turn it on when the first site goes down.  The important thing with that is to make sure you turn off the original FOM or keep it from starting up when the primary site comes back online.  Then once you get everything back online you could shut down the cloned FOM and turn the original back on.

 

Nice thing about the VSAs is that you can test out everything you are curious about for free in a VM lab.  Just spin up four VMs and a FOM, set up a new management group, assign the IPs and sites as you want, and then play around with different failover situations.  The only tricky one is FOM communication with one but not both sites, which would probably require an L3 switch or a firewall to force that situation.