Storevirtual Multi-Site VSA badly configured?

JorisS
Occasional Advisor

Dear,

A customer has what is probably a badly designed SAN network.

Basically, they have two buildings. Each building has one StoreVirtual 4330, two SAN switches, and one ESXi server. The buildings are connected to each other through the SAN switches over four fibers. The StoreVirtuals are set up in a Multi-Site configuration so that the ESXi servers have shared storage and high availability.

The building 1 room is set up as the Multi-Site primary and runs the majority of the VMs. Its 4330 has one local manager.

The building 2 room is set up as the Multi-Site "secondary" and runs non-critical VMs, but it also has a FOM on a separate physical computer in the same room. Its 4330 has one local manager.

So far we have seen that this system is quite redundant. After a full power failure in building 1, building 2 stays fully operational and VMware HA even triggers successfully. When building 2 goes down, the same happens: every affected VM boots successfully in the other building. So far everything is perfect, and this was verified multiple times.

But when I started reading the HP Multi-Site manuals, it turned out that this solution is not on the 'recommended' list. The recommendation is to place the FOM in the primary (first) building. In that scenario, however, a failure of the first building would require manual intervention in the secondary room to reactivate the 4330 there. So we seem to have something better now, because no intervention was required. Does that make sense?

The doom scenario that comes to mind is this: what if only the network between the two buildings goes down and we end up in a split-brain situation? Will both 4330s become active? I assume the primary will remain active, and the secondary will also remain active because the FOM is in that building. What happens when the network comes back online? There is only one VSI for storage. Will we get total data corruption? Will building 1 become the 'primary' again and overwrite any changes made in the second room? The chances of a network outage are low because all switches, UPSes, and fibers are hot redundant. But what would happen if someone snapped all four fibers?

And more importantly: how do we get from this situation to a 'supported' one that can withstand a network failure, with full active redundancy and no need for manual intervention? I cannot find this in the documentation.

Best Regards,

Joris

3 REPLIES
oikjn
Honored Contributor

Re: Storevirtual Multi-Site VSA badly configured?

It's the classic 3rd-site problem, but I think you are making things more complex than needed.

Are you worried that the fiber you have set up for the SAN is going to be oversubscribed? With only two nodes, I see NO reason you should even bother making this a multi-site setup at all. The latency between the two nodes is going to be minimal, and with only two nodes the only benefit you get is keeping extra traffic off the fiber. I would be surprised if that were an issue with the setup you described.

I would suggest looking at converting it to a standard cluster, simply to let both nodes take better advantage of MPIO scaling. If you had 4+ nodes, a multi-site setup could make sense, but here you won't gain any redundancy, since a two-node cluster is effectively NR1 and not NR10 anyway. You don't have to worry about making sure a copy of the data is at both sites: with only two copies and two nodes, they are going to be at both sites.

 

As for the FOM, it is there specifically to prevent that split-brain situation. If a 3rd site is not possible, I would suggest you keep the FOM at the primary side, since you would care more about that site staying online in the event of a network link failure. The alternative is a 3rd location for the FOM. It doesn't need the bandwidth of the storage nodes, so you can use any modest link to get it connected. If your company policy allows it and you have a non-SAN switch on which you can configure a VLAN to reach a 3rd location, that could work and provide true automatic failover to the backup building when the primary building goes down for real.

 

 

JorisS
Occasional Advisor

Re: Storevirtual Multi-Site VSA badly configured?

Hi,

 

Thank you for your reply. We have no performance issues, so we will keep it a multi-site config. I think we fully understand how the system will react with the FOM in the secondary location: it can withstand any full server room outage, but it cannot withstand a network outage, because that will cause a split-brain scenario. Is that correct?

To avoid the split-brain scenario, we should move the FOM to the primary location (configured as multi-site primary). The disadvantage is that we lose automatic failover after a full server room outage: when server room 1 crashes, server room 2 needs to be activated manually.

So we need to choose between

- a network outage will cause split brain (FOM in secondary room)

- a primary room outage will require manual intervention (FOM in primary room)

 

The problem is that I don't know what the consequences of a split-brain scenario would be (at the SAN level). That is my main and only remaining question. As far as I can tell, this is what will happen if the network between the two locations breaks:

- Both SANs will go active and both rooms will stay operational, but they cannot see each other

- When the network comes back online... what will happen? Total data corruption on all SANs? Will the primary room overwrite all changes that happened in the secondary room in the meantime? I would be delighted to get an answer to that question.

 

oikjn
Honored Contributor

Re: Storevirtual Multi-Site VSA badly configured?

The #1 priority of the SAN is to protect the data, so any time there is a potential "split brain", the system will stop serving data. This means that any node that cannot verify that it is part of a majority (more than half) of the voting managers in the SAN will stop serving data, to ensure that it is not acting as part of a split-brain situation. If you have three voting managers and two are in the 2nd site, then if the two sites lose communication, the 1st site will shut down and the 2nd site will continue to operate as normal (well, normal with a missing node). If you have zero other options, you should place the FOM at the 1st site instead of the 2nd to ensure that, if there is a communication problem, the 2nd site is the one that fails.
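
To make the majority rule concrete, here is a minimal Python sketch. It is not StoreVirtual code: the function names and site labels are made up for illustration, and it only counts the managers each side can still reach after the inter-site link breaks.

```python
def has_quorum(reachable_managers: int, total_managers: int) -> bool:
    """True when a strict majority of all configured managers is reachable."""
    return reachable_managers > total_managers // 2


def sites_still_serving(managers_per_site: dict) -> list:
    """After a full inter-site link failure, each site only sees its own managers.

    Returns the sites that still hold a majority and would keep serving data.
    """
    total = sum(managers_per_site.values())
    return [site for site, count in managers_per_site.items()
            if has_quorum(count, total)]


# Layout from this thread: one manager per 4330, FOM in the 2nd site.
print(sites_still_serving({"site1": 1, "site2": 2}))
# Hypothetical layout without a FOM: neither side has a majority.
print(sites_still_serving({"site1": 1, "site2": 1}))
```

In this model at most one side can ever hold more than half of the managers, which is why a link failure takes one site offline rather than leaving both active.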

 

I would look hard at finding a 3rd site that the two other sites can talk to, even if it's over a VPN to another location, in the cloud, or in the closet of a 3rd building. The practical FOM bandwidth and latency requirements are much lower than those of the real nodes, and keeping the FOM at a 3rd site will make sure that you can have true automated failover between the two sites. Without that, and with only two real nodes, I would suggest you consider using the virtual manager instead of the FOM, as it will be easier to manually restart the system in that situation than it would be if your FOM is suddenly gone.
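
As a rough illustration of why the FOM location matters when a whole building is lost, the sketch below applies the same majority rule to three possible FOM placements. The manager names, site labels, and helper functions are hypothetical, and real behavior depends on the product and on how the 3rd site is connected.

```python
def has_quorum(manager_sites: dict, reachable_sites: set) -> bool:
    """True when managers at the reachable sites form a strict majority.

    manager_sites maps a manager name to the site it runs at.
    """
    total = len(manager_sites)
    reachable = sum(1 for site in manager_sites.values()
                    if site in reachable_sites)
    return reachable > total // 2


def layout(fom_site: str) -> dict:
    # One manager per storage node plus the FOM (illustrative names).
    return {"node1": "site1", "node2": "site2", "fom": fom_site}


def status(ok: bool) -> str:
    return "online" if ok else "offline"


for fom_site in ("site1", "site2", "site3"):
    managers = layout(fom_site)
    # When a building is down, only the other locations remain reachable.
    after_site1_loss = has_quorum(managers, {"site2", "site3"})
    after_site2_loss = has_quorum(managers, {"site1", "site3"})
    print(f"FOM at {fom_site}: "
          f"site1 down -> {status(after_site1_loss)}, "
          f"site2 down -> {status(after_site2_loss)}")
```

Under these assumptions, only the 3rd-site placement keeps a majority whichever building is lost. With the FOM in one of the two buildings, losing that building leaves the survivor with one manager out of three.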

 

In your case, where you are on the same campus with plenty of bandwidth, likely little latency between the two "sites", and only two physical nodes, I would suggest moving to a single-site setup so that servers can read data from both nodes at the same time. This will likely also slightly increase your write IO as you reduce the read IO on each individual node. The added latency and bandwidth in your specific situation, with only two nodes so close together, is so minor as to be unimportant. IMO, if you had four nodes, a multi-site setup would make sense, but not for just two nodes.