BladeSystem Virtual Connect

Splitting 4-Enclosure VC Domain into 4 VC Domains (Inter&Intra-enclosure vMotion problems)



Our business requirements and risk assessment are driving us to split our 4-enclosure VC domain into separate, single-enclosure VC domains. We don't mind the additional management overhead, and we welcome the reduced risk of a catastrophic VC domain failure (which we came close to experiencing a few years ago during a VC firmware upgrade).


We have a pair of demo enclosures (courtesy of our HP sales rep), and for the external "stacking" backbone we purchased a pair of HP 5820X L3 switches with 24x 10Gb interfaces. The only traffic we intend to push through these switches is vMotion traffic and maybe VMware FT traffic (if we ever decide to enable it). All other LAN traffic goes straight to our core network switches, and SAN traffic goes through our FC interconnects.


Our initial testing looked promising. We went through the motions of removing one of the enclosures from the two-enclosure test VC domain. On the separated enclosure, we created a new domain with new vNets, new uplink sets, new server profiles, and so on, all generally mirrored from the former master enclosure, with different addresses of course.

We added a new shared uplink set, put the vMotion network VLAN on it, and plugged that interface (X4 on each pair of Flex-10 interconnects) into the HP 5820Xs.


During the first phase of testing we fired up ESX hosts on both enclosures. Both hosts are cluster members in ESX and share similar configurations. We initiated a vMotion of a test VM from one host to the other and, zip zap, it succeeded without a single lost ping. The sysadmin war room erupted in cheers, handshakes, and cigar lighting. Then I suggested testing it again, but with a simulated failure of one of the HP 5820X switches. So I shut down the interfaces on one of the switches and we tried the vMotion again... the task just sat there until it timed out with an error.

THE FIX: After playing around with vNIC settings in ESX, we decided to enable Beacon-Probing on the vMotion network in ESX. Boom... this allowed ESX to fail over automatically between the available paths. The vMotion worked flawlessly. Even when I shut down a path mid-task, the vMotion completed after a short delay using the other available path.


The second phase of testing involved doing a vMotion within the same enclosure... THIS is where we noticed the second issue. The task itself worked, but it was extremely slow compared to moving a VM enclosure-to-enclosure. To give you an example, it took approx. 25 seconds to vMotion a VM between enclosures versus over 2 minutes for the same task with the same VM between hosts in the same enclosure.

THE FIX: Turn off Beacon-Probing on the vMotion network in ESX... This returned vMotion tasks to normal speeds, but brought us back to the original problem that we discovered during the first phase of testing...


So, how can we have our cake and eat it too?


There must be a correct way of configuring this kind of setup. HP supports multiple VC domains, so I would assume they considered designs where customers need to live-migrate between VC domains.


If you think you know the answer but need more detail, please let me know and I'll diagram the heck out of things for you.




Re: Splitting 4-Enclosure VC Domain into 4 VC Domains (Inter&Intra-enclosure vMotion problems)

There is a setting called Smartlink in the VC Network settings that makes VC Mgr propagate a failure on an uplink (to your external switch) down to the downlinks (to your ESXi blades), so that vSphere knows to use an alternate path.  Can you check whether you have turned that on for the vMotion network that connects the enclosures?


This works in conjunction with 2 VC Networks in 2 uplink sets that are used for vMotion.  Each uplink set would connect to a different switch, so that if the link or switch failed, that path failure would be communicated through Smartlink to the ESXi server and it would use the other path.
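
To make the idea concrete, here is a rough toy model in Python. It is only a sketch and not anything from HP or VMware: the class, the `smartlink` flag, and the function names are all made up for illustration, and it assumes each vMotion VC Network has a single uplink and that the ESXi host is using plain link-status failure detection (no beaconing). It shows why the host keeps using a dead path when the downlink stays up, and why dropping the downlink lets it fail over.

```python
from dataclasses import dataclass


@dataclass
class Path:
    """One vMotion path: blade NIC -> VC Network -> one uplink -> one external switch."""
    name: str
    uplink_up: bool = True    # state of the VC uplink toward the external switch
    smartlink: bool = False   # hypothetical flag standing in for the Smartlink setting

    def downlink_up(self) -> bool:
        # Link state as seen by the blade's NIC.
        # Assumption: without Smartlink the downlink stays up even when the uplink is down.
        if self.smartlink:
            return self.uplink_up
        return True

    def reaches_other_enclosure(self) -> bool:
        # Traffic only leaves the enclosure if the uplink is really up.
        return self.uplink_up


def pick_path(paths):
    """Link-status failover: use the first path whose downlink reports link up."""
    for path in paths:
        if path.downlink_up():
            return path
    return None


def inter_enclosure_vmotion_ok(paths) -> bool:
    chosen = pick_path(paths)
    return chosen is not None and chosen.reaches_other_enclosure()


# Switch A has failed, so the uplink on path A is down in both scenarios.
without_smartlink = [Path("A", uplink_up=False), Path("B")]
with_smartlink = [Path("A", uplink_up=False, smartlink=True),
                  Path("B", smartlink=True)]

print(inter_enclosure_vmotion_ok(without_smartlink))  # host still picks dead path A
print(inter_enclosure_vmotion_ok(with_smartlink))     # host fails over to path B
```

In this model the host never needs beacon probes to notice the failure, which is the point of the suggestion above: the failover decision is driven by link state alone.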


You might want to check the HP Virtual Connect FlexFabric Cookbook.