Re: Weird Hyper-V Error - Connectivity Lost

MichaelClark · 11-24-2011

Hi,

We have the problem described below. Any assistance would be much appreciated.

Current Configuration:

- See attached Detailed Design document

- The HP c7000 blade chassis is connected to a Cisco 3750 stack (of 3 switches) using 2 x 1Gb shared uplinks for each VC

- Each VC is connected to a separate EtherChannel group on the 3750 stack

- Each Hyper-V host has 3 network teams (Management, Production (VMs only), Live Migration) configured with Network Fault Tolerance

Issue:

During the tests listed below, we experience occasional dropouts of INBOUND connections to certain guests. The connection re-establishes after an arbitrary time, but if we initiate an OUTBOUND connection via the VMRC console (which still works), INBOUND connections start working again.

- Live Migrate guests from one Hyper-V host to another

- Network cable disconnect tests between the Cisco stack and the VCs

We have checked the following, and believe they have been ruled out:

- Driver / Firmware versions

Virtual Connect Manager 3.30 firmware
HP NC551i Firmware must be updated to 4.0.360.15
Windows Network Driver must be updated to 4.0.360.8
Windows FCoE Driver must be updated to 2.50.007
NCU 10.45.0

Hongjun Ma · 11-28-2011

Hi Mike,

When you say that you have an incoming connectivity issue, are you referring to the traffic coming from the 3750 side into VC? When you have the problem, what do you see if you run

show mac-address <mac address of the destination>

on the 3750? Does the 3750 know the destination is on the VC side? If so, you can also SSH into VC and run

show interconnect-mac-table enc0:1 "mac address"=<mac address of the destination>

What do you see there? enc0:1 means the first VC module. Are you using an active/active or an active/standby uplink design? You may need to use enc0:2 if you know the traffic is using the uplink on VC module 2.

You should see something like this. "d1" means physical 10G server downlink 1 to blade 1.

->show interconnect-mac-table enc0:1 "mac address"=12:12:12:12:12:13
=======================================================
Port   MAC Address        Type     Internal ID LAG ID
=======================================================
d1     12:12:12:12:12:13 Learned 4            -- --

Also, run the above commands again for comparison after you fix the problem by initiating an outbound connection.
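To make that before/after comparison less error-prone, the pasted CLI output can be turned into records and compared programmatically. The following is a minimal Python sketch, assuming the column layout shown in the sample above (Port, MAC Address, Type, Internal ID, LAG ID). `MacEntry` and `parse_mac_table` are hypothetical names for illustration, not part of any HP tool.

```python
import re
from typing import NamedTuple, Optional

class MacEntry(NamedTuple):
    port: str              # e.g. "d1" (server downlink) or "(lag)"
    mac: str               # normalised to upper case
    kind: str              # e.g. "Learned"
    internal_id: int
    lag_id: Optional[int]  # None when the CLI prints "-- --"

# One data row: port, MAC, type, internal ID, then the LAG ID column.
ROW = re.compile(
    r"^(\S+)\s+((?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2})\s+(\S+)\s+(\d+)\s+(.*)$"
)

def parse_mac_table(text: str) -> list[MacEntry]:
    """Parse pasted 'show interconnect-mac-table' output into entries.

    Assumes the layout shown in this thread; prompt, header and
    separator lines do not match ROW and are skipped.
    """
    entries = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if not m:
            continue
        port, mac, kind, internal_id, lag = m.groups()
        lag = lag.strip()
        entries.append(MacEntry(port, mac.upper(), kind, int(internal_id),
                                int(lag) if lag.isdigit() else None))
    return entries

sample = """\
->show interconnect-mac-table enc0:1 "mac address"=12:12:12:12:12:13
=======================================================
Port MAC Address Type Internal ID LAG ID
=======================================================
d1 12:12:12:12:12:13 Learned 4 -- --
"""
print(parse_mac_table(sample))
```

Parsing the output captured while the problem is present and again after it clears gives two lists that can be compared directly, for example with `set(before) ^ set(after)`.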

Thanks

Hongjun

My VC blog: http://hongjunma.wordpress.com

MichaelClark · 11-28-2011

Hi Hongjun,

Yes, I am referring to traffic coming from other locations (all of which are on the 3750 stack, which is their network core).

Interestingly, when the problem is occurring, the "show interconnect-mac-table" command returns a different result set.

WHEN WORKING it looks like this (we are disconnecting the 2 Cisco cables from each VC module in turn)

->show interconnect-mac-table enc0:1 "mac address"=00:15:5D:06:65:0F
========================================================
Port   MAC Address        Type     Internal ID LAG ID
========================================================
(lag) 00:15:5D:06:65:0F Learned 8            27
d5     00:15:5D:06:65:0F Learned 5            -- --

->show interconnect-mac-table enc0:2 "mac address"=00:15:5D:06:65:0F
========================================================
Port   MAC Address        Type     Internal ID LAG ID
========================================================
d5     00:15:5D:06:65:0F Learned 8            -- --
(lag) 00:15:5D:06:65:0F Learned 5            27

WHEN the GUEST is not pingable (i.e. it is faulting) it looks like

->show interconnect-mac-table enc0:2 "mac address"=00:15:5D:06:65:0F
========================================================
Port   MAC Address        Type     Internal ID LAG ID
========================================================
d5     00:15:5D:06:65:0F Learned 8            -- --
(lag) 00:15:5D:06:65:0F Learned 5            27

->show interconnect-mac-table enc0:1 "mac address"=00:15:5D:06:65:0F
========================================================
Port   MAC Address        Type     Internal ID LAG ID
========================================================
(lag) 00:15:5D:06:65:0F Learned 8            27
(lag) 00:15:5D:06:65:0F Learned 5            26

Once the guest starts pinging again, the above result goes back to the "WHEN WORKING" result.

Can you advise whether this indicates an issue with the FlexFabric modules or their configuration?

Hongjun Ma · 11-28-2011

Hi Mike,

Interesting. Could you confirm whether you are using a VC active/active or active/standby uplink design? From the output, I guess it's an active/active design, which means the uplinks on both VC modules are active.

I noticed that when it's working, VC1 has the MAC learned from the server downlink, and when it's not working, VC1 doesn't have the MAC learned from the server downlink.

Another interesting thing is that when it's working, you actually have the same MAC learned from a downlink port on both the VC1 and VC2 modules. If you are using NFT teaming on Windows, the MAC should be consistently learned from one side, either VC1 or VC2.

If you have the same MAC learned from two links, the VC and the 3750 may sometimes get confused because of MAC address flapping.

So could you check your teaming configuration (is it NFT with preference order, where you set one NIC as primary)? If that's the case, you can map this primary NIC's MAC to a specific VC module. As I said, either VC1 or VC2, but not both.
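The check described above (a given MAC should be learned from a server downlink on one module only) can be sketched in Python. The port lists are transcribed from the outputs posted earlier in this thread. The function name `downlink_modules` and the rule that a downlink port name is "d" followed by digits are assumptions made for illustration.

```python
def downlink_modules(tables: dict[str, list[str]]) -> list[str]:
    """Return the modules that learned the MAC on a server downlink.

    `tables` maps a module (e.g. "enc0:1") to the Port column values
    printed for one MAC. Assumption: downlink ports look like "d5";
    everything else (e.g. "(lag)") is an uplink or stacking LAG.
    """
    return sorted(
        module
        for module, ports in tables.items()
        if any(p.startswith("d") and p[1:].isdigit() for p in ports)
    )

# Port columns for 00:15:5D:06:65:0F as posted in the thread.
working = {"enc0:1": ["(lag)", "d5"], "enc0:2": ["d5", "(lag)"]}
faulting = {"enc0:1": ["(lag)", "(lag)"], "enc0:2": ["d5", "(lag)"]}

for label, tables in (("working", working), ("faulting", faulting)):
    mods = downlink_modules(tables)
    note = "learned on more than one module" if len(mods) > 1 else "single module"
    print(f"{label}: downlink on {mods} ({note})")
```

Running the same check on each capture taken during a failover test shows at a glance when the downlink entry moves or disappears from a module.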

Hongjun

My VC blog: http://hongjunma.wordpress.com

MichaelClark · 11-29-2011

We have an active/active setup, with 2 shared uplink sets (one per VC).

Maybe we should amalgamate the shared uplink sets into one and see what that does to the problem?

We have NFT NIC Teaming (with no preference set).

How do we map the primary NIC's MAC to a specific VC module? I don't know how to make this setting. Are you referring to setting a preference order, or to something at a lower level?

Hongjun Ma · 11-29-2011

Some things we can try to isolate the problem:

1) Try disabling one NIC (say, the standby NIC) in the team and monitor your incoming connectivity and the MAC tables on VC and the 3750. Do you still see the problem? If not, does the MAC table on VC always show only one VC module learning the address from its downlink? That should be the case.

2) If the above works, then we know the original issue is MAC-related. We can then try setting NFT with preference order, so that one NIC is always primary, and see if the problem remains.

3) If you still see the problem with 2 NICs in the team, you can see which NIC is mapped to which VC module in the HP teaming utility: open the properties of the logical teamed NIC, where you should see the two physical NICs. If you scroll to the right, you should see which NIC is mapped to which VC module.

Again, the whole point is that you shouldn't learn the same MAC from both NICs. That can potentially cause issues.

My VC blog: http://hongjunma.wordpress.com

MichaelClark · 11-30-2011

1. With one connection disabled we still see the same problem.

2. With the teaming set to NFT with Preference Order we still see the same problem.

I have also tried dropping the links to the Cisco stack to a single cable on both sides, and we still see the problem.

The only time we don't see the problem is if we maintain a ping OUTWARDS from the guests.

There are two scenarios where we see the problem. The first is when we drop links (emulating switch failures/disconnects). The second is after a Hyper-V Live Migrate (possibly less common, but that testing takes much longer, so we have not done a lot in this area).

When we trigger it by DROPPING the links, there are two times it may occur. The first is immediately when we drop a connection (or pair of connections). The second is about 20 seconds after we reconnect the link or links.

Cheers Michael

Hongjun Ma · 11-30-2011

OK. I missed the part that the problem only happens after you do a failover test or a migration.

I'd like to check 2 things. I think you have already enabled them, but I just want to make sure:

1) Under the shared uplink set, all vNets have "SmartLink" enabled.

2) Under the domain-wide Ethernet settings, there is an Advanced tab with an option called Fast MAC Cache Failover. It is enabled and set to the default value.

Let's focus on the uplink failover scenario. Can you still reproduce this state? If so, could you tell me exactly how you get to it? You have 4 cables (2 LACP bundles) between VC and the 3750 stack: VC1 port 5/6 and VC2 port 5/6. Which cable did you disconnect to see the problem and get the MAC state shown below? Also, could you give me the show mac-address output on the 3750 stack for 00:15:5D:06:65:0F and let me know which port-channel it's pointing to? Finally, I just want to make sure you have "spanning-tree portfast" or portfast trunk enabled on the ports connecting to VC. That's best practice for faster convergence.

WHEN the GUEST is not pingable (i.e. it is faulting) it looks like

->show interconnect-mac-table enc0:2 "mac address"=00:15:5D:06:65:0F
========================================================
Port   MAC Address        Type     Internal ID LAG ID
========================================================
d5     00:15:5D:06:65:0F Learned 8            -- --
(lag) 00:15:5D:06:65:0F Learned 5            27

->show interconnect-mac-table enc0:1 "mac address"=00:15:5D:06:65:0F
========================================================
Port   MAC Address        Type     Internal ID LAG ID
========================================================
(lag) 00:15:5D:06:65:0F Learned 8            27
(lag) 00:15:5D:06:65:0F Learned 5            26

My VC blog: http://hongjunma.wordpress.com

MichaelClark · 11-30-2011

Yes, we have SmartLink enabled on all networks.

Yes, Fast MAC Cache Failover is enabled with a 5 sec value (we have also tried 1 sec, but it made no difference).

Our normal way to cause the problem is to shut down 2 ports on the Cisco (the pair that connects to one of the Virtual Connects). We then "no shutdown" them and shut down the other pair, then "no shutdown" those and repeat the cycle. We normally have to run around 5-10 shutdowns and no shutdowns (with pings to 6 hosts and 4 guests) to see the problem occur.

The problem still occurs with only 1 port connected on either side, using the same shutdown method. We have not yet tried without LACP enabled on the trunk (we want to try this but need to find a test window).

We also see the problem on approx. 1 in 15 Live Migrations of guests. We don't do most of our testing this way because live migrations take a long time to complete (e.g. a guest with 24 GB RAM can take 10 minutes).

We have also removed the HP Teaming from the GUEST VM network and done live migrations, and we can still reproduce the problem.
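Because reproducing the issue takes several shutdown/no-shutdown cycles with pings running to multiple hosts and guests, it can help to log every ping result and summarise the dropouts afterwards. Below is a minimal Python sketch, assuming you already have time-ordered (timestamp in seconds, reply received) samples per target from whatever ping tool you use. `outage_windows` is a hypothetical helper and the sample data is made up.

```python
def outage_windows(samples: list[tuple[float, bool]]) -> list[tuple[float, float]]:
    """Collapse ping samples into (first_failure, first_success_after) windows.

    `samples` are (timestamp_seconds, reply_received) pairs in time order.
    An outage still open at the end of the log is closed at the last
    timestamp seen.
    """
    windows = []
    start = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts                      # outage begins
        elif ok and start is not None:
            windows.append((start, ts))     # outage ends
            start = None
    if start is not None:
        windows.append((start, samples[-1][0]))
    return windows

# Made-up sample data: one ping per second, replies lost at t=3, 4 and 5.
log = [(0, True), (1, True), (2, True), (3, False), (4, False),
       (5, False), (6, True), (7, True)]
for begin, end in outage_windows(log):
    print(f"no reply from t={begin}s until t={end}s ({end - begin}s)")
```

Lining these windows up against the times of each shutdown and no shutdown should show whether a dropout starts immediately on link loss or some seconds after the link returns.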

Hongjun Ma · 11-30-2011

OK, got it regarding how to reproduce the problem.

So in this state, do you remember which uplink set is down on the VC side? The SUS on VC1 or on VC2?

Also, assuming this is the state when the VC1 SUS is down, the traffic should go in and out of the c7000 by one path: 3750 (po2) <---> VC2 <---> NIC2 server

Assuming, for example, that the 3750 EtherChannel number is 2: when things are not working, could you verify whether show mac-address <MAC> on the 3750 points to EtherChannel 2, and whether NIC1 shows as down in the server NIC teaming (because of SmartLink)?

Also, you can use the VC CLI command "show uplinkset <SUS name>" to see which SUS is using LAG ID 26. I guess your internal VC stacking link is using LAG ID 27 and your VC1 uplink SUS is using 26.

So far, I don't think this problem is related to LACP.

WHEN the GUEST is not pingable (i.e. it is faulting) it looks like

->show interconnect-mac-table enc0:2 "mac address"=00:15:5D:06:65:0F
========================================================
Port   MAC Address        Type     Internal ID LAG ID
========================================================
d5     00:15:5D:06:65:0F Learned 8            -- --
(lag) 00:15:5D:06:65:0F Learned 5            27

->show interconnect-mac-table enc0:1 "mac address"=00:15:5D:06:65:0F
========================================================
Port   MAC Address        Type     Internal ID LAG ID
========================================================
(lag) 00:15:5D:06:65:0F Learned 8            27
(lag) 00:15:5D:06:65:0F Learned 5            26

My VC blog: http://hongjunma.wordpress.com
