
VC FlexFabric simple failover testing

 
neilburton
Advisor

VC FlexFabric simple failover testing

I'm currently working on a new VC-FF implementation and have hit an unexpected brick wall when attempting to perform basic failover testing to prove to the end customer that the solution is resilient to the failure of a VC module.

 

The configuration is simple: one enclosure with a pair of VC-FF modules in bays 1 & 2, both running firmware 3.17.

 

For the purpose of demonstrating failover I've started with a clean new domain and created a very simple configuration from scratch: one SUS with a single uplink port from each module (so active/passive), and a single network defined and presented to LOM 1a and 2a on several blades. Windows 2008 R2 is installed on the blades and the NIC 1a/2a pair is teamed for NFT with the NCU.

 

Network connectivity works just as expected with both VC-FF modules online. I can ping between blades within the VC domain and to/from hosts outside it.

 

If I power off either VC module then network connectivity breaks in all directions - I can't even ping between blades within the enclosure.  I have left it running for 10 minutes or more and connectivity is not restored until the offline VC module is powered back up.
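To put a hard number on an outage like this, one approach is to record timestamped ping results during the test and compute the longest run of consecutive failures. A minimal sketch (the helper name and sample format are my own, not an HP tool):

```python
# Compute the longest connectivity outage, in seconds, from a series of
# timestamped ping results. Each sample is (timestamp_seconds, success_bool).
# Hypothetical helper for analysing failover tests.

def longest_outage(samples):
    """Return the longest span of consecutive failed pings, in seconds."""
    longest = 0.0
    outage_start = None
    for ts, ok in samples:
        if not ok and outage_start is None:
            outage_start = ts          # outage begins
        elif ok and outage_start is not None:
            longest = max(longest, ts - outage_start)
            outage_start = None        # connectivity restored
    if outage_start is not None:       # still down at end of capture
        longest = max(longest, samples[-1][0] - outage_start)
    return longest

# Example: one ping per second, down from t=3 until restored at t=10
samples = [(t, not (3 <= t <= 9)) for t in range(15)]
print(longest_outage(samples))  # 7
```

Feeding this the ping log from the power-off test above would show whether the outage ends on its own or only when the module returns.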

 

Whilst the module is down, NCU shows the NIC ports fail over just as one would expect. If I log onto VCM on the surviving module, the SUS uplink port has also failed over as expected. The VCM alerts show the domain in a 'minor' degraded state.

 

As we are using the FlexHBAs for storage, I can also confirm that the Fibre Channel paths fail over as expected, so this only affects Ethernet.

 

I have raised a call with HP Support but would appreciate any other thoughts on this.

 

Regards, Neil

9 REPLIES
chuckk281
Trusted Contributor

Re: VC FlexFabric simple failover testing

I have asked one of our experts for advice.

Stevem
Frequent Advisor

Re: VC FlexFabric simple failover testing

Please provide a more detailed configuration of your VC network, as well as how the NICs are assigned to the networks, as something is certainly not correct. Also, with both modules up, ping the router, then fail over from the active uplink to the standby link and ping again (or run ping -t to the router); make sure you can reach the router through either uplink. Also, make sure Private Networks is NOT selected.
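The check suggested above, confirming the router is still reachable after forcing an uplink failover, can also be scripted. A rough sketch (the probe is injectable so the logic can be dry-run without a live network; the address and failover callback are placeholders):

```python
# Verify a target stays reachable across a forced uplink failover by probing
# it before and after. The probe is injectable for offline testing.
import subprocess

def ping_once(host):
    """Single ping via the OS ping command (Linux '-c 1'; use '-n 1' on Windows)."""
    return subprocess.run(["ping", "-c", "1", host],
                          capture_output=True).returncode == 0

def check_failover(target, force_failover, probe=ping_once):
    """Return (before, after): reachability before and after the failover."""
    before = probe(target)
    force_failover()          # e.g. disable the active uplink's switch port
    after = probe(target)
    return before, after

# Dry run with a stubbed probe instead of a real router:
result = check_failover("192.0.2.1", lambda: None, probe=lambda host: True)
print(result)  # (True, True) -> traffic passes through both uplinks
```

A `(True, False)` result would indicate traffic does not pass through the standby uplink after failover, which is the symptom described in this thread.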

 

Any additional info would be helpful.

 

Steve....

Stevem
Frequent Advisor

Re: VC FlexFabric simple failover testing

Also, screen shots would be helpful.
neilburton
Advisor

Re: VC FlexFabric simple failover testing

Guys, thanks for the replies

 

First of all I should say that the FlexFabric environment concerned is on a customer site and I'm not there at the moment; however, I can ask the guys down there to run tests.

 

I have a similar type of config (with respect to SUS / network assignment) running on my own Flex-10 kit, and the same failover tests work fine for me. For example, if I perform a module reset in OA:

 

1) on my (Flex-10) kit I lose no more than one or two pings to blades while failover takes place, if I reset the module carrying the active SUS uplink. If I reset the module carrying the standby SUS uplink, I see no pings lost at all

 

2) on the customer's (FF) kit, if I reset either module, all pings are dropped for approximately 70 seconds until the reset module becomes available again
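To capture a window like that ~70 seconds precisely, a timestamped probe loop can log one result per interval during the module reset. A sketch with injectable clock/sleep/probe functions (all illustrative) so it can be dry-run:

```python
# Record one timestamped connectivity sample per interval while a VC module
# is reset, so the outage window can be read directly from the log.
import time

def sample_connectivity(probe, duration_s, interval_s=1.0,
                        clock=time.monotonic, sleep=time.sleep):
    """Collect (elapsed_seconds, reachable) samples for duration_s seconds."""
    samples = []
    start = clock()
    while (elapsed := clock() - start) < duration_s:
        samples.append((round(elapsed, 1), probe()))
        sleep(interval_s)
    return samples

# Dry run with a fake clock and probe: "module down" between t=2 and t=5
t = {"now": 0.0}
fake_clock = lambda: t["now"]
fake_sleep = lambda s: t.__setitem__("now", t["now"] + s)
down_between_2_and_5 = lambda: not (2 <= t["now"] < 5)
print(sample_connectivity(down_between_2_and_5, 8,
                          clock=fake_clock, sleep=fake_sleep))
```

Run for real (with the default clock, sleep, and a real ping probe), the log would show exactly when connectivity drops and returns relative to the reset.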

 

The only difference between my deployment and the customer's (other than the fact that it's Flex-10 vs FF) is that I am pinging ESXi hosts and the customer is pinging Windows 2008 R2 hosts.

 

I have asked them to build some ESXi blades in the meantime so we can determine whether this is host-OS specific - if it is, that points towards an issue with the Windows / HP NIC teaming driver rather than VC itself.

 

I asked the customer to perform a NIC failover by changing the NCU NFT preference yesterday, and although a handful of pings were dropped in the process (more than I would expect), traffic could be made to pass through both modules. We could also fail the SUS uplinks between the modules.

 

Interestingly if we power off both VC modules and then bring one of them online, everything works normally.  It's only when we are running with two modules - and we remove one of them - that the whole thing crashes.

 

To confirm the NIC / network configuration: we have one shared uplink set with active/standby links comprising a 4x10Gbps LACP port group from each module. We have defined one Ethernet network, associated it with the SUS, and created a simple server profile with 2x Ethernet ports and 2x FCoE ports. The Ethernet ports (1a / 2a) are associated with the single Ethernet network.

 

As stated, connectivity works exactly as expected both between blades and in/out of the VC domain (pinging to/from the gateway, for example), and we can fail over server NICs and SUS uplinks to prove that both modules are passing traffic. The minute we drop a module, ALL communication fails. The same test on my separate Flex-10 environment results in seamless failover.

neilburton
Advisor

Re: VC FlexFabric simple failover testing

Steve - Private Networks is definitely not selected. Also, a SUS uplink failover (invoked by disabling upstream switch ports) is seamless - traffic passes over both links and only one ping is lost during uplink failover.
neilburton
Advisor

Re: VC FlexFabric simple failover testing

I have also raised the matter with HP Support but the incident is still being escalated at the moment - I don't appear to have got through to anyone who really understands the problem.

neilburton
Advisor

Re: VC FlexFabric simple failover testing

OK, interesting update: this connectivity failure is only affecting Windows blades.

 

We've done exactly the same failover test on a few blades running ESXi and failover works as expected.

 

So it must be an issue affecting the Windows FlexNIC driver / Network Configuration Utility rather than a Virtual Connect problem.

 

Both the NIC driver and teaming components were installed from the recently released PSP 8.70, so they are up to date. I am just checking the FlexNIC firmware levels.

neilburton
Advisor

Re: VC FlexFabric simple failover testing

Guys, we've solved the problem: it was a driver/firmware issue.

 

I can't confirm the exact previous versions, but I had been assured the blades had been updated with Firmware DVD 9.20 and PSP 8.70.

 

I identified the following standalone firmware and driver packages and requested that they be installed on the Windows blades - this has resolved the failover problem entirely:

 

  • Combined Windows (FCoE and NIC) Driver Kit - F:2.33.008/N:2.102.517.0 (15 Feb 2011)
  • LOM Firmware image for offline update - 2.102.517.703 (23 Feb 2011)
  • BIOS - System ROM - 2010.12.20 (B) (25 Feb 2011)
  • Firmware - Lights-Out Management - 1.20 (5 Apr 2011)
  • OneCommand Manager Application Kit - 5.0.80.5 (13 Dec 2010)
chuckk281
Trusted Contributor

Re: VC FlexFabric simple failover testing

Great to hear. Just a lot of work.