IRF-Fabric in Split brain but no restore possible?

 
Frequent Visitor

Hi,

We are facing a problem after updating our FlexFabric 7900 switch to patch version 7.

Our IRF fabric has been in a split-brain situation since we rebooted a module (possibly the wrong one).

IRF member 1 is now the active switch and IRF member 2 is disabled.

 

############## IRF Switch 1 ###################

[sw79]display irf
MemberID  Slot  Role    Priority  CPU-Mac         Description
*+1      10    Master  1         00e0-fc0f-8c0b  sw79
   1      11    Standby 1         00e0-fc0f-8c0c  sw79
--------------------------------------------------
* indicates the device is the master.
+ indicates the device through which the user logs in.

The bridge MAC of the IRF is: 2c23-3a00-ff80
Auto upgrade                : yes
Mac persistent              : always
Domain ID                   : 2
Auto merge                  : yes

#############################################

 

############## IRF Switch 2 #####################

<sw79>display irf
MemberID  Slot  Role    Priority  CPU-Mac         Description
*+2      10    Master  1         00e0-fc0f-8c23  sw83
   2      11    Standby 1         00e0-fc0f-8c24  sw83
--------------------------------------------------

* indicates the device is the master.
+ indicates the device through which the user logs in.

The bridge MAC of the IRF is: 2c23-3a01-0000
Auto upgrade                : yes
Mac persistent              : always
Domain ID                   : 2
Auto merge                  : yes
##############################################

 

The LACP MAD output is as follows:

############## IRF Switch 1 ###################

[sw79]display mad verbose
Multi-active recovery state: No
Excluded ports (user-configured):
Excluded ports (system-configured):
  IRF physical interfaces:
    FortyGigE1/0/0/1
MAD ARP disabled.
MAD ND disabled.
MAD LACP enabled interface: Bridge-Aggregation2
  MAD status                : Faulty
  Member ID    Port                                    MAD status
  1            Ten-GigabitEthernet1/1/0/21             Faulty   
MAD BFD disabled.

##############################################

 

############# IRF Switch 2 ######################

<sw79>display mad verbose
Multi-active recovery state: Yes
Excluded ports (user-configured):
Excluded ports (system-configured):
  IRF physical interfaces:
    FortyGigE2/0/0/1
MAD ARP disabled.
MAD ND disabled.
MAD LACP enabled interface: Bridge-Aggregation2
  MAD status                : Faulty
  Member ID    Port                                    MAD status
  2            Ten-GigabitEthernet2/1/0/21             Faulty   
MAD BFD disabled.
##############################################

 

We rebooted IRF switch 2, with no success.

We also ran “mad restore” on IRF switch 2,
but nothing helped to get the IRF working again.


Does anyone have an idea?


If you need more information, please let me know.

 

Best regards

 

Christoph

HPE Pro

Re: IRF-Fabric in Split brain but no restore possible?

Hi @CR85 !

Since you rebooted Chassis 2, did you see any errors right after the reboot that explain why the IRF can't form? Use "display version" to check that both chassis are running the same software version. Since you mention an upgrade, it is quite possible that one chassis stayed on the older version, which would prevent the IRF from forming. Logbuffer messages from both chassis before and after the reboot of Chassis 2 should reveal why the IRF process can't form the stack. Also, what is the state of the IRF ports on both chassis: up or down? What is the state of BAGG1 from the perspective of Chassis 1 and Chassis 2?

 

 

I am an HPE employee


Frequent Visitor

Re: IRF-Fabric in Split brain but no restore possible?

Both switches are running the same version (Release 2713 with patch H07):

 

HPE Comware Software, Version 7.1.070, Release 2713
Copyright (c) 2010-2018 Hewlett Packard Enterprise Development LP
HPE FF 7910 uptime is 0 weeks, 0 days, 3 hours, 43 minutes
Last reboot reason : User reboot

Boot image: flash:/7900-CMW710-BOOT-R2713.bin
Boot image version: 7.1.070P2216, Release 2713
  Compiled Aug 24 2018 11:00:00
System image: flash:/7900-CMW710-SYSTEM-R2713.bin
System image version: 7.1.070, Release 2713
  Compiled Aug 24 2018 11:00:00
Patch image(s) list:
  flash:/7900-CMW710-SYSTEM-R2713H07.bin, version: P007
    Compiled Aug 24 2018 11:00:00
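The version comparison suggested above can also be scripted when the "display version" output of each chassis has been saved as text. The following is only a minimal sketch: the function name `image_versions` is made up for illustration, the first sample is copied from the output above, and the second sample is a hypothetical unpatched chassis whose exact output format is an assumption.

```python
import re

def image_versions(display_version_text):
    """Return (system image version, sorted patch versions) found in the text."""
    system = re.search(r"^System image version:\s*(.+?)\s*$",
                       display_version_text, re.M)
    patches = re.findall(r"^\s*flash:/\S+,\s*version:\s*(\S+)",
                         display_version_text, re.M)
    return (system.group(1) if system else None, sorted(patches))

# Excerpt taken from the "display version" output posted in this thread.
PATCHED = """\
System image: flash:/7900-CMW710-SYSTEM-R2713.bin
System image version: 7.1.070, Release 2713
  Compiled Aug 24 2018 11:00:00
Patch image(s) list:
  flash:/7900-CMW710-SYSTEM-R2713H07.bin, version: P007
    Compiled Aug 24 2018 11:00:00
"""

# Hypothetical chassis that missed the patch (format assumed, not from the thread).
UNPATCHED = """\
System image: flash:/7900-CMW710-SYSTEM-R2713.bin
System image version: 7.1.070, Release 2713
  Compiled Aug 24 2018 11:00:00
"""

same = image_versions(PATCHED) == image_versions(UNPATCHED)
print("same software level:", same)
```

A mismatch in either the system image version or the patch list would be a reason to look at the software level before anything else.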

 

On Chassis 2 I see the following output in the logbuffer:

 

%Nov 20 15:07:46:018 2020 sw79 LLDP/5/LLDP_NEIGHBOR_AGE_OUT: -Chassis=2-Slot=1; Nearest bridge agent neighbor aged out on port Ten-GigabitEthernet2/1/0/8 (IfIndex 6033), neighbor's chassis ID is b8d4-e750-6a00, port ID is 55.

%Nov 20 15:07:46:017 2020 sw79 LLDP/5/LLDP_NEIGHBOR_AGE_OUT: -Chassis=2-Slot=1; Nearest bridge agent neighbor aged out on port Ten-GigabitEthernet2/1/0/18 (IfIndex 6043), neighbor's chassis ID is 7010-6f86-c100, port ID is 55.

%Nov 20 15:07:45:602 2020 sw79 LLDP/5/LLDP_NEIGHBOR_AGE_OUT: -Chassis=2-Slot=4; Nearest bridge agent neighbor aged out on port Ten-GigabitEthernet2/4/0/2 (IfIndex 6750), neighbor's chassis ID is 2880-2391-2168, port ID is 2880-2391-2168.

%Nov 20 15:07:45:643 2020 sw79 SHELL/6/SHELL_CMD: -Line=aux2/0-IPAddr=**-User=**; Command is display logbuffer reverse
%Nov 20 15:07:23:796 2020 sw79 LAGG/4/LACP_MAD_INTERFACE_CHANGE_STATE: LACP MAD function enabled on Bridge-Aggregation2 changed to the faulty state.
%Nov 20 15:07:16:236 2020 sw79 SHELL/6/SHELL_CMD: -Line=aux2/0-IPAddr=**-User=**; Command is display logbuffer reverse
%Nov 20 15:06:56:538 2020 sw79 SHELL/6/SHELL_CMD: -Line=aux2/0-IPAddr=**-User=**; Command is display logbuffer
%Nov 20 15:05:53:794 2020 sw79 LAGG/5/LACP_MAD_INTERFACE_CHANGE_STATE: LACP MAD function enabled on Bridge-Aggregation2 changed to the normal state.
%Nov 20 15:05:47:022 2020 sw79 IFNET/3/PHY_UPDOWN: Physical state on the interface FortyGigE2/0/0/6 changed to down.
%Nov 20 15:05:46:938 2020 sw79 IFNET/3/PHY_UPDOWN: Physical state on the interface Ten-GigabitEthernet2/3/0/24 changed to down.
%Nov 20 15:05:46:915 2020 sw79 IFNET/3/PHY_UPDOWN: Physical state on the interface Ten-GigabitEthernet2/3/0/23 changed to down.
%Nov 20 15:05:46:897 2020 sw79 IFNET/3/PHY_UPDOWN: Physical state on the interface Ten-GigabitEthernet2/3/0/22 changed to down.
%Nov 20 15:05:46:895 2020 sw79 IFNET/3/PHY_UPDOWN: Physical state on the interface Ten-GigabitEthernet2/5/0/1 changed to down.
%Nov 20 15:05:46:886 2020 sw79 IFNET/3/PHY_UPDOWN: Physical state on the interface FortyGigE2/0/0/5 changed to down.
%Nov 20 15:05:46:881 2020 sw79 IFNET/3/PHY_UPDOWN: Physical state on the interface Ten-GigabitEthernet2/1/0/24 changed to down.

 

On Chassis 1 I see the following output:

%Nov 20 15:08:32:503 2020 sw79 SHELL/6/SHELL_CMD: -Line=aux1/0-IPAddr=**-User=**; Command is display irf
%Nov 20 15:07:47:300 2020 sw79 SHELL/6/SHELL_CMD: -Line=aux1/0-IPAddr=**-User=**; Command is display irf
%Nov 20 15:07:35:090 2020 sw79 LAGG/4/LACP_MAD_INTERFACE_CHANGE_STATE: LACP MAD function enabled on Bridge-Aggregation2 changed to the faulty state.
%Nov 20 15:06:05:500 2020 sw79 SHELL/5/SHELL_LOGIN: TTY logged in from aux1/0.
%Nov 20 15:05:55:090 2020 sw79 LAGG/5/LACP_MAD_INTERFACE_CHANGE_STATE: LACP MAD function enabled on Bridge-Aggregation2 changed to the normal state.
%Nov 20 14:13:59:746 2020 sw79 SHELL/5/SHELL_LOGOUT: manager logged out from 192.168.168.16.
%Nov 20 14:13:59:743 2020 sw79 SSHS/6/SSHS_DISCONNECT: SSH user manager (IP: 192.168.168.16) disconnected from the server.

 

IRF Status / IRF-Port Status / BAGG2 Status --- Chassis1  (Online Chassis)

 

[sw79]display irf link 
Member 1
 IRF Port  Interface                             Status
 1         disable                               --    
 2         FortyGigE1/0/0/1(MDC1)                DOWN  
[sw79]display irf top


[sw79]display irf topology 
                              Topology Info
 -------------------------------------------------------------------------
               IRF-Port1                IRF-Port2          
 MemberID    Link       neighbor      Link       neighbor    Belong To
 1           DIS        ---           DOWN       ---         00e0-fc0f-8c0b


[sw79]display link-aggregation verbose Bridge-Aggregation 2
Loadsharing Type: Shar -- Loadsharing, NonS -- Non-Loadsharing 
Port Status: S -- Selected, U -- Unselected, I -- Individual 
Port: A -- Auto port, M -- Management port, R -- Reference port 
Flags:  A -- LACP_Activity, B -- LACP_Timeout, C -- Aggregation, 
        D -- Synchronization, E -- Collecting, F -- Distributing, 
        G -- Defaulted, H -- Expired 


Aggregate Interface: Bridge-Aggregation2
Creation Mode: Manual
Aggregation Mode: Dynamic
Loadsharing Type: Shar
Management VLANs: None
System ID: 0x8000, 2c23-3a00-ff80
Local: 
  Port                Status   Priority Index    Oper-Key               Flag
  XGE1/1/0/21         S        32768    181      70                     {ACDEF}
Remote: 
  Actor               Priority Index    Oper-Key SystemID               Flag   
  XGE1/1/0/21(R)      0        49       290      0x2bc0, c091-34e3-2bc0 {ACDEF}

[sw79]display interface FortyGigE 1/0/0/1 brief

Type: A - access; T - trunk; H - hybrid
Interface            Link Speed   Duplex Type PVID Description                
FGE1/0/0/1           DOWN auto    A      --   --   IRF-Link to sw83

 

 

IRF Status / IRF-Port Status / BAGG2 Status --- Chassis2  (Inactive Chassis)

 

 

<sw79>display irf link 
Member 2
 IRF Port  Interface                             Status
 1         FortyGigE2/0/0/1(MDC1)                DOWN  
 2         disable                               --    


<sw79>display irf topolo
                              Topology Info
 -------------------------------------------------------------------------
               IRF-Port1                IRF-Port2          
 MemberID    Link       neighbor      Link       neighbor    Belong To
 2           DOWN       ---           DIS        ---         00e0-fc0f-8c23



<sw79>display link-aggregation verbose Bridge-Aggregation 2
Loadsharing Type: Shar -- Loadsharing, NonS -- Non-Loadsharing 
Port Status: S -- Selected, U -- Unselected, I -- Individual 
Port: A -- Auto port, M -- Management port, R -- Reference port 
Flags:  A -- LACP_Activity, B -- LACP_Timeout, C -- Aggregation, 
        D -- Synchronization, E -- Collecting, F -- Distributing, 
        G -- Defaulted, H -- Expired 

Aggregate Interface: Bridge-Aggregation2
Creation Mode: Manual
Aggregation Mode: Dynamic
Loadsharing Type: Shar
Management VLANs: None
System ID: 0x8000, 2c23-3a01-0000
Local: 
  Port                Status   Priority Index    Oper-Key               Flag
  XGE2/1/0/21         U        32768    105      105                    {AC}
Remote: 
  Actor               Priority Index    Oper-Key SystemID               Flag   
  XGE2/1/0/21         0        50       290      0x2bc0, c091-34e3-2bc0 {ACEF}


<sw79>display interface FortyGigE 2/0/0/1 brief
Brief information on interfaces in bridge mode:
Link: ADM - administratively down; Stby - standby
Speed: (a) - auto
Duplex: (a)/A - auto; H - half; F - full
Type: A - access; T - trunk; H - hybrid
Interface            Link Speed   Duplex Type PVID Description                
FGE2/0/0/1           DOWN auto    A      --   --   IRF-Link to sw79

 

 

What I have noticed is that on Chassis 1 there is:

 

<sw79>display mad verbose 
Multi-active recovery state: No

 

 

 

and on Chassis 2

 

<sw79>display mad verbose 
Multi-active recovery state: Yes

 

HPE Pro

Re: IRF-Fabric in Split brain but no restore possible?

Hi @CR85 !

Thank you for the information provided!

Could you post output from the following commands:

Chassis1:
=========
display system-working-mode
display ecmp mode
display interface FortyGigE1/0/0/1
display transceiver diagnosis interface FortyGigE1/0/0/1
display transceiver alarm interface FortyGigE1/0/0/1

Chassis2:
=========
display system-working-mode
display ecmp mode
display interface FortyGigE2/0/0/1
display transceiver diagnosis interface FortyGigE2/0/0/1
display transceiver alarm interface FortyGigE2/0/0/1

 

 

I am an HPE employee


HPE Pro

Re: IRF-Fabric in Split brain but no restore possible?

Just a note while waiting for the output: at this time Chassis 2 is in recovery state and all its ports except FortyGigE2/0/0/1 are shut down. MAD is in faulty state because XGE2/1/0/21 is down, disabled by MAD (we will verify that when I get the output of the requested commands). Until you restore the IRF link (e.g. restore the physical link between FortyGigE1/0/0/1 and FortyGigE2/0/0/1), Chassis 2 will stay isolated: it holds the Master role, but all interfaces except FortyGigE2/0/0/1 are down to minimize the network impact of the split brain.

That is why I requested the output of the display transceiver commands: to see what is going on with the physical link between FortyGigE1/0/0/1 and FortyGigE2/0/0/1. If we see no alarms and normal Tx and Rx signal levels but the ports are still down, try re-plugging the transceiver on Chassis 1 and check whether the interface comes up. Monitor the log messages on both chassis: if they sense that the IRF physical port comes up, they should try to form the IRF, and if that fails there should be an error message giving the reason.
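To see at a glance which chassis is in recovery state and which MAD ports are faulty, saved "display mad verbose" output can be summarized with a few lines of Python. This is a rough sketch, not an HPE tool: the function name `mad_summary` is invented for illustration, and the parsing assumes the output layout posted for Chassis 2 earlier in this thread.

```python
import re

def mad_summary(display_mad_verbose_text):
    """Extract the recovery state and the faulty MAD ports from the text."""
    state = re.search(r"Multi-active recovery state:\s*(\w+)",
                      display_mad_verbose_text)
    in_recovery = (state.group(1) == "Yes") if state else None
    # Port rows look like: "  2   Ten-GigabitEthernet2/1/0/21   Faulty"
    faulty_ports = re.findall(r"^\s*\d+\s+(\S+)\s+Faulty\s*$",
                              display_mad_verbose_text, re.M)
    return {"in_recovery": in_recovery, "faulty_ports": faulty_ports}

# Output of "display mad verbose" on Chassis 2, as posted in this thread.
CHASSIS2 = """\
Multi-active recovery state: Yes
Excluded ports (user-configured):
Excluded ports (system-configured):
  IRF physical interfaces:
    FortyGigE2/0/0/1
MAD ARP disabled.
MAD ND disabled.
MAD LACP enabled interface: Bridge-Aggregation2
  MAD status                : Faulty
  Member ID    Port                                    MAD status
  2            Ten-GigabitEthernet2/1/0/21             Faulty   
MAD BFD disabled.
"""

summary = mad_summary(CHASSIS2)
print(summary)
```

For the Chassis 2 output this reports recovery state "Yes" with Ten-GigabitEthernet2/1/0/21 as the only faulty MAD port, which matches the isolation described above.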

 

 

I am an HPE employee


Frequent Visitor

Re: IRF-Fabric in Split brain but no restore possible?

First of all,

many thanks for your help.

Here is the output you asked for.

 

Switch 1:

<sw79>display system-working-mode
The current system working mode is standard.
The system working mode for next startup is standard.

<sw79>display ecmp mode 
  ECMP-Mode in use: Default 
  ECMP-Mode at the next reboot: Default

<sw79>display interface FortyGigE 1/0/0/1
FortyGigE1/0/0/1
Current state: DOWN
IP packet frame type: Ethernet II, hardware address: 2c23-3a00-ff80
Description: IRF-Link to sw83
Bandwidth: 40000000 kbps
Loopback is not set
Media type is stack wire, port hardware type is STACK_QSFP_PLUS
Ethernet port mode: LAN
Unknown-speed mode, unknown-duplex mode
Link speed type is autonegotiation, link duplex type is autonegotiation
Maximum frame length: 1500
MDI type: Automdix
Last link flapping: 2 days 6 hours 8 minutes
Last clearing of counters: Never
 Peak input rate: 46 bytes/sec, at 2020-11-20 10:06:27 
 Peak output rate: 116 bytes/sec, at 2020-11-20 10:06:27 
 Last 300 second input: 0 packets/sec 0 bytes/sec -%
 Last 300 second output: 0 packets/sec 0 bytes/sec -%
 Input (total):  28 packets, 13931 bytes
         0 unicasts, 0 broadcasts, 6 multicasts, 0 pauses
 Input (normal):  6 packets, - bytes
         0 unicasts, 0 broadcasts, 6 multicasts, 0 pauses
 Input:  0 input errors, 0 runts, 0 giants, 0 throttles
         0 CRC, 0 frame, - overruns, 0 aborts
         - ignored, - parity errors
 Output (total): 80 packets, 35030 bytes
         22 unicasts, 0 broadcasts, 58 multicasts, 0 pauses
 Output (normal): 80 packets, - bytes
         22 unicasts, 0 broadcasts, 58 multicasts, 0 pauses
 Output: 0 output errors, - underruns, 0 buffer failures
         0 aborts, 0 deferred, 0 collisions, 0 late collisions
         0 lost carrier, - no carrier

<sw79>display transceiver diagnosis interface FortyGigE 1/0/0/1
The transceiver does not support this function.

<sw79>display transceiver alarm interface FortyGigE 1/0/0/1
FortyGigE1/0/0/1 transceiver current alarm information:
  None

 

Switch2: 

<sw79>display system-working-mode 
The current system working mode is standard.
The system working mode for next startup is standard.

<sw79>display ecmp mode 
  ECMP-Mode in use: Default 
  ECMP-Mode at the next reboot: Default

<sw79>display interface FortyGigE 2/0/0/1
FortyGigE2/0/0/1
Current state: DOWN
IP packet frame type: Ethernet II, hardware address: 2c23-3a01-0000
Description: IRF-Link to sw79
Bandwidth: 40000000 kbps
Loopback is not set
Media type is stack wire, port hardware type is STACK_QSFP_PLUS
Ethernet port mode: LAN
Unknown-speed mode, unknown-duplex mode
Link speed type is autonegotiation, link duplex type is autonegotiation
Maximum frame length: 1500
MDI type: Automdix
Last link flapping: Never
Last clearing of counters: Never
 Peak input rate: 0 bytes/sec, at 2020-11-20 11:42:23 
 Peak output rate: 0 bytes/sec, at 2020-11-20 11:42:23 
 Last 300 second input: 0 packets/sec 0 bytes/sec -%
 Last 300 second output: 0 packets/sec 0 bytes/sec -%
 Input (total):  0 packets, 0 bytes
         0 unicasts, 0 broadcasts, 0 multicasts, 0 pauses
 Input (normal):  0 packets, - bytes
         0 unicasts, 0 broadcasts, 0 multicasts, 0 pauses
 Input:  0 input errors, 0 runts, 0 giants, 0 throttles
         0 CRC, 0 frame, - overruns, 0 aborts
         - ignored, - parity errors
 Output (total): 0 packets, 0 bytes
         0 unicasts, 0 broadcasts, 0 multicasts, 0 pauses
 Output (normal): 0 packets, - bytes
         0 unicasts, 0 broadcasts, 0 multicasts, 0 pauses
 Output: 0 output errors, - underruns, 0 buffer failures
         0 aborts, 0 deferred, 0 collisions, 0 late collisions
         0 lost carrier, - no carrier

<sw79>display transceiver diagnosis interface FortyGigE 2/0/0/1
The transceiver does not support this function.

<sw79>display transceiver alarm interface FortyGigE 2/0/0/1
FortyGigE2/0/0/1 transceiver current alarm information:
  None

 

If I am right, the IRF link uses a DAC cable.

 

HPE Pro

Re: IRF-Fabric in Split brain but no restore possible?

Yes, it's a DAC cable, which is why there are no monitoring values. I am wondering why the link stays down, because I think this is the root cause: IRF link stays down -> IRF is not formed -> split brain -> MAD disables the ports on Chassis 2.

Try re-plugging the transceiver on Chassis 1 and 2 and check whether the link comes up. Monitor the log messages on both chassis: if they sense that the IRF physical port comes up, they should try to form the IRF, and if that fails there should be an error message giving the reason. Do you have a spare DAC to try, just in case?

I am an HPE employee


Frequent Visitor

Re: IRF-Fabric in Split brain but no restore possible?

Hi Ivan, 


Try 1:

* We reseated the DAC cable on both switches, with no success.

Try 2:
* We unplugged the DAC cable.
* We inserted new optical transceivers into those ports and connected a cable, with no success.


During Try 2 the ports came up briefly, but went down again after about 8 seconds.

Logfile sw79 - Active:

##########
Try2
##########
%Nov 23 13:32:08:681 2020 sw79 LAGG/4/LACP_MAD_INTERFACE_CHANGE_STATE: LACP MAD function enabled on Bridge-Aggregation2 changed to the faulty state.
%Nov 23 13:27:45:171 2020 sw79 IFNET/5/LINK_UPDOWN: Line protocol state on the interface FortyGigE1/0/0/1 changed to down.
%Nov 23 13:27:45:171 2020 sw79 IFNET/3/PHY_UPDOWN: Physical state on the interface FortyGigE1/0/0/1 changed to down.
%Nov 23 13:27:37:533 2020 sw79 IFNET/5/LINK_UPDOWN: Line protocol state on the interface FortyGigE1/0/0/1 changed to up.
%Nov 23 13:27:37:532 2020 sw79 IFNET/3/PHY_UPDOWN: Physical state on the interface FortyGigE1/0/0/1 changed to up.
%Nov 23 13:27:31:323 2020 sw79 SHELL/6/SHELL_CMD: -Line=aux1/0-IPAddr=**-User=**; Command is display interface brief | incl FGE
%Nov 23 13:27:06:331 2020 sw79 OPTMOD/4/MODULE_IN: -Chassis=1-Slot=0; FortyGigE1/0/0/1: The transceiver is 40G_BASE_LR4_QSFP_PLUS.
%Nov 23 13:26:47:853 2020 sw79 OPTMOD/4/MODULE_OUT: -Chassis=1-Slot=0; FortyGigE1/0/0/1: Transceiver absent.
%Nov 23 13:24:31:652 2020 sw79 SHELL/6/SHELL_CMD: -Line=aux1/0-IPAddr=**-User=**; Command is display interface brief | incl FGE
%Nov 23 13:22:56:470 2020 sw79 OPTMOD/4/MODULE_IN: -Chassis=1-Slot=0; FortyGigE1/0/0/12: The transceiver is 40G_BASE_SR4_QSFP_PLUS.
%Nov 23 13:22:32:490 2020 sw79 OPTMOD/4/MODULE_OUT: -Chassis=1-Slot=0; FortyGigE1/0/0/12: Transceiver absent.

##########
Try 1
##########

%Nov 23 13:11:57:536 2020 sw79 OPTMOD/4/MODULE_IN: -Chassis=1-Slot=0; FortyGigE1/0/0/1: The transceiver is STACK_QSFP_PLUS.
%Nov 23 13:08:50:726 2020 sw79 OPTMOD/4/MODULE_OUT: -Chassis=1-Slot=0; FortyGigE1/0/0/1: Transceiver absent.

 

Logfile of sw79 - Inactive 

%@176048%Nov 23 13:07:36:046 2020 sw79 SHELL/5/SHELL_LOGIN: TTY logged in from aux2/0.
%@176049%Nov 23 13:08:27:693 2020 sw79 OPTMOD/4/MODULE_OUT: -Chassis=2-Slot=0; FortyGigE2/0/0/1: Transceiver absent.
%@176050%Nov 23 13:11:50:370 2020 sw79 OPTMOD/4/MODULE_IN: -Chassis=2-Slot=0; FortyGigE2/0/0/1: The transceiver is STACK_QSFP_PLUS.
%@176051%Nov 23 13:15:21:598 2020 sw79 SHELL/6/SHELL_CMD: -Line=aux2/0-IPAddr=**-User=**; Command is display interface brief
%@176052%Nov 23 13:26:59:356 2020 sw79 OPTMOD/4/MODULE_OUT: -Chassis=2-Slot=0; FortyGigE2/0/0/1: Transceiver absent.
%@176053%Nov 23 13:27:02:127 2020 sw79 SHELL/5/SHELL_LOGOUT: TTY logged out from aux2/0.
%@176054%Nov 23 13:27:09:259 2020 sw79 SHELL/5/SHELL_LOGIN: TTY logged in from aux2/0.
%@176055%Nov 23 13:27:27:247 2020 sw79 OPTMOD/4/MODULE_IN: -Chassis=2-Slot=0; FortyGigE2/0/0/1: The transceiver is 40G_BASE_LR4_QSFP_PLUS.
%@176056%Nov 23 13:27:29:176 2020 sw79 IFNET/3/PHY_UPDOWN: Physical state on the interface FortyGigE2/0/0/1 changed to up.
%@176057%Nov 23 13:27:29:178 2020 sw79 IFNET/5/LINK_UPDOWN: Line protocol state on the interface FortyGigE2/0/0/1 changed to up.
%@176058%Nov 23 13:27:29:219 2020 sw79 IFNET/3/PHY_UPDOWN: Physical state on the interface FortyGigE2/0/0/1 changed to down.
%@176059%Nov 23 13:27:29:220 2020 sw79 IFNET/5/LINK_UPDOWN: Line protocol state on the interface FortyGigE2/0/0/1 changed to down.

After we reseated the transceiver with the DAC cable, the inactive switch rebooted.

The reboot shows up in the log files, but nothing changed.
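To put numbers on how long the IRF port stayed up during Try 2, the PHY_UPDOWN log lines above can be paired per interface. This is only a sketch under a few assumptions: the function name `up_durations` is made up, the timestamp format is taken from the log lines in this post, and month names are parsed with the default (English) locale. The events are sorted first because the active chassis log was pasted newest-first.

```python
import re
from datetime import datetime

PATTERN = re.compile(
    r"%(\w{3} +\d+ \d\d:\d\d:\d\d:\d{3} \d{4}) \S+ IFNET/3/PHY_UPDOWN: "
    r"Physical state on the interface (\S+) changed to (up|down)\."
)

def up_durations(log_text):
    """Return [(interface, seconds_up)] for every up -> down pair in the log."""
    events = []
    for match in PATTERN.finditer(log_text):
        when = datetime.strptime(match.group(1), "%b %d %H:%M:%S:%f %Y")
        events.append((when, match.group(2), match.group(3)))
    events.sort()  # chronological order, regardless of how the log was pasted
    went_up = {}
    durations = []
    for when, port, state in events:
        if state == "up":
            went_up[port] = when
        elif port in went_up:
            durations.append((port, (when - went_up.pop(port)).total_seconds()))
    return durations

# PHY_UPDOWN lines copied from the Try 2 logs of both chassis above.
LOG = """\
%Nov 23 13:27:45:171 2020 sw79 IFNET/3/PHY_UPDOWN: Physical state on the interface FortyGigE1/0/0/1 changed to down.
%Nov 23 13:27:37:532 2020 sw79 IFNET/3/PHY_UPDOWN: Physical state on the interface FortyGigE1/0/0/1 changed to up.
%@176056%Nov 23 13:27:29:176 2020 sw79 IFNET/3/PHY_UPDOWN: Physical state on the interface FortyGigE2/0/0/1 changed to up.
%@176058%Nov 23 13:27:29:219 2020 sw79 IFNET/3/PHY_UPDOWN: Physical state on the interface FortyGigE2/0/0/1 changed to down.
"""

durations = up_durations(LOG)
for port, seconds in durations:
    print(f"{port}: up for {seconds:.3f} s")
```

On these lines the sketch gives about 7.6 seconds for FortyGigE1/0/0/1 and about 0.04 seconds for FortyGigE2/0/0/1. Each duration is computed from a single chassis log, so it does not depend on the clocks of the two chassis being in sync.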

 

HPE Pro

Re: IRF-Fabric in Split brain but no restore possible?

Hi Christoph!

I have requested "display diag" outputs in PM, let's see, maybe there will be some additional info that could point us to the root cause.

 

I am an HPE employee


HPE Pro

Re: IRF-Fabric in Split brain but no restore possible?

Hi @CR85 !

I have checked the files you uploaded. The most interesting one is "after_cabletest": it is the only file that contains traces of the test with an optical transceiver instead of the DAC, and it shows how the STM process (the main driver of the IRF stack) on Chassis 2 detected IRF packets from Chassis 1:

%@176442%Nov 23 13:42:40:428 2020 sw79 IFNET/3/PHY_UPDOWN: Physical state on the interface FortyGigE2/0/0/1 changed to up.
%@176443%Nov 23 13:42:40:430 2020 sw79 IFNET/5/LINK_UPDOWN: Line protocol state on the interface FortyGigE2/0/0/1 changed to up.
%@176444%Nov 23 13:42:42:981 2020 sw79 STM/6/STM_LINK_UP: IRF port 1 came up.

Then Chassis 2 restarted in order to converge:

%@176446%Nov 23 13:45:37:886 2020 HPE SYSLOG/6/SYSLOG_RESTART: System restarted --

However, after the restart I see no message that FortyGigE2/0/0/1 came up. It seems it is still down after the restart, and that is why MAD kicks in:

%@176813%Nov 23 13:57:27:129 2020 sw79 DEV/1/MAD_DETECT: Multi-active devices detected, please fix it.

Unfortunately, neither of the diags contains any trace of the DACs attempting to come up. I am starting to suspect an issue with the P07 patch...

Could you do me a favor and generate new "display diag" outputs from Chassis 1 and 2? Neither of the existing ones contains messages from that period.
Also, each MPU on both chassis has two directories, 'logfile' and 'diagfile'. Could you retrieve the contents of those directories, pack them into separate ZIP archives, and name each archive after the source of its files, for example:

chassis1#slot10#flash:/logfile/*.*
chassis1#slot10#flash:/diagfile/*.*

pack to ZIP archive named 'chassis1slot10.zip'

chassis1#slot11#flash:/logfile/*.*
chassis1#slot11#flash:/diagfile/*.*

pack to ZIP archive named 'chassis1slot11.zip'

etc. Alternatively, make one archive with a directory structure that tells me which filesystem each file comes from. I am trying to find the STM process error message that gives the reason why the stack did not form during the test with the optical transceivers.
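Once the files have been copied off the switches, the packing can be scripted. The sketch below is just one way to do it and rests on assumptions: the files are assumed to sit in a local folder laid out as `<root>/chassis<N>/slot<M>/logfile` and `.../diagfile` (that layout, the function names, and the sample file names are all hypothetical), and only the archive naming follows the scheme requested above.

```python
import tempfile
import zipfile
from pathlib import Path

def archive_name(chassis, slot):
    """Archive name in the requested style, e.g. 'chassis1slot10.zip'."""
    return f"chassis{chassis}slot{slot}.zip"

def pack_mpu(download_root, chassis, slot, out_dir):
    """Zip the logfile/ and diagfile/ folders of one MPU into one archive."""
    src = Path(download_root) / f"chassis{chassis}" / f"slot{slot}"
    target = Path(out_dir) / archive_name(chassis, slot)
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as zf:
        for sub in ("logfile", "diagfile"):
            folder = src / sub
            if not folder.is_dir():
                continue  # nothing was downloaded for this directory
            for path in sorted(folder.rglob("*")):
                if path.is_file():
                    # Keep 'logfile/...' and 'diagfile/...' inside the archive.
                    zf.write(path, path.relative_to(src).as_posix())
    return target

# Demo with made-up files in a temporary folder (file names are placeholders).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "download"
    for sub, name in (("logfile", "sample1.log"), ("diagfile", "sample2.log")):
        (root / "chassis1" / "slot10" / sub).mkdir(parents=True)
        (root / "chassis1" / "slot10" / sub / name).write_text("sample")
    archive = pack_mpu(root, 1, 10, tmp)
    archive_basename = archive.name
    with zipfile.ZipFile(archive) as zf:
        names = sorted(zf.namelist())

print(archive_basename, names)
```

Calling `pack_mpu` for chassis 1 and 2 with slots 10 and 11 (the MPU slots shown in "display irf") would produce four archives whose names identify the source MPU.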

 

I am an HPE employee
