How to recover from hold time expired for BGP peer link between Aruba switch and Kubernetes speaker?

RichardYu · ‎03-26-2021

When "hold time expired" occurs on the peer link, the switch BGP state machine goes back to the IDLE state. It stays stuck in IDLE until the user runs "clear bgp neighbor_IP_address". Even after I restart the Kubernetes speaker pod, the peer link between the Kubernetes speaker and the Aruba 8320 is still NOT established.

The speaker log shows that it receives "connection reset by peer" about every 90 seconds, and the switch BGP debug log shows "A connection's FSM state has deteriorated" about every 15 seconds. I wonder what the correct way to recover from "hold time expired" is. Should the switch BGP agent restart the FSM after a couple of "connection reset" events? Should this be considered a bug in the speaker or the switch, or is it working as designed? Thanks

The following is from the Kubernetes speaker log:

{"caller":"bgp.go:58","error":"read OPEN from \"10.252.0.2:179\": read tcp 10.252.0.17:51963-\u003e10.252.0.2:179: read: connection reset by peer","localASN":65533,"msg":"failed to connect to peer","op":"connect","peer":"10.252.0.2:179","peerASN":65533,"ts":"2021-03-24T15:32:08.897404559Z"}
{"caller":"bgp.go:58","error":"read OPEN from \"10.252.0.2:179\": read tcp 10.252.0.17:43437-\u003e10.252.0.2:179: read: connection reset by peer","localASN":65533,"msg":"failed to connect to peer","op":"connect","peer":"10.252.0.2:179","peerASN":65533,"ts":"2021-03-24T15:34:08.89886829Z"}
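To check how often the speaker's connection attempts fail, the JSON log lines can be parsed and the gaps between consecutive failures measured. The following is only a minimal sketch: the function names `parse_ts` and `failure_intervals` are made up for illustration, and it assumes every line is a JSON object with `msg` and `ts` fields like the two entries above.

```python
import json
from datetime import datetime, timezone

def parse_ts(ts):
    """Parse a UTC timestamp such as 2021-03-24T15:32:08.897404559Z.

    The speaker logs up to nanosecond precision, so the fractional part
    is truncated (or padded) to microseconds before conversion.
    """
    base, _, frac = ts.rstrip("Z").partition(".")
    micros = int((frac + "000000")[:6])
    dt = datetime.strptime(base, "%Y-%m-%dT%H:%M:%S")
    return dt.replace(microsecond=micros, tzinfo=timezone.utc)

def failure_intervals(lines):
    """Return the seconds between consecutive 'failed to connect to peer' entries."""
    stamps = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if entry.get("msg") == "failed to connect to peer":
            stamps.append(parse_ts(entry["ts"]))
    stamps.sort()
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]

# The two entries pasted above, as raw strings so the JSON escapes survive.
SAMPLE = [
    r'{"caller":"bgp.go:58","error":"read OPEN from \"10.252.0.2:179\": read tcp 10.252.0.17:51963-\u003e10.252.0.2:179: read: connection reset by peer","localASN":65533,"msg":"failed to connect to peer","op":"connect","peer":"10.252.0.2:179","peerASN":65533,"ts":"2021-03-24T15:32:08.897404559Z"}',
    r'{"caller":"bgp.go:58","error":"read OPEN from \"10.252.0.2:179\": read tcp 10.252.0.17:43437-\u003e10.252.0.2:179: read: connection reset by peer","localASN":65533,"msg":"failed to connect to peer","op":"connect","peer":"10.252.0.2:179","peerASN":65533,"ts":"2021-03-24T15:34:08.89886829Z"}',
]

print(failure_intervals(SAMPLE))
```

For the two entries shown here the gap comes out at roughly 120 seconds; a longer excerpt of the log would be needed to confirm the typical retry interval.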

The following is from switch BGP debug log:

2021-03-24:15:01:43.038654|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|FSM Input                  = 0X08VRF Name = default.
2021-03-24:15:01:43.038630|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|New state                  = 0X00
2021-03-24:15:01:43.038606|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|Previous state             = 0X01
2021-03-24:15:01:43.038582|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|Incoming?                  = False
2021-03-24:15:01:43.038559|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|Scope ID                   = 0
2021-03-24:15:01:43.038535|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|Remote port                = 0
2021-03-24:15:01:43.038512|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|Remote address             = 10.252.0.17
2021-03-24:15:01:43.038487|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|Local port                 = 0
2021-03-24:15:01:43.038463|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|Local address              = (none)
2021-03-24:15:01:43.038437|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|Entity index               = 268763136
2021-03-24:15:01:43.038407|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|A connection's FSM state has deteriorated.

Thank you!

akg7 · ‎03-28-2021

Hello,

It seems the device is not getting a response from the neighbor, which is why the hold time is expiring.

I have some queries:

1. Was the BGP neighborship established earlier?

2. If yes, were any recent changes made?

3. What is the product number of the device ('JXXXXX'), and which one is the neighbor device?

4. Please share the 'show log' or 'display log' output from the device.

5. Have you tried to swap the physical cables?

6. What is the running software version?

Thanks!

Note: While I am an HPE Employee, all of my comments (whether noted or not), are my own and are not any official representation of the company

RichardYu · ‎03-29-2021

Hi akg7 HPE pro

Thank you very much for the response.

That is correct: the device is not getting a response from the neighbor, which is why the hold time expires.

My question is how to recover from "hold time expired". The symptom I saw is that I left the cluster running overnight and found some of my BGP links in the "idle" state in the morning. The switch BGP debug log showed the "hold time expired" error. I think there is a problem in the cluster, which we are debugging, that caused the Kubernetes speaker pod to stop responding and sending keepalives for 90 seconds, thus causing the "hold time expired". However, the cluster heals itself, but the BGP peer link does not. In the morning, I did not need to touch anything on the cluster side; I just ran "clear bgp xxx.xxx.xxx.xxx" (the neighbor IP address) and the link was re-established.

My answers to your questions are the following:

1. Was the BGP neighborship established earlier?

Yes, the BGP neighborship was established earlier.

2. If yes, were any recent changes made?

I think there is a problem in the cluster that caused the Kubernetes speaker pod to hang for a while overnight.

3. What is the product number of the device ('JXXXXX'), and which one is the neighbor device?

Service OS Version : TL.01.05.0003
BIOS Version : TL-01-0013
kibo-mgmt-sw0# show device
Invalid input: devic
kibo-mgmt-sw0# show system
Hostname : kibo-mgmt-sw0
System Description : TL.10.05.0001
System Contact :
System Location :

Vendor : Aruba
Product Name : JL581A 8320
Chassis Serial Nbr : TW04KCW00G
Base MAC Address : 20677c-539fc0
ArubaOS-CX Version : TL.10.05.0001

In the log:

10.252.0.17 is the neighbor device, which is the Kubernetes speaker pod IP address.

10.252.0.2 is the Aruba switch IP address.

4. Please share the 'show log' or 'display log' output from the device.

2021-03-24:15:01:43.038654|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|FSM Input                  = 0X08VRF Name = default.
2021-03-24:15:01:43.038630|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|New state                  = 0X00
2021-03-24:15:01:43.038606|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|Previous state             = 0X01
2021-03-24:15:01:43.038582|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|Incoming?                  = False
2021-03-24:15:01:43.038559|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|Scope ID                   = 0
2021-03-24:15:01:43.038535|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|Remote port                = 0
2021-03-24:15:01:43.038512|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|Remote address             = 10.252.0.17
2021-03-24:15:01:43.038487|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|Local port                 = 0
2021-03-24:15:01:43.038463|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|Local address              = (none)
2021-03-24:15:01:43.038437|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|Entity index               = 268763136
2021-03-24:15:01:43.038407|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|A connection's FSM state has deteriorated.
2021-03-24:15:01:28.036738|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|FSM Input                  = 0X08VRF Name = default.
2021-03-24:15:01:28.036714|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|New state                  = 0X00
2021-03-24:15:01:28.036690|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|Previous state             = 0X01
2021-03-24:15:01:28.036665|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|Incoming?                  = False
2021-03-24:15:01:28.036641|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|Scope ID                   = 0
2021-03-24:15:01:28.036617|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|Remote port                = 0
2021-03-24:15:01:28.036594|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|Remote address             = 10.252.0.17
2021-03-24:15:01:28.036569|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|Local port                 = 0
2021-03-24:15:01:28.036545|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|Local address              = (none)
2021-03-24:15:01:28.036520|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|Entity index               = 268763136
2021-03-24:15:01:28.036483|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|A connection's FSM state has deteriorated.
2021-03-24:15:01:13.037499|hpe-routing|LOG_DEBUG|AMM|-|BGP|BGP_EVENT|send_bgpBackwardTransition Trap Remote Address= 10.252.0.17   LastError = ,  status 1, vrf name t
2021-03-24:15:01:13.037468|hpe-routing|LOG_DEBUG|AMM|-|BGP|BGP_EVENT|Peer status - 1, vrfId - 0
2021-03-24:15:01:13.035246|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|Entity index:   268763136VRF Name = default.
2021-03-24:15:01:13.035222|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|Passive:?       False
2021-03-24:15:01:13.035199|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|Neg hold time:  90
2021-03-24:15:01:13.035176|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|Scope ID:       0
2021-03-24:15:01:13.035153|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|Remote port:    0
2021-03-24:15:01:13.035129|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|Remote address: 10.252.0.17
2021-03-24:15:01:13.035105|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|Local port:     0
2021-03-24:15:01:13.035081|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|Local address:  (none)
2021-03-24:15:01:13.035056|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|A connection has left Established state.
2021-03-24:15:01:13.034713|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|FSM Input                  = 0X03VRF Name = default.
2021-03-24:15:01:13.034690|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|New state                  = 0X00
2021-03-24:15:01:13.034666|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|Previous state             = 0X06
2021-03-24:15:01:13.034642|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|Incoming?                  = False
2021-03-24:15:01:13.034618|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|Scope ID                   = 0
2021-03-24:15:01:13.034595|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|Remote port                = 0
2021-03-24:15:01:13.034572|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|Remote address             = 10.252.0.17
2021-03-24:15:01:13.034548|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|Local port                 = 0
2021-03-24:15:01:13.034525|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|Local address              = (none)
2021-03-24:15:01:13.034501|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|Entity index               = 268763136
2021-03-24:15:01:13.034476|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|A connection's FSM state has deteriorated.
2021-03-24:15:01:13.034121|hpe-routing|LOG_ERR|AMM|-|BGP|BGP|Error subcode         = Unspecific (0)VRF Name = default.
2021-03-24:15:01:13.034097|hpe-routing|LOG_ERR|AMM|-|BGP|BGP|Error code            = Hold Timer Expired (4)
2021-03-24:15:01:13.034074|hpe-routing|LOG_ERR|AMM|-|BGP|BGP|Remote BGP ID         = 10.252.0.17
2021-03-24:15:01:13.034051|hpe-routing|LOG_ERR|AMM|-|BGP|BGP|Remote AS number      = 65533
2021-03-24:15:01:13.034026|hpe-routing|LOG_ERR|AMM|-|BGP|BGP|Scope ID              = 0
2021-03-24:15:01:13.033996|hpe-routing|LOG_ERR|AMM|-|BGP|BGP|Remote port           = 0
2021-03-24:15:01:13.033973|hpe-routing|LOG_ERR|AMM|-|BGP|BGP|Remote address        = 10.252.0.17
2021-03-24:15:01:13.033950|hpe-routing|LOG_ERR|AMM|-|BGP|BGP|Local port            = 0
2021-03-24:15:01:13.033927|hpe-routing|LOG_ERR|AMM|-|BGP|BGP|Local address         = (none)
2021-03-24:15:01:13.033903|hpe-routing|LOG_ERR|AMM|-|BGP|BGP|NM entity index       = 268763136
2021-03-24:15:01:13.033878|hpe-routing|LOG_ERR|AMM|-|BGP|BGP|problem.
2021-03-24:15:01:13.033848|hpe-routing|LOG_ERR|AMM|-|BGP|BGP|A NOTIFICATION message is being sent to a neighbor due to an unexpected

5. Have you tried to swap the physical cables?

I did not swap the physical cables because I do not think this is a cable problem. I can recover with "clear bgp xxx.xxx.xxx.xx" without touching any hardware.

6. What is the running software version?

kibo-mgmt-sw0# show version
-----------------------------------------------------------------------------
ArubaOS-CX
(c) Copyright 2017-2020 Hewlett Packard Enterprise Development LP
-----------------------------------------------------------------------------
Version      : TL.10.05.0001
Build Date   : 2020-07-09 18:09:54 PDT
Build ID     : ArubaOS-CX:TL.10.05.0001:53cb98af4936:202007092355
Build SHA    : 53cb98af4936ce4b2e61fb4bb9dde7e7925c5e29
Active Image : secondary

Service OS Version : TL.01.05.0003
BIOS Version       : TL-01-0013
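
The switch log under question 4 is listed newest first, so the cadence of the repeated "FSM state has deteriorated" events is easier to see with a small script. This is just a sketch under the assumption that every line has the pipe-separated layout shown above (timestamp, process, level, four more fields, then the message); the helper names `parse_line`, `deterioration_times`, and `gaps` are hypothetical.

```python
from datetime import datetime

MARKER = "A connection's FSM state has deteriorated."

def parse_line(line):
    """Split one debug-log line into (timestamp, level, message), or None if malformed."""
    parts = line.rstrip("\n").split("|", 7)
    if len(parts) != 8:
        return None
    ts = datetime.strptime(parts[0], "%Y-%m-%d:%H:%M:%S.%f")
    return ts, parts[2], parts[7].strip()

def deterioration_times(lines):
    """Return the timestamps of all 'FSM state has deteriorated' events, oldest first."""
    parsed = (parse_line(line) for line in lines)
    return sorted(p[0] for p in parsed if p is not None and p[2] == MARKER)

def gaps(times):
    """Return the seconds between consecutive timestamps."""
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]

# A few lines copied from the log above (newest first, as the switch prints them).
SAMPLE = [
    "2021-03-24:15:01:43.038407|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|A connection's FSM state has deteriorated.",
    "2021-03-24:15:01:28.036483|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|A connection's FSM state has deteriorated.",
    "2021-03-24:15:01:13.035056|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|A connection has left Established state.",
    "2021-03-24:15:01:13.034476|hpe-routing|LOG_INFO|AMM|-|BGP|BGP|A connection's FSM state has deteriorated.",
]

print(gaps(deterioration_times(SAMPLE)))
```

On these sample lines the events are about 15 seconds apart, which matches the interval described in the first post.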

Thank you very much for your help.

akg7 · ‎04-02-2021

Hello @RichardYu,

Apologies for delayed response.

From the log, I can see that a NOTIFICATION message was sent and the FSM state deteriorated.

Keepalives ensure that BGP neighbors are still alive; the default interval is 60 seconds. If a device does not receive a keepalive from a peer within the hold-time period, it declares that peer dead. The default hold time is 180 seconds.
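
As a rough illustration of how these timers relate: per RFC 4271, each side proposes a hold time in its OPEN message, the session uses the smaller of the two values, and a reasonable keepalive interval is one third of the hold time. The sketch below only shows that arithmetic. The function names are made up, and the 180/90 inputs are illustrative values chosen to match the "Neg hold time: 90" line in the switch log, not a statement about how either device is actually configured.

```python
def negotiate_hold_time(local_hold, peer_hold):
    """Return the session hold time: the smaller of the two proposed values.

    RFC 4271 requires a hold time of either 0 (timers disabled) or at
    least 3 seconds.
    """
    for value in (local_hold, peer_hold):
        if value != 0 and value < 3:
            raise ValueError("hold time must be 0 or at least 3 seconds")
    return min(local_hold, peer_hold)

def keepalive_interval(hold_time):
    """Return the suggested keepalive interval: one third of the hold time.

    A hold time of 0 means no keepalives are sent, so this returns 0.
    """
    return hold_time // 3

# Illustrative values only: one side proposes 180 seconds, the other 90.
hold = negotiate_hold_time(180, 90)
print(hold, keepalive_interval(hold))
```

With a 90-second negotiated hold time, a peer that stays silent for 90 seconds is enough to trigger the "Hold Timer Expired" NOTIFICATION seen in the log.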

I don't think recovering from the hold-time expiry alone will help you here. This issue needs more investigation from support.

Please log a case on the HPE Support Center portal for further resolution using this link: https://support.hpe.com/hpesc/public/home/

Thanks!

Note: While I am an HPE Employee, all of my comments (whether noted or not), are my own and are not any official representation of the company

RichardYu · ‎04-05-2021

Thank you, akg7

I will open a case with support.

Thanks again.

Richard.
