Re: A daily switch lockup, same time every day

kevets · ‎05-14-2015

I have a set of ProCurve switches (with a few Arubas as well) that are starred off of a core.

I work at a zoo (literally) that is open 363 days a year. We open to the public at 9:30.

Every day, 7 days a week, between 8:50 and 9:40 am a set of switches connected to the core light all link lights solid. The switch logs indicate high collisions, BPDU starved for a receive, CST election queries, LACP and STP blocking. Many of my users are unaware, but it kills the performance for others.

A switch reboot of one, sometimes two of the affected switches clears the problem for the day. Then the next day, boom, same thing at roughly the same time.

I have tried admin-edge-port, bpdu-protect, etc. thinking this was perhaps a spanning tree problem. To no avail. I have tried disabling ports, and walking around at the problem times to see who just powered something on. To no avail.

Anybody have any good ideas? I am wondering if it's an ARP table size limit being hit? These are 2810's connected to a 3500. Up-to-date firmware. This daily need for a reset is getting tiresome!

Vince-Whirlwind · ‎05-14-2015

If that was my network, I would straightaway be thinking that somebody is switching something on at that time of the morning which is causing the issue.

I would also be thinking that it could be a device that is causing a loop on the network, but only while it boots up - this would be a device that has access to multiple switchports, or, is attached to multiple VLANs and is temporarily bridging them while it boots up.

More likely though, I would be thinking that some kind of multicast is being generated as something is booted up, which is not liked by either one of your switches.

I would therefore be having a close look at the environment to try to understand what events are taking place at this time of day.

Also, I would ensure my switches are logging to a syslog server, so I could access the messages that have been generated when the switch locks up.

I might even do a traffic capture to see if I can see anything odd, although until you have some idea of where/what the culprit is, this is a needle in the haystack.

I'd be looking at configuring multicast throughout the network (switch querier off on all the edge switches).

Those are the first few ideas I'd have. From there you should be able to narrow it down.
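If a traffic capture is taken, one quick way to see whether the storm is broadcast or multicast is to bucket frames by destination MAC. The following is only a minimal sketch: the function names and the sample addresses are made up for illustration, and it assumes the destination MACs have already been exported from the capture as text.

```python
from collections import Counter


def classify_destination(mac: str) -> str:
    """Classify an Ethernet destination MAC as broadcast, multicast or unicast."""
    octets = [int(part, 16) for part in mac.replace("-", ":").split(":")]
    if len(octets) != 6:
        raise ValueError(f"expected 6 octets, got {mac!r}")
    if all(octet == 0xFF for octet in octets):
        return "broadcast"
    # The least significant bit of the first octet is the group (multicast) bit.
    if octets[0] & 0x01:
        return "multicast"
    return "unicast"


def summarize(destinations):
    """Count how many frames fall into each class."""
    return Counter(classify_destination(mac) for mac in destinations)


if __name__ == "__main__":
    # Hypothetical sample of destination addresses, not taken from the thread.
    sample = [
        "ff:ff:ff:ff:ff:ff",
        "01:00:5e:00:00:fb",
        "00:1c:2e:16:f4:c0",
        "ff:ff:ff:ff:ff:ff",
    ]
    print(summarize(sample))
```

A capture dominated by the broadcast or multicast bucket during the problem window would support the storm theory; a mostly unicast capture would point elsewhere.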

kevets · ‎05-15-2015

Thanks Vince!

The "attached to multiple vlans" is an interesting possibility. I have wireless Access Points on VLAN 114. The only issue is, these are on 24/7. But perhaps a laptop is connecting and bridging both, although I can't quite figure out how that would be, or why only 1 device would cause problems at boot-up, as I have a lot of wireless connections.

Since we're a non-profit, it's me and two other guys to manage a lot of diverse stuff. None of us is that up on network engineering. I am doing syslogging, and this has given me the insight I have. I also notice that the LACP errors and the out-of-packet-buffer errors get eaten in the storm, so I still need to log in and review logs on a switch-by-switch basis.

Given my relative lack of skills here, can you tell me more about multicast?

And I have, to use a data warehousing term, a snowflake topology: a daisy chain of five 6200s and one 3500, to which are connected edge switches (2810s, 2530, 2510 and some Aruba S1500). In a few cases, I have another hop to leaf nodes.

Is it safe (or a good/bad idea) to turn off STP on the leaf nodes? It seems like these switches, which only have a single uplink, have a simple job and STP is not needed for them? I ask, because the morning storm has a few 2810's announce "Starved for a BPDU rx" and then they go into CST election.

kevets · ‎05-15-2015

I syslog and take the results into SQL server so I can sift them down. Today's event began at 9:33. Syslog entries that match:

Where Message like '%CST%'
or Message like '%CIST%'
or Message like '%drop%'
or Message like '%system:%'
or Message like '%Blocked by LACP%'
or Message like '%CRC%'

are, in reverse chronological order:

May 15 09:43:53 10.110.138.101 03362 auth: User 'admin' logged in from 10.1.0.136 to TELNET session
May 15 09:43:53 10.110.138.101 00179 mgr: SME TELNET from 10.1.0.136 - MANAGER Mode
May 15 09:39:26 10.110.138.101 00077 ports: port 1 is now off-line
May 15 09:39:08 10.110.138.101 00076 ports: port 8 is now on-line
May 15 09:39:02 10.110.138.101 00435 ports: port 8 is Blocked by STP
May 15 09:38:59 10.110.138.101 00435 ports: port 17 is Blocked by STP
May 15 09:38:59 10.110.138.101 00076 ports: port 17 in Trk4 is now on-line
May 15 09:38:57 10.110.138.165 FFI: port 17-High collision or drop rate. See help.
May 15 09:38:43 10.110.138.101 00077 ports: port 8 is now off-line
May 15 09:38:11 10.110.138.103 ports: port 48 is Blocked by LACP
May 15 09:38:02 10.110.138.128 ports: port 47 is Blocked by LACP
May 15 09:37:59 10.110.138.101 00435 ports: port 17 is Blocked by LACP
May 15 09:37:27 10.110.138.101 00236 snmp: Security access violation from 10.10.2.87 for the community name or user name : Adm1n2
May 15 09:36:30 10.110.138.128 ports: port 47 is Blocked by LACP
May 15 09:36:09 10.110.138.165 FFI: port 17-High collision or drop rate. See help.
May 15 09:35:52 10.110.138.177 00331 FFI: port 4-High collision or drop rate. See help.
May 15 09:35:54 10.110.138.128 FFI: port 3-High collision or drop rate. See help.
May 15 09:35:54 10.110.138.128 FFI: port 13-High collision or drop rate. See help.
May 15 09:35:54 10.110.138.128 FFI: port 31-High collision or drop rate. See help.
May 15 09:35:54 10.110.138.128 FFI: port 45-High collision or drop rate. See help.
May 15 09:35:44 10.110.138.107 FFI: port 3-High collision or drop rate. See help.
May 15 09:35:44 10.110.138.107 FFI: port 22-High collision or drop rate. See help.
May 15 09:35:44 10.110.138.107 FFI: port 24-High collision or drop rate. See help.
May 15 09:35:24 10.110.138.110 FFI: port 1-High collision or drop rate. See help.
May 15 09:35:24 10.110.138.110 FFI: port 10-High collision or drop rate. See help.
May 15 09:35:24 10.110.138.110 FFI: port 11-High collision or drop rate. See help.
May 15 09:35:24 10.110.138.110 FFI: port 32-High collision or drop rate. See help.
May 15 09:35:24 10.110.138.110 FFI: port 43-High collision or drop rate. See help.
May 15 09:35:03 10.110.138.101 00435 ports: port 22 is Blocked by LACP
May 15 09:34:56 10.110.138.101 00076 ports: port 17 in Trk4 is now on-line
May 15 09:34:56 10.110.138.101 00076 ports: port 18 in Trk4 is now on-line
May 15 09:34:39 10.110.138.117 FFI: port 19-High collision or drop rate. See help.
May 15 09:34:39 10.110.138.117 FFI: port 20-High collision or drop rate. See help.
May 15 09:34:38 10.110.138.128 stp: CIST starved for a BPDU Rx on port Trk4 from 0:001c2e-16f4c0
May 15 09:34:38 10.110.138.128 stp: CST Root changed from 0:001c2e-16f4c0 to 32768:0017a4-0a76c0
May 15 09:34:37 10.110.138.101 00435 ports: port 17 is Blocked by STP
May 15 09:34:37 10.110.138.101 00435 ports: port 18 is Blocked by STP
May 15 09:34:33 10.110.138.128 ports: port 47 is Blocked by LACP
May 15 09:33:35 10.110.138.103 ports: port 48 is Blocked by LACP
May 15 09:33:22 10.110.138.103 FFI: port 7-High collision or drop rate. See help.
May 15 09:33:22 10.110.138.103 FFI: port 12-High collision or drop rate. See help.
May 15 09:33:22 10.110.138.103 FFI: port 15-High collision or drop rate. See help.
May 15 09:33:22 10.110.138.103 FFI: port 17-High collision or drop rate. See help.
May 15 09:33:22 10.110.138.103 FFI: port 34-High collision or drop rate. See help.
May 15 09:33:22 10.110.138.103 FFI: port 36-High collision or drop rate. See help.
May 15 09:33:13 10.110.138.102 FFI: port 4-High collision or drop rate. See help.
May 15 09:33:13 10.110.138.102 FFI: port 5-High collision or drop rate. See help.
May 15 09:33:13 10.110.138.102 FFI: port 14-High collision or drop rate. See help.
May 15 09:33:13 10.110.138.102 FFI: port 17-High collision or drop rate. See help.
May 15 09:33:13 10.110.138.102 FFI: port 21-High collision or drop rate. See help.
May 15 09:33:13 10.110.138.102 FFI: port 38-High collision or drop rate. See help.
May 15 09:33:13 10.110.138.102 FFI: port 40-High collision or drop rate. See help.
May 15 09:33:13 10.110.138.102 FFI: port 41-High collision or drop rate. See help.
May 15 09:33:13 10.110.138.102 FFI: port 46-High collision or drop rate. See help.
May 15 09:33:00 10.110.138.128 ports: port 47 is Blocked by LACP

So this pattern seems typical: my problems start with LACP shutting an edge switch down. The 128 is connected to the 101 switch across Trk4, which is ports 47-48. Yet I see only port 47 blocked, and then a broadcast storm across the other edge switches.

Why is LACP blocking, and why is it blocking only 1 interface in a dual-interface trunk?
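To see which switch complains first, the syslog lines can be reduced to the earliest event per switch. This is a rough sketch only: the regular expression assumes the line layout shown in the log above (timestamp, switch IP, message), the year 2015 is taken from the post dates, and the function names are made up for illustration.

```python
import re
from datetime import datetime

# Layout assumed from the log above: "May 15 09:33:00 10.110.138.128 <message>"
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+(\d+\.\d+\.\d+\.\d+)\s+(.*)$"
)


def parse_line(line, year=2015):
    """Return (timestamp, switch_ip, message), or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    stamp, ip, message = match.groups()
    when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return when, ip, message


def first_event_per_switch(lines):
    """Earliest logged event for each switch, sorted oldest first."""
    first = {}
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        when, ip, message = parsed
        if ip not in first or when < first[ip][0]:
            first[ip] = (when, message)
    return sorted((when, ip, message) for ip, (when, message) in first.items())


if __name__ == "__main__":
    # A few lines copied from the log above.
    sample = [
        "May 15 09:34:38 10.110.138.128 stp: CIST starved for a BPDU Rx on port Trk4 from 0:001c2e-16f4c0",
        "May 15 09:33:35 10.110.138.103 ports: port 48 is Blocked by LACP",
        "May 15 09:33:22 10.110.138.103 FFI: port 7-High collision or drop rate. See help.",
        "May 15 09:33:13 10.110.138.102 FFI: port 4-High collision or drop rate. See help.",
        "May 15 09:33:00 10.110.138.128 ports: port 47 is Blocked by LACP",
    ]
    for when, ip, message in first_event_per_switch(sample):
        print(when.time(), ip, message)
```

Run over several days of logs, a switch that is consistently first in this list is a reasonable place to start looking.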

Michael Marshall_5 · ‎05-15-2015

I remember being bitten by something similar to this, which turned out to be due to a limitation in STP. There's a limitation of 7 hops max between any two devices which are part of the same tree. What happens when you go over that is, I think, officially "undefined". In practice things break.

Might be worth checking what HP states the STP limit is, and whether you're at or close to it.

kevets · ‎05-15-2015

Thanks Michael. I remember reading something similar. I think the max I have is 7, but it gets a little murky with VMware servers acting as switches, and also with what you count as a hop.

My furthest is:

PC
switch 137
Switch 165
Switch 116
Switch 112
Switch 109
Switch 100
Switch 101
ESX VMware host (for example, my file server)

So, hmmm, you might be on to something.

Now, the question is, can I do anything about that?
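The hop count for that path can be checked mechanically. This is a small sketch under two assumptions: every entry named "switch" counts as one bridge, and the limit is the figure of 7 quoted above. Whether the ESX host's virtual switch counts is left as a flag, since that is exactly the murky part. The names `PATH`, `MAX_BRIDGES` and `count_bridges` are made up for illustration.

```python
# Path copied from the post above.
PATH = [
    "PC", "switch 137", "Switch 165", "Switch 116", "Switch 112",
    "Switch 109", "Switch 100", "Switch 101", "ESX VMWare",
]
MAX_BRIDGES = 7  # the limit quoted earlier in the thread, not verified here


def count_bridges(path, count_virtual_switch=False):
    """Count physical switches on the path, optionally adding the ESX vSwitch."""
    bridges = sum(1 for hop in path if hop.lower().startswith("switch"))
    if count_virtual_switch:
        bridges += sum(1 for hop in path if "esx" in hop.lower())
    return bridges


if __name__ == "__main__":
    for flag in (False, True):
        n = count_bridges(PATH, count_virtual_switch=flag)
        status = "over" if n > MAX_BRIDGES else "within"
        print(f"vSwitch counted={flag}: {n} bridges, {status} a limit of {MAX_BRIDGES}")
```

Counting only the physical switches puts this path exactly at 7; counting the ESX host as a bridge pushes it to 8.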

Michael Marshall_5 · ‎05-15-2015

VMware probably doesn't do anything with STP, so it can probably be ignored. HOWEVER, the port(s) connected to the ESX hosts might have been set for portfast (HP calls it something different IIRC). Portfast essentially allows the port to continue forwarding in the event of an STP invocation. The issue with this is that you should never allow a switch with portfast-configured ports to become a CST/MSTP root.

I'd have a good read of the Multiple Spanning Tree section of the 2810 manual and cross check what the config values set in the switches are.

I'd check if any ports are set for portfast, and also check which switches are preferred for the spanning tree and IST roots. It's been several years since I worked with HP switches, but I do remember when configuring STP that I had to read the manual section over and over and over to properly understand the concepts!
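When checking which switch is preferred as root, it helps to remember that spanning tree picks the bridge with the lowest bridge ID: lowest priority first, then lowest MAC address as the tie-breaker. The sketch below applies that rule to the two bridge IDs that appear in the "CST Root changed" syslog line earlier in the thread. It assumes those IDs are written as `priority:MAC`; the function names are made up for illustration.

```python
def parse_bridge_id(text):
    """Split 'priority:MAC' (as seen in the syslog) into a sortable tuple."""
    priority, mac = text.split(":", 1)
    return int(priority), mac.replace("-", "").lower()


def elect_root(bridge_ids):
    """Return the bridge ID that wins the election (lowest priority, then lowest MAC)."""
    return min(bridge_ids, key=parse_bridge_id)


if __name__ == "__main__":
    # Both IDs are copied from the "CST Root changed" log line.
    candidates = ["32768:0017a4-0a76c0", "0:001c2e-16f4c0"]
    print("root:", elect_root(candidates))
```

With both bridges reachable, the priority-0 bridge wins. A switch announcing a priority-32768 root, as 10.110.138.128 did, suggests it had stopped hearing BPDUs from the intended root at that moment.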

http://ftp.hp.com/pub/networking/software/2810-MgmtCfg-July2007-59914732.pdf p370 for STP troubleshooting

http://ftp.hp.com/pub/networking/software/2810-AdvTrafficMgmt-July2007-59914733.pdf p101 onwards for STP config options.

The manual suggests the hop limit is 20 for the 2810 however if you have other switches involved then they may have a different limit. If you're really unlucky then everything's using original style STP and not RSTP/MSTP.....

Have fun :-)

kevets · ‎05-15-2015

Thanks.

It was fun for a while - a little mystery. There's been some benefit, in forcing us to clean up and better document our environment.

But it stopped being fun about 8 weeks ago. It's just a drag now, and a daily hassle.

I'm even questioning if I need spanning tree at all. I have no multi-path links. I am attaching a jpeg of my network.

The big boys in my main chain are 6200s, and I'm thinking I should use Layer 3 routing with these.

I think the HP equivalent of portfast is admin-edge-port and bpdu-protect. I have that set on all of my non-uplink ports. I will actually need to verify how the ESX machines are configured, so thanks for that pointer.

Vince-Whirlwind · ‎05-16-2015

I think you are wanting to blame STP, and you shouldn't. STP doesn't cause issues, although if it isn't properly configured, it can aggravate them, hence Michael's advice that you look at your STP priorities and ensure they are properly planned out:

101 needs to be 0
100 needs to be 1
102 103 105 128 130 can be 4
134 109 can be 5
112 can be 6
116 can be 7

I guess the way I look at it is I set a switch's STP priority higher by 1 for each layer2 hop away from the STP root it is.
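That rule of thumb (one priority step per layer-2 hop from the root) can be computed with a breadth-first walk over the switch links. This is a sketch only. The links below are just the single chain kevets listed as his furthest path, not the full diagram, so the numbers it prints will not match the list above. Treating the values as the 0-15 priority steps that ProCurve MSTP uses is an assumption on my part, and the names `LINKS` and `priority_steps` are made up for illustration.

```python
from collections import deque

# Only the chain from the "furthest path" post; the real topology has more links.
LINKS = [
    ("101", "100"), ("100", "109"), ("109", "112"),
    ("112", "116"), ("116", "165"), ("165", "137"),
]


def priority_steps(links, root, max_step=15):
    """Assign each switch a priority step equal to its hop distance from the root."""
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for peer in sorted(neighbours.get(node, ())):
            if peer not in depth:
                depth[peer] = depth[node] + 1
                queue.append(peer)

    # Cap at max_step (assumed to be the highest configurable step).
    return {switch: min(hops, max_step) for switch, hops in depth.items()}


if __name__ == "__main__":
    for switch, step in sorted(priority_steps(LINKS, "101").items(), key=lambda kv: kv[1]):
        print(f"switch {switch}: priority step {step}")
```

Feeding in the complete link list from the network diagram would give a full plan; the point is only that the root gets the lowest value and each hop outward gets a higher one.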

As for LACP, blocking one member makes perfect sense: the broadcast storm is coming from one place, with the same destination address, so LACP will direct it onto a single member of each trunk it crosses.
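The idea that one flow always lands on one trunk member can be illustrated with a toy hash. This is not the algorithm the switches actually use; it is a sketch that assumes the member link is chosen from the source/destination address pair, with a CRC standing in for whatever hash the hardware applies. The function name and the source MAC are made up; ports 47 and 48 are the Trk4 members mentioned earlier.

```python
import zlib


def pick_member(src_mac, dst_mac, members):
    """Pick a trunk member from the address pair (toy stand-in for the real hash)."""
    key = f"{src_mac.lower()}>{dst_mac.lower()}".encode()
    return members[zlib.crc32(key) % len(members)]


if __name__ == "__main__":
    trunk = ["47", "48"]  # Trk4 members as described in the thread
    storm_source = "00:11:22:33:44:55"  # hypothetical offending device
    broadcast = "ff:ff:ff:ff:ff:ff"

    # Every frame of the storm has the same address pair, so it hashes the same way.
    chosen = {pick_member(storm_source, broadcast, trunk) for _ in range(1000)}
    print("members carrying the storm:", chosen)
```

Because the address pair never changes, all of the storm traffic ends up on one member while the other stays comparatively quiet.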

Why don't you configure a broadcast limit of about 3% on each and every interface that is part of any inter-switch link?
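For a sense of scale, here is what a 3% limit works out to. This is back-of-the-envelope arithmetic only: it assumes the limit is a percentage of line rate, a 1 Gbit/s inter-switch link, and minimum-size 64-byte frames plus the 20 bytes of preamble and inter-frame gap each frame occupies on the wire. How the switch actually measures the limit is not stated in this thread.

```python
def broadcast_budget(link_bps, percent, frame_bytes=64):
    """Return (allowed bits/s, approx. frames/s) for a broadcast limit.

    Each frame is charged 20 extra bytes on the wire:
    8 bytes of preamble plus a 12-byte inter-frame gap.
    """
    wire_bits = (frame_bytes + 20) * 8
    allowed_bps = link_bps * percent / 100
    return allowed_bps, int(allowed_bps // wire_bits)


if __name__ == "__main__":
    bps, fps = broadcast_budget(1_000_000_000, 3)  # assumed 1 Gbit/s uplink
    print(f"3% of 1 Gbit/s = {bps / 1e6:.0f} Mbit/s, about {fps} minimum-size frames/s")
```

Even at 3%, that is still tens of thousands of small broadcast frames per second on a gigabit link, so the limit contains a storm rather than eliminating its effect on the edge switches.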

Also, on all your 2810 and 3500s, configure loop protect on all Access ports.

As far as your VMWare environment goes - this is a very likely candidate, because with VMWare you are dealing with server engineers who are used to point-and-click GUIs and just making up their config as they go along - suddenly they are bodging up network config relating to concepts they generally don't understand: VLANs and inter-VLAN routing.
I know for sure of one very large organisation whose network was taken out for about half a day by server engineers frigging with virtual network VLANs and getting it all wrong.

kevets · ‎05-18-2015

I am not looking to blame STP, I am looking to solve my problem. I don't have server engineers to blame, because as I said, as a small non-profit, it's me and 2 other guys responsible for all of it. We know enough to make it work most of the time, but increased complexity comes with increased ignorance and risk. One of our newer (1 year now) complexities is an Aruba WIFI network.

Here's my latest theory: I have some PCs that have both a wired and a wireless connection (why? Because mistakes were made!). My Private SSID connects through my default VLAN and gets a DHCP assignment. The LAN connection does the same. So I am seeing multiple IPs for the same PC.

I'm going to correct that today, but could this be the problem?

The WIFI:

- hits an AP
- which connects to the Aruba controller over VLAN 114
- and gets a VLAN 1 address assignment
- WIFI traffic then goes from AP to controller into my switch stack. The controller is not doing STP.
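To find the PCs that are on both wired and wireless at once, the DHCP leases can be grouped by hostname. The sketch below assumes the leases can be exported as (hostname, MAC, IP) rows; the export format, the function name and every value in the sample are hypothetical.

```python
from collections import defaultdict


def hosts_with_multiple_ips(leases):
    """Group leases by hostname and keep hosts that hold more than one IP."""
    by_host = defaultdict(set)
    for hostname, mac, ip in leases:
        by_host[hostname.lower()].add((mac.lower(), ip))
    return {
        host: sorted(entries)
        for host, entries in by_host.items()
        if len({ip for _, ip in entries}) > 1
    }


if __name__ == "__main__":
    # Hypothetical lease export; none of these values come from the thread.
    leases = [
        ("ADMIN-PC1", "00:11:22:33:44:55", "10.1.0.50"),  # wired NIC
        ("admin-pc1", "66:77:88:99:aa:bb", "10.1.0.51"),  # wireless NIC
        ("GIFTSHOP-2", "00:11:22:33:44:66", "10.1.0.60"),
    ]
    for host, entries in hosts_with_multiple_ips(leases).items():
        print(host, entries)
```

Any host that shows up with two MACs and two addresses in the same VLAN is a candidate for having one of its connections disabled.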
