08-18-2011 09:09 AM
5406zl spanning tree issues
About 30 months ago I replaced a stack of various 10/100 HP, Extreme and 3Com switches with 5 of the 5406zl switches in my core environment for server connectivity. At the time we were only doing single-NIC connections.
I have replaced this equipment with warranty replacements several times, to the tune of about 12 different pieces of hardware. I started with my original 6, replaced 1 for a VLAN tagging issue, replaced 2 for inability to upload firmware, received 2 as replacements, and sent one of the replacements back DOA. I installed 1 in the environment, and when I connected anything to it, it took down my entire environment for about 20-30 minutes. I expected seconds for spanning tree re-convergence. I am afraid to plug anything into this switch while it is connected to my production environment. I cannot reproduce this problem when it is a standalone device.
We brought in a new server guy and he wanted to establish 5 connections per server: dual NICs with each NIC pair "teamed", and 1 connection for management (IMM).
Then he started to implement clustered servers, working towards a disaster recovery site with "GEO-Clustering" (clusters across the WAN). Then he implemented virtual servers as well. We started to experience loss of connectivity to the servers. The Exchange cluster would drop offline, my Citrix clusters would become an "unavailable resource", and users could attach only some of the time. I am also using an Avaya VoIP system. The phone system and voicemail servers appear to be impacted, but the telephones do not.
When I looked at the log files I saw multiple instances of spanning tree shutting down a port, then re-enabling it, and then at some undetermined time it would happen again. This was happening across my entire server environment at no regular interval. Sometimes I would go 8-12 hours with no events, then several hours of nothing but events, and then back to nothing. A very unstable environment.
During our troubleshooting process we discovered that if we removed one of the "teams", the system would come back online and stay online. In other words, we removed the multiple paths. I found a reference to auto-edge-port (Cisco PortFast), so I made all of the ports except the trunk ports edge ports, so that they host servers and do NOT participate in the spanning tree environment.
sw-1 is the head-end switch, which connects to my router.
sw-1 connects via a trunk pair to sw-2 and via a trunk pair to sw-4.
sw-2, sw-3, sw-4 and sw-5 all have trunk pairs (2 ports) to each of the other switches, spread across the blades so that no single blade failure will take down a trunk: a full mesh.
sw-2 has a pair that connects to sw-3, a pair to sw-4, a pair to sw-5
sw-3 has a pair which connects to sw-4, a pair to sw-5
sw-4 has a pair which connects to sw-5
This configuration should give me a full mesh environment.
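To make the link layout above easier to check, here is a minimal Python sketch that writes the trunk pairs down as a list and tests the "full mesh" claim. It assumes each 2-port trunk pair behaves as one logical link, and the helper names (`is_full_mesh`, `redundant_link_count`) are made up for this illustration. It only models the cabling as described, not what the switches actually do.

```python
from itertools import combinations

# Logical links as described in the post. Each entry stands for one
# 2-port trunk pair, treated here as a single link (an assumption).
links = {
    frozenset(pair)
    for pair in [
        ("sw-1", "sw-2"), ("sw-1", "sw-4"),
        ("sw-2", "sw-3"), ("sw-2", "sw-4"), ("sw-2", "sw-5"),
        ("sw-3", "sw-4"), ("sw-3", "sw-5"),
        ("sw-4", "sw-5"),
    ]
}

switches = sorted({sw for link in links for sw in link})
mesh_members = ["sw-2", "sw-3", "sw-4", "sw-5"]


def is_full_mesh(members, links):
    """True if every pair of members has a direct link."""
    return all(frozenset(pair) in links for pair in combinations(members, 2))


def redundant_link_count(switches, links):
    """Links beyond the n - 1 needed for a loop-free tree (connected topology assumed)."""
    return len(links) - (len(switches) - 1)


print(is_full_mesh(mesh_members, links))      # True
print(is_full_mesh(switches, links))          # False: sw-1 only reaches sw-2 and sw-4
print(redundant_link_count(switches, links))  # 4
```

Under those assumptions sw-2 through sw-5 are fully meshed, sw-1 is dual-homed rather than part of the mesh, and 4 of the 8 logical links are redundant paths that spanning tree has to keep blocked at any given time.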
I have configured Multiple Rapid Spanning Tree instances on sw-2 and sw-4, with my VLANs built on both switches alternating Owner and Backup: half are owned by MST1 and the other half by MST2, with Backup defined accordingly.
sw-2 and sw-4 have VRRP configured so that if either switch drops dead for whatever reason I still have a route to the head-end switch, sw-1.
After all of this we were still experiencing server connectivity issues.
Last Tuesday I came in bright and early, before the production day started, and removed all of the redundant links except the pairs to the head end, which now has two "stacks" connected to it:
sw-3 is connected to sw-2 is connected to sw-1
sw-5 is connected to sw-4 is connected to sw-1
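As a quick sanity check on the two chains above, the following Python sketch walks the remaining links and reports any link that would close a loop. It assumes each trunk pair counts as one logical link; `find_loop_links` is a hypothetical helper written for this illustration (a small union-find), not anything the switches run.

```python
def find_loop_links(links):
    """Return the links that would close a loop, in the order given."""
    parent = {}

    def find(node):
        # Follow parent pointers to the root, shortening the path as we go.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    loop_links = []
    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            loop_links.append((a, b))  # both ends already connected: a loop
        else:
            parent[root_a] = root_b
    return loop_links


# The reduced topology: two chains hanging off sw-1.
reduced = [
    ("sw-1", "sw-2"), ("sw-2", "sw-3"),
    ("sw-1", "sw-4"), ("sw-4", "sw-5"),
]

print(find_loop_links(reduced))                       # []
print(find_loop_links(reduced + [("sw-2", "sw-4")]))  # [('sw-2', 'sw-4')]
```

With only these four logical links there is no second path between any two switches, so spanning tree has nothing to block. Putting back even one cross link, such as sw-2 to sw-4, reintroduces a loop.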
My spanning tree issues seem to be resolved and the environment "appears" to have stabilized. Servers are not losing connectivity to other servers in "their" clusters, and end users are not calling the helpdesk reporting "unavailable resource", "slow" email or anything similar to what they have been reporting over the last couple of months.
First question: am I using the right equipment for my "core" switching environment, or do I need to consider using a higher-end switch?
Also, does anybody else out there have a similar environment, and would they be willing to talk to me? If so, I can provide an email address or phone number on request.