ProCurve / ProVision-Based

Major Network Issues

SOLVED
dmcdonough
Occasional Advisor

Over the past week I have been experiencing network issues: overall network slowness, packet drops, and RDP connection drops.

I found that rebooting my main core switch fixes everything for a few hours before the issues return.

I have not made any network or server changes that would explain this.

My core switch is an HP ProCurve layer 3 switch. All other switches are HP ProCurve as well (10 total).

I have worked with HP to check logs, and we have enabled spanning tree on all of my switches.

Even after all of that, the issues keep coming back.

As I understand it, if I had a loop on my network, spanning tree should block the ports where the loop is rather than let it cause problems across the entire network.

So I am assuming that I don't have a loop and that something else is causing havoc on my network.

I have also had Microsoft check DHCP, DNS, and AD replication on all my domain controllers.

Any suggestions?

20 REPLIES
dmcdonough
Occasional Advisor

Re: Major Network Issues

The core switch is an HP ProCurve E5406 zl (J8697A) running firmware version K.16.02.0012.

All other switches are a mix of HP ProCurve 4104gl and 4108gl switches. We updated the firmware on all downstream switches before we enabled spanning tree.

I have a continuous ping running from my workstation to a server on the other side of my core switch. The ping is fine for a few hours, then it starts to drop packets and the round-trip time grows to 400-500 ms, sometimes peaking even higher. This is when the entire network starts to have drops and slowness. IP phone calls between my two buildings also start to have call quality issues. When I reboot the core switch, the ping goes back to normal and everything is OK again for a few hours, but the problem eventually returns.

parnassus
Honored Contributor

Re: Major Network Issues

Hard to say without more evidence (logical/physical network topology, configuration files, logs from the switches involved, relevant hosts and services, switching/routing configuration, etc.).

AFAIK there is a newer K.16.02.0016 software version.

But without correlated logs (especially from during a slowdown) it's really difficult to guess the cause (or causes) of your random network slowness. The culprit might not be the core switch itself; it could be a service or host that is impacting your core switch's switching/routing capacity, in effect a sort of Denial of Service (DoS).

dmcdonough
Occasional Advisor

Re: Major Network Issues

I have a ticket open with HP, and we have been working on this for the past 3 days. We updated all switch firmware and enabled spanning tree across all switches, making the core switch the root with priority 0.

The HP tech recommended keeping the firmware where it is even though there is a newer version, as they reported that the version I have is the most stable, known-good firmware for this model of switch.

The switch is currently under warranty.

I can confirm that, other than turning spanning tree on, no changes have been made on the core switch that could cause these issues.

Software revision K.16.02.0012


Running configuration:

; J8697A Configuration Editor; Created on release #K.16.02.0012
; Ver #0e:01.30.02.34.5f.2c.6b.ff.f7.fc.7f.ff.3f.ef:b5
hostname "CASD-MAIN-01"
module 1 type j9549a
module 2 type j9549a
module 6 type j9538a
trunk A24,B21 trk1 trunk
timesync sntp
sntp unicast
sntp server priority 1 10.0.4.13
time daylight-time-rule continental-us-and-canada
time timezone -300
ip route 0.0.0.0 0.0.0.0 10.0.2.5
ip routing
interface F8
   lacp active
   name "VDI"
   exit
snmp-server enable traps mac-notify
snmp-server enable traps startup-config-change
snmp-server enable traps running-config-change
snmp-server enable traps mac-count-notify
snmp-server contact "D MCDONOUGH" location "TECH OFFICE LEFT E5406"
vlan 1
   name "DEFAULT_VLAN"
   no untagged A2-A6,A13,A15,A20,B2,B6-B11,B13-B14,B16-B18,F4-F5,F7-F8
   untagged A7-A9,A11,A14,B1,B3-B5,B12,B15,Trk1
   tagged A1,A10,A12,A16-A19,A21-A23,B19-B20,B22-B24,F1-F3,F6
   ip address 192.168.11.1 255.255.0.0
   ip helper-address 10.0.4.13
   ip helper-address 10.0.4.20
   exit
vlan 2
   name "VL_2_Firewall"
   untagged A10,A12
   tagged A1,A20,B19
   ip address 10.0.2.2 255.255.255.0
   exit
vlan 4
   name "VLAN_4_SERVERS"
   untagged A2-A6,A15-A19,B2,B6-B8,B16-B18,F3-F5,F7-F8
   ip address 10.0.4.1 255.255.254.0
   ip helper-address 10.0.4.13
   exit
vlan 8
   name "VLAN_8_PRINTERS"
   tagged A1-A6,B20,B22-B24,F5,Trk1
   ip address 10.0.8.1 255.255.252.0
   ip helper-address 10.0.4.13
   exit
vlan 12
   name "VLAN_12_WAPS"
   untagged B11
   tagged A1,A4,A10,A21-A23,B3-B5,B7-B8,B19-B20,B22-B24,F1-F6,Trk1
   ip address 10.0.12.1 255.255.252.0
   ip helper-address 10.0.4.13
   exit
vlan 16
   name "VLAN_16_ADMIN_WIFI"
   tagged A1,A4,A10,A21-A23,B3-B5,B7-B8,B11,B19-B20,B22-B24,F1-F6,Trk1
   ip address 10.0.16.1 255.255.252.0
   ip helper-address 192.168.10.51
   ip helper-address 10.0.4.13
   exit
vlan 20
   name "VLAN_20_STAFF_WIFI"
   tagged A1,A4,A10,A21-A23,B3-B5,B7-B8,B11,B19-B20,B22-B24,F1-F6,Trk1
   ip address 10.0.20.1 255.255.252.0
   ip helper-address 10.0.4.13
   exit
vlan 24
   name "VLAN_24_STUDENT_WIFI"
   tagged A1,A4,A10,A21-A23,B3-B5,B7-B8,B11,B19-B20,B22-B24,F1-F6,Trk1
   ip address 10.0.24.1 255.255.252.0
   ip helper-address 10.0.4.13
   exit
vlan 28
   name "VLAN_28_GUEST_WIFI"
   tagged A1,A4,A10,A21-A23,B3-B5,B7-B8,B11,B19-B20,B22-B24,F1-F6,Trk1
   ip address 10.0.28.1 255.255.252.0
   ip helper-address 10.0.4.13
   exit
vlan 32
   name "VLAN_32_HIGH_SCHOOL"
   tagged A1,B22
   ip address 10.0.32.1 255.255.252.0
   ip helper-address 10.0.4.13
   ip helper-address 10.0.4.20
   exit
vlan 36
   name "VLAN_36_ELEMENTARY_SCHOOL"
   ip address 10.0.36.1 255.255.252.0
   ip helper-address 10.0.4.13
   exit
vlan 40
   name "VLAN_40_ADMIN_OFFICE"
   untagged A13
   tagged A1
   ip address 10.0.40.1 255.255.252.0
   ip helper-address 10.0.4.13
   ip helper-address 10.0.4.20
   exit
vlan 44
   name "PRESSBOX"
   tagged Trk1
   ip address 10.0.44.1 255.255.252.0
   ip helper-address 10.0.4.13
   exit
vlan 48
   name "VLAN_48_HVAC"
   tagged B20
   ip address 10.0.48.1 255.255.252.0
   ip helper-address 10.0.4.13
   exit
vlan 52
   name "VLAN_52_SECURITY_CAMS"
   tagged B19-B20,B22-B24,Trk1
   ip address 10.0.52.1 255.255.252.0
   ip helper-address 10.0.4.13
   exit
vlan 56
   name "VLAN_56_FOB_DOOR_ACCESS"
   tagged B20
   ip address 10.0.56.1 255.255.252.0
   ip helper-address 10.0.4.13
   exit
vlan 60
   name "VLAN_60_PHONE_SYSTEM"
   untagged B13-B14
   tagged A15-A19,F3,Trk1
   ip address 10.0.60.1 255.255.252.0
   qos dscp 46
   voice
   exit
vlan 64
   name "CAFE_POS"
   tagged A1,A4,A10,A15-A19,A21-A23,B3-B5,B7-B8,B19-B20,B22-B24,F1-F6,Trk1
   ip address 10.0.64.1 255.255.252.0
   ip helper-address 10.0.4.13
   ip helper-address 10.0.4.20
   exit
vlan 100
   name "FRIENDSHIPHOUSE"
   tagged A1,A4,A10,A21-A23,B3-B5,B7-B8,B19-B20,B22-B24,F1-F6,Trk1
   ip address 10.0.100.1 255.255.252.0
   ip helper-address 10.0.4.13
   exit
vlan 150
   name "VDI_MNGT"
   untagged B10
   tagged B9,F8
   ip address 10.0.150.1 255.255.252.0
   ip helper-address 10.0.4.13
   exit
vlan 160
   name "VDI_DESKTOPS"
   tagged B9,F8
   ip address 10.0.160.1 255.255.252.0
   ip helper-address 10.0.4.13
   exit
vlan 200
   name "VLAN_200_FIREWALL"
   no ip address
   ip helper-address 10.0.4.13
   exit
spanning-tree
spanning-tree Trk1 priority 4
no spanning-tree bpdu-throttle
spanning-tree config-name "CPSD"
spanning-tree config-revision 1
spanning-tree pathcost mstp 8021d
spanning-tree priority 0 force-version rstp-operation
no autorun
no dhcp config-file-update
no dhcp image-file-update
device-profile name "default-ap-profile"
   cos 0
   exit
password manager

 

parnassus
Honored Contributor
Solution

Re: Major Network Issues

If I were you I would give K.16.02.0016 a try (or, if you don't feel brave enough [*], at least K.16.02.0014): starting with the K.16.02.0014 software build, the TCP Push Preserve mode [**] parameter is disabled by default (it was previously enabled by default).

Having TCP Push Preserve mode disabled can help on very busy networks (very busy interfaces), so at least check its status. Very busy interfaces are easy to end up with once you have many concurrent 10Gbps (up)links to servers and/or other switches. If I'm correct, your F slot holds an HPE 8-port 10GbE SFP+ v2 zl Module (J9538A), so a look at the traffic on ports F1-F8 would help (also keep an eye on oversubscription [***] of the 10Gbps ports).

[*] IMHO choosing between K.16.02.0012, 0014 and 0016 is not particularly relevant to the "most stable software" argument: in all cases you're using the latest K.16.02 software branch (and not a maintenance one).

[**] From latest Release Notes:

"The TCP Push Preserve mode determines the queuing of the TCP packets that have the PUSH flag set. When this mode is enabled and the egress queue is full, TCP packets with the PUSH flag set are queued at the head of the ingress queue for egress queue space. This may delay subsequent incoming packets in the same queue and create a head-of-line blocking situation. When this mode is disabled and the egress queue is full, TCP packets with the PUSH flag set are dropped from the head of the ingress queue. If the current switch TCP Push Preserve mode has been set to DISABLED, it will be preserved as DISABLED and the corresponding configuration entries will be suppressed. If the current switch TCP PUSH preserve mode has been set to ENABLED, it will be changed to DISABLED and the change will be noted in system event logs as The tcp-push-preserve feature was disabled. This is a change to default configuration. The CLI command show tcp-push-preserve indicates the status of TCP push mode ENABLED/DISABLED. CLI command [no] tcp-push-preserve changes the status of TCP push mode."

[***] read this (could be of help, in any case).
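
For reference, the two commands named in the release-note excerpt can be tried as in the sketch below. Only show tcp-push-preserve and [no] tcp-push-preserve come from the release notes; the configure and write memory steps are assumed here as the usual ProVision way to enter config context and save, and the lines starting with ; are annotations, not input.

```
; check the current mode (reports ENABLED or DISABLED)
show tcp-push-preserve

; disable it if it is still enabled, then save the change
configure
no tcp-push-preserve
write memory
```

On K.16.02.0014 and later the mode should already show as disabled, per the same release notes.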

dmcdonough
Occasional Advisor

Re: Major Network Issues

Thanks, I am going to give this a shot.

Quick question. After looking at some other forums, I have read that even with spanning tree turned on, it will not block a loop on a small unmanaged 5-port switch. I have a bunch of small unmanaged 5-port switches in my environment, but all of them are uplinked to one of my managed HP switches, which are now configured with spanning tree.

If this is true, then I should take a walk around and see if I can find any loops on these small unmanaged switches.

Thanks

parnassus
Honored Contributor

Re: Major Network Issues

What about:

trunk A24,B21 trk1 trunk

That's a static (non-protocol) port trunk with port A24 (Slot A module) and port B21 (Slot B module) as members. Where does it go, and why static (i.e. not LACP, IEEE 802.3ad)?

Then (I'm a little bit puzzled here):

interface F8
   lacp active
   name "VDI"
   exit

This doesn't make much sense: a single port set to LACP Active with no trunk assignment (I mean: no LAG). What's the purpose of setting LACP Active on a single port (port 8 of the module in Slot F) when LACP is meant for port trunking (LAG) configurations?

I guess you're using a single 10Gbps port (F8) to connect to a virtual machine on a hypervisor, or to an OS installed directly on a bare-metal server (in both cases through a physical 10Gbps NIC port on a server host), to provide VDI (Virtual Desktop Infrastructure) connectivity. Is that right?

16again
Respected Contributor

Re: Major Network Issues

Is the switch ARP table growing out of bounds?

AFAIK, creating a loop on an unmanaged switch should also trigger STP, since the BPDUs get looped back as well.

dmcdonough
Occasional Advisor

Re: Major Network Issues

I have two 1 Gb fiber connections between the two buildings. I have combined them in Trk1 to improve the inter-building connection. I had the HP tech check Trk1 and they verified that it looked fine.

The VDI port F8 is a 10 Gb connection to my Dell VDI switch, which connects to my VDI host hypervisor cluster. I set up LACP now because I will soon be adding a second 10 Gb connection when I add a second Dell VDI switch for redundancy (per the Dell tech who helped configure the Dell VDI side of my network).

So right now I only have F8, but I will be adding a second 10 Gb connection to the mix very soon, hence the LACP config on F8.

I have 100 thin clients out on my network, mostly in my second building on my production network. They connect over Trk1 and then go through the F8 VDI link to my VDI environment, where my virtual desktops reside.

dmcdonough
Occasional Advisor

Re: Major Network Issues

That's what I expected: that spanning tree would also block loops on downstream unmanaged switches.

Thanks for clarifying.

Where would I go and what command on the switch would I use to check if my ARP table is growing out of bounds?

Would I see any error in my switch log?

Thanks
parnassus
Honored Contributor

Re: Major Network Issues

Start with the show arp and show mac-address commands.
dmcdonough
Occasional Advisor

Re: Major Network Issues

I ran show arp.

I see the normal ARP table. What would I see in this output if anything was out of bounds?

I see IP address, MAC address, type, and port.

All seems normal.

parnassus
Honored Contributor

Re: Major Network Issues


16again wrote: Is switch ARP table growing out of bounds?

Would out of bounds mean over 64k entries (considering that the HPE 5400zl Switch Series' MAC table should hold up to 64k entries)?

Vince-Whirlwind
Honored Contributor

Re: Major Network Issues

STP doesn't detect remote loops on unmanaged switches unless you are actually sending BPDUs out of Access ports AND the unmanaged switch is passing them. This is what "loop-protect" is for, you put it on all Access ports to protect you from unmanaged switch loops.

The way I very quickly solve these kinds of issues is by gathering switchport statistics. I use Solarwinds Performance Monitor.
I fire it up, point it at all the switches, adding all switchports as monitored objects and half an hour later I will see which switchports have dodgy-looking traffic profiles.
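
A minimal sketch of what enabling loop protection on access ports might look like on a ProVision switch. The port range B1-B24 and the 300-second timer are placeholders for illustration, not values from this thread, and it is worth confirming that a given model and firmware (the older 4100gl series in particular) accepts these commands. Lines starting with ; are annotations.

```
configure
; enable loop protection on the access (edge) ports only, not on uplinks
loop-protect B1-B24
; re-enable a port automatically 300 seconds after it was disabled
loop-protect disable-timer 300
exit

; list protected ports and any that have detected a loop
show loop-protect
```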

dmcdonough
Occasional Advisor

Re: Major Network Issues

Think I have finally made some progress. 

I believe the issue is with a downstream switch on the other end of the fiber connection between my buildings (Trk1). I found a replacement switch and moved the phone system in my other building onto it. After waiting a few hours and checking, the new switch and the phone system connections are holding steady, while the suspect switch is still showing packet drops.

Still waiting for a level 2 HP engineer to call.

I am going to see how inter building phone communication is tomorrow and see if the new switch works out. If all looks good I will transfer everything else over to the new switch.

Once HP calls I will work with them to replace the possibly bad switch.

I'll keep you posted on the final outcome.

dmcdonough
Occasional Advisor

Re: Major Network Issues

Thanks Parnassus,

The HP level 2 engineer didn't suggest updating the firmware; however, he did ask me to run the no tcp-push-preserve command on my core and on any edge switch that would accept it. Once I did this, the network settled right down.

We are still troubleshooting a few other minor things on my network.

I have a ping program running a constant ping to my core and to the 5 network switches in my second building. The buildings are connected by a 1 Gb fiber connection/transceiver.

I am still getting some random packet drops or ping timeouts on 1-2 of the switches in my second building. I updated the firmware on all of my other switches.

Other than that, my phones are good; however, I am still seeing high round trip delay errors even though no one has reported call quality issues. I have an Avaya IP Office in each building, connected over the same 1 Gb fiber trunk.

 

Vince-Whirlwind
Honored Contributor

Re: Major Network Issues

I think you're looking to find a fault in the switches when the fault actually lies in your environment, specifically: 1Gb for uplinks is no good - you are surely seeing congestion on these. The tcp-push-preserve "fix" isn't going to relieve the congestion. Your ping latency is showing that you have congestion.

PCs now have 1Gb NICs as a rule, and these days a PC running something like FTP can easily achieve a throughput of over 500Mb/s, meaning that two PCs running FTP will max out their uplink.

Uplinks need to be 10Gb.

I had a network with 2x 5412 in the core that performed fine with about 40x 10Gb ports on each 5412, uplinking to about 2000 active switchports.

parnassus
Honored Contributor

Re: Major Network Issues

Hello @dmcdonough: glad things are running better than before.

The point is, as @Vince-Whirlwind pointed out, that your actual highway between the buildings (Trk1) has less capacity than the link to your VDI system, so congestion could always be around the corner, especially if the majority of clients sit on the other side of the trunk and keep it constantly loaded. It would be interesting to monitor trunk usage and VDI link usage to see the range of ingress/egress traffic flowing through them.

You should plan an upgrade of your trunk from 2x1Gbps aggregated links to 2x10Gbps aggregated links or, if you can't afford that configuration (the fiber optic cables are physically there already), at least try a single 10Gbps link. In both cases you need SFP+ modules and SFP+ transceivers on both ends of the trunk.
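
To get a first look at how loaded the inter-building trunk and the VDI link are, the built-in counters may be enough before setting up a monitoring tool. This is a sketch using the port names from the posted configuration (Trk1 members A24 and B21, VDI port F8); the exact output columns vary by firmware. Lines starting with ; are annotations.

```
; confirm which ports are members of Trk1 and the trunk type
show trunks

; per-port Rx/Tx rate and utilization snapshot
show interfaces port-utilization

; detailed counters (drops, errors) for the trunk members and the VDI port
show interfaces A24
show interfaces B21
show interfaces F8
```

Running these during a slowdown and again when the network is healthy gives a basis for comparison.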

parnassus
Honored Contributor

Re: Major Network Issues


dmcdonough wrote: ...however i am sitll showing high round trip delay errors...

 


Can you quantify the (average) delay? A high round-trip value isn't necessarily an error. Also, having a fat pipe doesn't necessarily mean that TCP data transfers will be faster (this article could be interesting reading).

What is the fiber optic trunk's latency (below or around a millisecond, I suppose)?

Jcckmc
Occasional Visitor

Re: Major Network Issues

What I have found is that merely setting up spanning tree does not prevent unmanaged switches from looping the HP switch they are connected to. When a loop appears on an unmanaged switch, either spanning tree on the core switch shuts off the uplink to the edge switch, or the edge switch CPU is overrun and the switch cannot be accessed remotely. At that point you have to console into the edge switch to identify the port that has the looped unmanaged switch attached. What we did to combat this is turn on BPDU protection on the station-end ports that connect to unmanaged switches. This setting shuts off the port connected to the looped unmanaged switch without affecting the entire edge switch. Since we implemented this we have not had any more switch-level outages caused by loops on unmanaged switches.
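
For illustration, BPDU protection on ProVision is configured per port, roughly as below. The port range A1-A20 and the 300-second timeout are placeholders, not values from this thread. Note that BPDU protection acts only when a BPDU arrives on the protected port, so it catches a downstream loop only if the unmanaged switch passes BPDUs back. Lines starting with ; are annotations.

```
configure
; disable any listed port that receives a BPDU
spanning-tree A1-A20 bpdu-protection
; re-enable a disabled port automatically after 300 seconds
spanning-tree bpdu-protection-timeout 300
exit

; show protected ports and their state
show spanning-tree bpdu-protection
```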

Vince-Whirlwind
Honored Contributor

Re: Major Network Issues

You should use Loop-protect for that.
BPDU protection is intended to guard against devices outside your own STP topology sending BPDUs into one of your switches and affecting your STP topology.