Switches, Hubs, and Modems

Network problem with 4000M Procurve switches

Alex_202
Occasional Visitor

I have a major problem. We have 14 4000M Procurve switches interconnected by fiber links. Every day for the past 2 years, one of the switches connecting users in a lab has gone down (for example, when 10 or more users download a 90 MB file from a server that is not directly connected to the switch), and after about 3 minutes it brings down the entire network. All of a sudden every port shows a solid green light. After a reset the network works for a while. I tried HP support. We enabled STP, but no luck. I check the port counters every day and do not see any errors. I don't know what else to try.

Thanx for any help
9 REPLIES
suniram
Occasional Visitor

Re: Network problem with 4000M Procurve switches

Hello Alex,
Your problem description is OK, but there is not enough data.
- topology of the network?
- firmware version of the switches?
- you say that one of the switches goes down. Is that always the same switch, or will it be any switch?
- why did you activate STP? Are there loops in the network?
- Is your network meshed?
- when multiple users are getting large files from that server, and both server and users are on the same switch, does a switch still crash?

I would start by breaking the uplinks between the switches and troubleshooting on one switch only. If the problem disappears in that setup, I would slowly connect more and more switches until the problem can be reproduced and further isolated. If you have the problem every day, as you say, I believe it is worth the investment in time and inconvenience.

It can be anything from a software glitch to a faulty module/transceiver or misconfiguration.

I am sure I can think of a few more questions :-) I strongly advise you to contact the ECCC in Amsterdam (http://www.hp.com/rnd/assistance/assistance.htm)
Alex_202
Occasional Visitor

Re: Network problem with 4000M Procurve switches

Thanx a lot for the reply. Here is the topology that I have. The Arizona switch goes down first and brings the entire high school wing down in about 3 min, and in about 20 min the entire network goes down. All switches have the latest firmware installed and default STP settings enabled. No errors on any ports. We tried replacing the cards on the Arizona switch, and the switch itself with a brand new one, without any luck. As I mentioned before, the problem occurs when users connected to the Arizona switch try to download a large file at the same time, or when we try to image workstations connected to that switch from a server that is connected to a switch in the server room.
Stuart Teo
Trusted Contributor

Re: Network problem with 4000M Procurve switches

I'm not sure why you designed your network this way but I wouldn't do it the same way as you did.

There are too many hops for some packets. E.g. packets originating from Sea Princess would take 7 hops to get into Arizona.

Please refer to the 5-4-3 rule as explained by http://www.webopedia.com/TERM/5/5_4_3_rule.html
If a problem can be fixed, there's nothing to worry. If a problem can't be fixed, worrying ain't gonna help. Bottom line: don't worry.
Alex_202
Occasional Visitor

Re: Network problem with 4000M Procurve switches

That network was designed long before my time. I just happen to be the one who has to resolve the network issue.

Thanx for pointing me to the article about the topology, but it says that the 5-4-3 rule does not apply to switched networks. Please correct me if I am wrong.

Thanx for your help
Ron Kinner
Honored Contributor

Re: Network problem with 4000M Procurve switches

What I think is happening is that the heavy traffic overloads one of the links and some replies do not get back to one of the switches. The switch will normally flush its ARP cache of any entry it hasn't seen within the last x minutes, and because of the heavy traffic some of the entries do not get refreshed and fall out of the table. When this happens and the switch gets a packet destined for one of the MAC addresses that is no longer in the table, it has no choice but to send the packet out of each of its ports. This adds even more load to the links, causes other ARP entries not to be renewed, and eventually brings everything to a screeching halt. I think I remember that the default timeout is on the order of 3 minutes. Try setting the ARP cache timeout on each switch to 30 minutes and see if that helps. I'm on a slow dial-up link or I'd look up the command for you. If you have any trouble finding it, post back here and I will look it up for you on Monday when I get back to work.
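
The aging-and-flooding behaviour described above can be illustrated with a toy model. This is only a sketch of a generic learning switch, not of the 4000M firmware: the class name, the `replay` helper, the MAC strings "aa" and "bb", and the 180-second and 1800-second limits (standing in for the "3 minutes" and "30 minutes" in the post) are all assumptions made for illustration. The table it models is the one a switch keys by MAC address.

```python
from typing import Dict, List, Tuple


class LearningSwitch:
    """Toy layer-2 switch: learns source MACs, ages them out, floods unknowns."""

    def __init__(self, ports: List[int], age_limit_s: float) -> None:
        self.ports = list(ports)
        self.age_limit_s = age_limit_s
        # MAC address -> (port it was last seen on, time it was last seen)
        self.table: Dict[str, Tuple[int, float]] = {}

    def receive(self, now: float, in_port: int, src: str, dst: str) -> List[int]:
        """Process one frame and return the list of ports it is sent out of."""
        # Drop entries that have not been refreshed within the age limit.
        self.table = {
            mac: (port, seen)
            for mac, (port, seen) in self.table.items()
            if now - seen <= self.age_limit_s
        }
        # Learn (or refresh) the sender's location.
        self.table[src] = (in_port, now)
        entry = self.table.get(dst)
        if entry is None:
            # Unknown destination: flood out of every port except the ingress one.
            return [p for p in self.ports if p != in_port]
        return [entry[0]]


def replay(age_limit_s: float) -> List[List[int]]:
    """Replay the same four frames against a switch with the given age limit."""
    sw = LearningSwitch(ports=[1, 2, 3, 4], age_limit_s=age_limit_s)
    return [
        sw.receive(0, 1, "aa", "bb"),    # bb not learned yet
        sw.receive(1, 2, "bb", "aa"),    # reply teaches the switch where bb is
        sw.receive(2, 1, "aa", "bb"),
        sw.receive(200, 1, "aa", "bb"),  # bb has been silent for 199 s
    ]
```

Calling `replay(180)` floods the last frame to ports 2, 3 and 4, because the entry for "bb" has aged out, while `replay(1800)` still sends it to port 2 only. That is the effect a longer timeout is meant to have; whether it helps on the real network is a separate question.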


Ron
Stuart Teo
Trusted Contributor

Re: Network problem with 4000M Procurve switches

To put it simply, yes. Generally speaking, the 5-4-3 rule was made obsolete by switching technology. But bear in mind that propagation time is a function of media type and distance. That was a good enough reason for me to keep the rule in mind.

A favored design philosophy is to categorize the switches into 3 roles: core, distribution, access. For smaller networks, it is possible to skip the distribution layer.

But back to your specific problem, can you provide us with more information about the switch setup?

What version of firmware are they running? Are they all running the same version? Are any ports configured for meshing? What does the CPU utilization look like? What does the memory utilization look like?

I've seen the all-LEDs-solid-green symptom before, and it has always been related to a loop. You have STP turned on, but was it turned on for all switches?
Ron Kinner
Honored Contributor

Re: Network problem with 4000M Procurve switches

They seem to have removed the command line interface manual from the doc page, but I found in an old post that it was mac-age-time, so I guess it is
set mac-age-time x
(x max = 10000000. I don't know what 30 minutes would be, but if you do "set mac-age-time help" it will probably tell you what the units are.)
You might try it and see.

There is also something called ABC (automatic broadcast control), which you might try enabling. It should limit broadcasts to 30%.

Ron
Alex_202
Occasional Visitor

Re: Network problem with 4000M Procurve switches

I tried increasing the max age time to 30 min on all switches involved, with no luck. I enabled debug logging. Nothing. I spent 2 hrs with HP support on the phone. They recommended using the fiber module in a slot other than I, J or H. No luck. I asked 5 students to copy a few files, and in the middle of it the network went down. One thing I noticed, though, is that when the network connection was lost I was still able to connect via telnet from a PC connected to Arizona, so I figured that maybe Arizona is not the culprit after all. I suspected the switch that Arizona is connected to could be the problem, but I checked that switch and did not find anything abnormal. So I have run out of ideas.

thanx to everyone who responded to this thread
Ron Kinner
Honored Contributor

Re: Network problem with 4000M Procurve switches

Get a sniffer (I like www.snort.org which is actually an intrusion detector but there are a lot of freeware sniffers: tcpdump, ethereal etc.). Next time the problem happens you will at least have an idea of what the traffic looks like. That should help isolate the problem.
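
Once a capture exists, a quick way to get "an idea of what the traffic looks like" is to count how many frames are broadcasts and which stations send the most. The sketch below is one possible approach, not a standard tool: it assumes text lines in the form tcpdump prints when link-level headers are enabled ("source MAC > destination MAC, ..."), and the function name `summarize` is made up for this example. Lines that do not match that shape are skipped.

```python
import re
from collections import Counter
from typing import Iterable, Tuple

# A MAC address in colon notation, or the word tcpdump may print for ff:ff:ff:ff:ff:ff.
MAC = r"(?:[0-9a-f]{2}(?::[0-9a-f]{2}){5}|broadcast)"
FRAME = re.compile(rf"({MAC}) > ({MAC})")
BROADCAST = {"ff:ff:ff:ff:ff:ff", "broadcast"}


def summarize(lines: Iterable[str]) -> Tuple[int, int, Counter]:
    """Return (frames parsed, broadcast frames, frames per source MAC)."""
    total = 0
    broadcasts = 0
    senders: Counter = Counter()
    for line in lines:
        match = FRAME.search(line.lower())
        if match is None:
            continue  # not a line with link-level addresses
        src, dst = match.groups()
        total += 1
        senders[src] += 1
        if dst in BROADCAST:
            broadcasts += 1
    return total, broadcasts, senders
```

A capture saved during an outage could be printed with `tcpdump -e -n -r` followed by the capture file name, and the resulting lines passed to `summarize`. A broadcast share that climbs sharply just before the network dies, or a single source MAC dominating the count, would point toward a loop or a misbehaving station.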

Ron