Switches, Hubs, and Modems

Identifying the source of a broadcast storm

 
Neil_77
Frequent Advisor

Identifying the source of a broadcast storm

Hi,

The other day we had a broadcast storm on our network (HP 4000M and 8000M switches running C.09.22) which basically brought the network grinding to a halt.

I wasn't actually around at the time of the incident but I have been tasked with investigating what actually caused this broadcast storm.

I've begun ploughing through the alarms received on ProCurve Manager at the time, as a starting point.

On all the affected switches in the network there are a number of "high collision or drop rate" alarms, as well as "excessive broadcasts" alarms, indicating there was indeed a broadcast storm.

My problem is that there is plenty of evidence from the alarms to indicate there was a problem, but how can I actually find out the SOURCE of the problem, using the information from the switches?

If I look in the switch event logs, I can also see a number of messages stating "ip: Invalid ARP Source: w.x.y.z on w.x.y.z" where w.x.y.z is the address of the particular switch itself. There are multiple messages of this sort, as well as the occasional "ip: Invalid ARP Target: 0.0.0.0 on w.x.y.z".

It looks like a loop was introduced (Spanning Tree is enabled on all switches, incidentally), but can anyone help me identify what actually caused it?

As far as I'm aware, nothing has been changed or added to the network (at least no one is admitting to it!), so what are the possible causes of this issue?

Any help or troubleshooting tips would be appreciated!

Thanks.
15 REPLIES
Ron Kinner
Honored Contributor

Re: Identifying the source of a broadcast storm

If Spanning Tree was indeed enabled everywhere, then this should not be a loop-caused event unless a few ports are excluded from Spanning Tree or unless Spanning Tree is broken in this software version. You might check whether there are any interfaces where it is disabled for some reason.

Could it have been a virus or worm infection? Our network was brought to its knees when we had just a few PCs infected.

Are your switches synced to a common time source? You might be able to use the timestamps (if they show up in the logs) to narrow it down to the switch where it started.

Could you post a network diagram as an attachment?

Ron
Neil_77
Frequent Advisor

Re: Identifying the source of a broadcast storm

Hi Ron,

thanks for the response.

As far as I'm aware, with these 4000M/8000M switches you enable Spanning Tree for the entire switch; you can't actually disable it on a per-port basis.

You CAN, however, choose whether to have a port in "Norm" or "Fast" mode, Fast mode meaning the port will start forwarding straight away. I did notice there are some ports in Fast status, but I think these are all connected to end nodes rather than to another switch, so they should be OK?

What I have now found out: the theory behind the problem we had here is that perhaps someone was using the network to "Ghost" OS images across (multicasting), and this heavy traffic overloaded the switches. (Apparently this has been a problem in the past; I am new to this company, though, so I don't have that kind of in-depth knowledge of the network yet!) Does this seem feasible?

They actually switched off the machine that could have been carrying out this multicasting and, sure enough, the network returned to normal.

However, no one would admit to having used that PC for multicasting, and there doesn't seem to be any log on the PC itself that would give any indication of what it was doing at the time. It has since been reconnected to the network without any problems.

So, what I really need is to find any means of gathering evidence that would point to this particular device being the cause. Or indeed, if this theory is a red herring, then identifying the true cause.

The switches are not synced to a common time source (though they should be; that's something else I will have to sort out!), so I can only really look at the times the alarms were received by our ProCurve Manager software.

This doesn't help too much, as everything pretty much happened at the same time: a flood of events, with no indication of it starting at any one particular switch (well, not that I can see, anyway!).

Any more suggestions on how to progress with this investigation?

(Of course, the various means of broadcast control that could be implemented will form my recommendations - I'm not too concerned with that aspect, more on how to actually identify the source of this particular problem!)


Les Ligetfalvy
Esteemed Contributor

Re: Identifying the source of a broadcast storm

It does sound like the problem was not caused by a loop, but I think you have little forensic evidence to prove or disprove it. Your only recourse may be to try to reproduce it.

Unfortunately, when it comes to this sort of event, you pretty much have to be there right when it is happening. If the culprit is a single rogue NIC, the PCM+ Top 5 View should clearly show it, provided the network is not so seriously compromised that PCM+ itself cannot communicate with the switches.

Provided you are around to do it, a network sniffer would really be your friend and could carry on if/when PCM+ cannot.

The info that you gather during such a crisis could be limited, since degraded network performance can impede in-band access, so it is important to have a variety of tools that cross-reference MAC addresses to end nodes and their locations. I once had a PC with a rogue NIC taken out of service, only for it to be reconnected to my network months later. Having only the MAC address to go by and an old record in my database reporting its last known location, it was a challenge to locate it, as the entire network was compromised, limiting in-band communication with my switches.
Ron Kinner
Honored Contributor

Re: Identifying the source of a broadcast storm

I don't know how your network is put together, but it is possible for heavy multicast traffic to clog the links. And there is a Ghost multicast mode, so I guess it's a possibility.

What you might do to prevent it is to manually set the link on the Ghost Server to 10 Meg. That would slow it down enough to prevent it from clogging the network.

There are also multicast management possibilities such as IGMP on your 4108s.
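
Roughly, on ProCurve switches with CLI-based firmware those two settings would look something like the lines below. Port A7 and VLAN 1 are just placeholders, and I'm not certain the 4000M/8000M's C.09.x software exposes the same syntax (it may only offer the equivalent options through the menu's port and VLAN screens), so treat this as a sketch:

interface A7
   speed-duplex 10-full
exit
vlan 1
   ip igmp
exit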

Ron
Neil_77
Frequent Advisor

Re: Identifying the source of a broadcast storm

Guys, thanks for the responses.

Les - yes, I think you are right: without actually being there when it happened, it's going to be difficult for me to prove the cause.

Unfortunately, we only have the free version of PCM and not PCM+, so we don't have the traffic analysis capabilities there in any case.

You mentioned tools which reference MAC addresses to end nodes - do you have any (preferably free!) tools that you can recommend? This is something I'd definitely be interested in.

At the moment it's rather a long-winded process for me to tie a MAC address to an end node, so something which makes life easier on that front would be very useful!

Ron - yes, I think we may have to put this one down to Ghost multicasting as the most likely cause. Techniques like IGMP will, of course, form part of my recommendations, and manually setting the link to 10 Meg is a good call as well.

Thanks again.

Re: Identifying the source of a broadcast storm

I agree with the previous poster; you're never going to find out what caused it using only the info logged on the switches.

You might want to hook up a sniffer so you will have something to go on if it happens again.

As to Ghost: it can severely clog up your switches using multicast. Turning on IGMP filtering on the switches will be helpful.

As to your last question: MAC-to-IP mappings can be obtained in several ways:
For the local subnet you can just make a script which pings all nodes and then run arp -a.
Or log in to the router (assuming it is a Cisco) and do 'show arp'.

Or, if you have SNMP enabled on your router, you can get them through SNMP.
I actually built a script once (in Perl) which did exactly that, plus getting a zone transfer from our DNS to resolve the IP addresses to names.
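
Something along these lines would do the ping-and-ARP part. It's just a rough sketch in Python rather than Perl; the subnet is a placeholder, the ping flags shown are the Windows ones, and the format of "arp -a" output differs per OS, so the regex may need adjusting:

# Sketch of the "ping everything, then read the ARP cache" approach.
# Placeholder subnet; ping flags and arp output format vary by OS.
import re
import subprocess

SUBNET = "192.168.1"   # placeholder - replace with your local subnet

# Ping every host once so the OS ARP cache gets populated.
for host in range(1, 255):
    ip = f"{SUBNET}.{host}"
    subprocess.run(["ping", "-n", "1", "-w", "200", ip],   # Windows flags; use "-c", "1" on Linux
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

# Dump the ARP cache and pull out the IP/MAC pairs (Windows-style "arp -a" output).
arp_output = subprocess.run(["arp", "-a"], capture_output=True, text=True).stdout
for ip, mac in re.findall(r"(\d+\.\d+\.\d+\.\d+)\s+([0-9a-fA-F-]{17})", arp_output):
    print(ip, mac.lower())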

HTH,

Marcus Augustus
Neil_77
Frequent Advisor

Re: Identifying the source of a broadcast storm

Hi Marcus,

thanks for the response.

I guess I was a little unclear about what I meant regarding the MAC address mappings (and perhaps I got my wires crossed with what Les was actually talking about).

What I was really after is a tool which will tell me which end node is connected to which port on my switches. As you say, obtaining the IP-to-MAC mappings is fairly straightforward.

And if I log on to a switch, I can see which MAC address is on each port, on an individual-port basis. But what would be really good is an easy way of extracting this information from ALL ports on ALL switches, so that I know every device connected and on which port.
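
To make it concrete, I'm imagining something that walks the standard BRIDGE-MIB forwarding table over SNMP on each switch, roughly like the sketch below. The switch addresses and community string are placeholders, and it assumes the switches have SNMP read access enabled and that Net-SNMP's snmpwalk is installed:

# Rough sketch: walk dot1dTpFdbPort (BRIDGE-MIB) on each switch and print
# which bridge port each learned MAC address was seen on.
# Placeholders: switch addresses and community string.
import subprocess

SWITCHES = ["10.0.0.1", "10.0.0.2"]        # placeholder management addresses
COMMUNITY = "public"                        # placeholder read community
FDB_PORT_OID = "1.3.6.1.2.1.17.4.3.1.2"     # BRIDGE-MIB dot1dTpFdbPort

for switch in SWITCHES:
    out = subprocess.run(
        ["snmpwalk", "-v1", "-c", COMMUNITY, "-On", switch, FDB_PORT_OID],
        capture_output=True, text=True).stdout
    for line in out.splitlines():
        oid, _, value = line.partition(" = ")
        if not value:
            continue
        # The last six OID components encode the learned MAC address;
        # the returned value is the bridge port it was learned on.
        mac = ":".join(f"{int(o):02x}" for o in oid.split(".")[-6:])
        port = value.split(":")[-1].strip()
        print(f"{switch}  port {port}  {mac}")

(The bridge port numbers would still need mapping back to physical ports, e.g. via dot1dBasePortIfIndex, but it illustrates the idea.)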

Any suggestions on that front would be welcome!
Les Ligetfalvy
Esteemed Contributor

Re: Identifying the source of a broadcast storm

Neil,
I have a bit of a dog's breakfast in my arsenal of tools... free, cheap, and very expensive.

I sniff with Ethereal which is free.
I trend with MRTG which is free.
I monitor and notify with WhatsUp which is cheap.
I monitor with PCM+ which is moderately priced.
I monitor and sniff with Fluke INA, WGA, OVC, and PE which cost an arm and a leg.

It would seem that the old joke about only getting two out of three of "Good, Fast, Cheap" probably applies here.

I don't want this to sound like a sales pitch for Fluke, what with this being an HP forum, but I will give you my opinion. PCM+ has some nice features, but it is no Swiss Army knife. It is, however, the primary place I monitor the network from, and I rely on Fluke to fill in the gaps.

I generally rely on the OVC DB to reference a MAC/IP to a particular switch port, but it is weak on tracing the switch route. It has some good reporting as well. It does not, however, do a detailed analysis on first discovery, so it would have no detail on a new device that is behaving badly.

The INA and WGA do the switch-route trace well, but their DB is more volatile. Of the two, the INA, being self-contained, copes better under heavy network pressure. The WGA is cheaper to leave lying around on some distant segment and is not as attractive to the eye or to sticky fingers. It relies on the network to view, however, so unless you have it well partitioned on an isolated management VLAN that is unaffected by the production VLAN, it can let you down in a crunch. The same could be said for PCM+.

The whole concept of partitioning the network with a separate VLAN makes sense, but since not all devices support a management VLAN, it can be challenging to set up. Of course, it is like shopping for a crash helmet: for a $100 head, you would buy a $100 helmet.
Neil_77
Frequent Advisor

Re: Identifying the source of a broadcast storm

Les,

thanks for the info, that's a great help.

Looks like I have a similar armoury to you on the cheap and cheerful front!

I also use Ethereal for network sniffing, and WhatsUp for up/down monitoring.

We've only got the free version of PCM, though; the trial of PCM+ wasn't impressive enough to warrant buying the full version, though some of the features would be nice to have (even if I do think a lot of them still need some "fine-tuning", shall we say!).

MRTG is something I am very interested in setting up here (if I can find the time!) so, be warned, you may very well find me back on these forums asking you some questions about it!

Fluke is something we are very keen on getting here too; I am quite hopeful on that front, as we have heard a lot of good things about it.

Neil