Switches, Hubs, and Modems

Identifying the source of a broadcast storm

Frequent Advisor



The other day we had a broadcast storm on our network (HP 4000M and 8000M switches running C.09.22) which basically ground the network to a halt.

I wasn't actually around at the time of the incident but I have been tasked with investigating what actually caused this broadcast storm.

I've begun ploughing through the alarms received in ProCurve Manager at the time, as a starting point.

On all the affected switches, there are a number of "high collision or drop rate" alarms and also "excessive broadcasts" alarms, indicating there was indeed a broadcast storm.

My problem is this: there is plenty of evidence from the alarms to indicate there was a problem, but how can I actually find out the SOURCE of the problem using the information from the switches?

If I look in the switch event logs, I can also see a number of messages stating "ip: Invalid ARP Source: w.x.y.z on w.x.y.z", where w.x.y.z is the address of the particular switch itself. There are multiple messages of this sort, as well as the occasional "ip: Invalid ARP Target: on w.x.y.z".

It looks like a loop was introduced (Spanning Tree is enabled on all switches, incidentally), but can anyone help me identify what actually caused the loop?

As far as I'm aware, nothing has been changed or added to the network (at least no one is admitting to it!), so what are the possible causes of this issue?

Any help or troubleshooting tips would be appreciated!

Honored Contributor

Re: Identifying the source of a broadcast storm

If Spanning Tree was indeed enabled everywhere, then this should not be a loop-caused event unless there are a few ports not running Spanning Tree, or unless Spanning Tree is broken in this software version. You might check whether there are any interfaces where it is disabled for some reason.

Could it have been a virus or worm infection? Our network was brought to its knees when we had just a few PCs infected.

Are your switches synced to a common time source? You might be able to use the times (if they show up in the logs) to narrow it down to the switch where it started.

Could you post a network diagram as an attachment?

Frequent Advisor

Re: Identifying the source of a broadcast storm

Hi Ron,

thanks for the response.

As far as I'm aware, with these 4000M/8000M switches you enable Spanning Tree on the entire switch; you can't actually disable it on a per-port basis.

You CAN, however, choose whether to have a port in "Norm" or "Fast" mode - Fast mode meaning it will start forwarding straight away. I did notice there are some ports in Fast status, but I think these are all connected to end nodes rather than to another switch, so they should be OK?

What I have now found out: the theory behind the problem we had here is that perhaps someone was using the network to "Ghost" OS images across (multicasting), and this heavy traffic overloaded the switches. (Apparently this has been a problem in the past - I am new to this company, though, so don't have that kind of in-depth knowledge of the network yet!) Does this seem feasible?

They actually switched off the machine that could have been carrying out this multicasting and, sure enough, the network returned to normal.

However, no one would admit to having used that PC for multicasting, and there doesn't seem to be any log on the PC itself that would give any indication of what it was doing at the time. It has since been reconnected to the network without any problems.

So, what I really need is some means of gathering evidence that would point to this particular device being the cause. Or indeed, if this theory is a red herring, identifying the true cause.

The switches are not synced to a common time source (though they should be - that's something else I will have to sort out!), so I can only really look at the times the alarms were received by our ProCurve Manager software.

This doesn't help too much, as everything pretty much happened at the same time - a flood of events, with no indication of it starting at any one particular switch (well, not that I can see, anyway!).

Any more suggestions on how to progress with this investigation?

(Of course, the various means of broadcast control that could be implemented will form my recommendations - I'm not too concerned with that aspect, more on how to actually identify the source of this particular problem!)

Esteemed Contributor

Re: Identifying the source of a broadcast storm

It does sound like the problem was not caused by a loop, but I think you have little forensic evidence to prove or disprove it. Your only recourse may be to try to reproduce it.

Unfortunately, when it comes to these sorts of events, you pretty much have to be there right when it is happening. If the culprit is a single rogue NIC, the PCM+ Top 5 view should clearly show it, provided the network is not so seriously compromised that PCM+ itself cannot communicate with the switches.

Provided you are around to do it, a network sniffer would really be your friend and could carry on if/when PCM+ cannot.

The info that you can gather during such a crisis may be limited, since the degraded network performance can impede in-band access, so it is important to have a variety of tools that cross-reference MAC addresses to end nodes and their locations. I once had a PC with a rogue NIC taken out of service, only for it to be reconnected to my network months later. With only the MAC address to go by, and an old record in my database reporting its last known location, it was a challenge to locate it, as the entire network was compromised, limiting in-band communication with my switches.

Honored Contributor

Re: Identifying the source of a broadcast storm

I don't know how your network is put together, but it could be possible for heavy multicast to clog the links. And there is a Ghost multicast mode, so I guess it's a possibility.

What you might do to prevent it is to manually set the link to the Ghost server to 10 Meg. That would slow it down enough to prevent it from clogging the network.
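On ProCurve switches that have the CLI, a fixed port speed can be set with a one-line command; a sketch of the form (the port ID and prompt are placeholders, and older 4000M firmware may expose this only through the menu interface):

```
ProCurve(config)# interface A1 speed-duplex 10-half
```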

There are also multicast management possibilities, such as IGMP, on your switches.

Frequent Advisor

Re: Identifying the source of a broadcast storm

Guys, thanks for the responses.

Les - yes, I think you are right; without actually being there when it happened, it's going to be difficult for me to prove the cause.

Unfortunately, we only have the free version of PCM and not PCM+, so we do not have the traffic analysis capabilities there in any case.

You mentioned tools which reference MAC addresses to end nodes - do you have any (preferably free!) tools that you can recommend? This is something I'd definitely be interested in.

At the moment, it's rather a long-winded process for me to tie up a MAC address to its end node, so something which makes life easier on that front would be very useful!

Ron - yes, I think Ghost multicasting may be what we have to put this one down to as the most likely cause. Techniques like IGMP will, of course, form part of my recommendations. And manually setting the link to 10 Meg is a good call also.

Thanks again.

Re: Identifying the source of a broadcast storm

I agree with the previous poster; you're never going to find out what caused it using only the info logged at the switch.

You might want to hook up a sniffer so you will have something to go on if it happens again.

As to Ghost: it can severely clog up your switches using multicast. Turning on IGMP filtering on the switches will be helpful.
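For reference, on ProCurve switches with the CLI, IGMP is enabled per VLAN; a sketch of the form (the VLAN ID and prompt are placeholders, and on the 4000M/8000M this may instead be configured through the menu interface, depending on firmware):

```
ProCurve(config)# vlan 10
ProCurve(vlan-10)# ip igmp
```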

As to your last question: MAC-to-IP mappings can be obtained in several ways.
For the local subnet, you can just make a script which pings all nodes and then runs arp -a.
Or log in to the router (assuming it is a Cisco) and do 'show arp'.

Or, if you have SNMP enabled on your router, you can get them through SNMP.
I actually built a script once (in Perl) which did exactly that, plus getting a zone transfer from our DNS to resolve the IP addresses to names.
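A minimal sketch of that ping-and-arp approach in Python (an assumption-laden example, not the poster's script: it assumes a Unix-like host, a /24 subnet, and the common `? (10.0.0.5) at aa:bb:cc:dd:ee:ff` output format of `arp -a`):

```python
import re
import subprocess

# Matches the common Unix "arp -a" line format, e.g.
# "? (10.0.0.5) at aa:bb:cc:dd:ee:ff [ether] on eth0"
ARP_LINE = re.compile(r"\((\d+\.\d+\.\d+\.\d+)\) at ([0-9a-fA-F:]{17})")

def parse_arp_output(text):
    """Return a dict mapping IP address -> MAC address."""
    return {ip: mac.lower() for ip, mac in ARP_LINE.findall(text)}

def sweep_and_map(prefix="192.168.1"):
    """Ping every host in a /24 to populate the ARP cache, then read it back."""
    for host in range(1, 255):
        # One quick ping per address; we only care about the ARP side effect.
        subprocess.run(["ping", "-c", "1", "-W", "1", f"{prefix}.{host}"],
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    arp = subprocess.run(["arp", "-a"], capture_output=True, text=True)
    return parse_arp_output(arp.stdout)

# Example usage (would actually ping the whole subnet):
#   for ip, mac in sorted(sweep_and_map("192.168.1").items()):
#       print(f"{ip:15} {mac}")
```

Resolving the collected addresses to names (the DNS zone-transfer step mentioned above) would be a separate lookup pass.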


Marcus Augustus

Frequent Advisor

Re: Identifying the source of a broadcast storm

Hi Marcus,

thanks for the response.

I guess I was a little unclear about what I meant by the MAC address mappings (and perhaps I got my wires crossed with what Les was actually talking about).

But what I was really after is a tool which will tell me which end node is connected to which port on my switch. As you say, obtaining the IP-to-MAC mappings is fairly straightforward.

And if I log onto a switch, I can see which MAC address is on each port, on an individual-port basis. But what would be really good is an easy way of extracting this information from ALL ports on ALL switches. Then I'd know all the devices connected, and on which port.

Any suggestions on that front would be welcome!
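One option, for reference: that per-port MAC table is exposed over SNMP in the standard BRIDGE-MIB (dot1dTpFdbPort), so it can be collected from every switch in one pass. A rough Python sketch driving Net-SNMP's snmpwalk (the switch addresses and community string are placeholders, and it assumes SNMP is enabled on the switches):

```python
import re
import subprocess

# dot1dTpFdbPort from BRIDGE-MIB: the table index is the MAC address as six
# decimal octets, and the value is the bridge port the MAC was learned on.
FDB_PORT_OID = "1.3.6.1.2.1.17.4.3.1.2"

# Matches numeric snmpwalk output such as
# ".1.3.6.1.2.1.17.4.3.1.2.0.17.34.51.68.85 = INTEGER: 7"
FDB_LINE = re.compile(r"\.((?:\d+\.){5}\d+) = INTEGER: (\d+)\s*$")

def parse_fdb_walk(text):
    """Return a dict mapping MAC address -> bridge port number."""
    table = {}
    for line in text.splitlines():
        m = FDB_LINE.search(line)
        if m:
            # The last six OID components are the MAC's octets in decimal.
            mac = ":".join(f"{int(o):02x}" for o in m.group(1).split("."))
            table[mac] = int(m.group(2))
    return table

def walk_switch(switch_ip, community="public"):
    """Walk one switch's forwarding table using Net-SNMP's snmpwalk."""
    out = subprocess.run(
        ["snmpwalk", "-v1", "-c", community, "-On", switch_ip, FDB_PORT_OID],
        capture_output=True, text=True)
    return parse_fdb_walk(out.stdout)

# Example usage (placeholder addresses; needs snmpwalk on the PATH):
#   for switch in ["10.0.0.1", "10.0.0.2"]:
#       for mac, port in sorted(walk_switch(switch).items()):
#           print(f"{switch:15} port {port:3} {mac}")
```

Ports carrying many MACs are uplinks to other switches; a port with a single MAC is, in general, the end node itself.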

Esteemed Contributor

Re: Identifying the source of a broadcast storm

I have a bit of a dog's breakfast in my arsenal of tools: free, cheap, and very expensive.

I sniff with Ethereal which is free.
I trend with MRTG which is free.
I monitor and notify with WhatsUp which is cheap.
I monitor with PCM+ which is moderately priced.
I monitor and sniff with Fluke INA, WGA, OVC, and PE which cost an arm and a leg.

It would seem that the old joke about only getting two out of three of "Good, Fast, Cheap" probably applies here.

I don't want this to sound like a sales pitch for Fluke, what with this being an HP forum, but I will give you my opinion. PCM+ has some nice features, but it is no Swiss Army knife. It is, however, the primary place I monitor the network from, and I rely on Fluke to fill in the gaps.

I generally rely on the OVC DB to reference a MAC/IP to a particular switch port, but it is weak on tracing switch routes. It has some good reporting as well. However, it does not do a detailed analysis on first discovery, so it would have no detail on a new device that is behaving badly.

The INA and WGA do the trace-switchroute well, but their DB is more volatile. Of the two, the INA, being self-contained, does better under heavy network pressure. The WGA is cheaper to leave lying around on some distant segment, and not as attractive to the eye or to sticky fingers. It relies on the network to view, however, so unless you have it well partitioned, with an isolated management VLAN unaffected by the production VLAN, it can let you down in a crunch. The same could be said for PCM+.

The whole concept of partitioning the network with a separate VLAN makes sense, but since not all devices support a management VLAN, it can be challenging to set up. Of course, it is like shopping for a crash helmet... for a $100 head, you would buy a $100 helmet.

Frequent Advisor

Re: Identifying the source of a broadcast storm


Thanks for the info, that's a great help.

Looks like I have a similar armoury to you on the cheap and cheerful front!

I also use Ethereal for network sniffing, and WhatsUp for up/down monitoring.

We've only got the free version of PCM, though - the trial of PCM+ wasn't impressive enough to warrant us buying the full version, though some of the features would be nice to have (even if I do think a lot of them still need some "fine-tuning", shall we say!).

MRTG is something I am very interested in setting up here (if I can find the time!), so be warned - you may very well find me back on these forums asking you some questions about it!

Fluke is something we are very keen on getting here too; I am quite hopeful on that front, as we have heard a lot of good things about it.