thom_14
Regular Advisor

Polling false negatives on C7000 modules

I have iMC 7.2 (e403) running on Windows. I've been chasing my tail on a "problem" with my network which may not actually be a network problem, but a problem with iMC. iMC has been reporting various modules "down": they show up red on the topology, and I get "Device XXX does not respond to ping packets". I spent a bunch of time poking around trying to find where the problem was, but didn't find anything! As a further test, I set up a continuous ping from a couple of other places on the network against one of the 'problem' devices and didn't lose a single packet. Yet while I'm watching my ping in one window, I'll see a module light up red on the topology. If I drill into the device and "refresh" or "synchronize", the condition goes away, although the downtime is still recorded and I see it when I go to "Performance at a glance".

This only seems to be happening with C7000 modules - the OA and VC cards.  The underlying blades and VMs on those blades are happy as clams. 

I tried moving the OA network connections to different switches, to no avail. iMC correctly noted the unplugged OA, but still reports spurious phantom outages.

I lengthened the polling interval (300s for status, 1500s for configuration), which just slows down the rate of false alarms but doesn't really fix anything.

All of the OAs and VC modules are at the latest patch levels.

So, what I need to figure out is how to make the polling more resilient, so that iMC doesn't think the sky is falling every 5 minutes.  And why is it just with the C7000?  Is it because of the limited bandwidth of the OA connection?

LindsayHill
Honored Contributor

Re: Polling false negatives on C7000 modules

You can change the timeouts & number of retries, but you still shouldn't be seeing much packet loss, assuming these systems are at the same site.

You mention testing from different locations on the network. How does the path those packets take differ from the path of packets from IMC to the C7000s? What's different, what's the same? Is there any common point?

thom_14
Regular Advisor

Re: Polling false negatives on C7000 modules

I tried from several different points on the network, with different numbers of hops in between. I even tried from a site several states away, with the same result!

thom_14
Regular Advisor

Re: Polling false negatives on C7000 modules

Screenshot

LindsayHill
Honored Contributor

Re: Polling false negatives on C7000 modules

Usually with this sort of situation I'll try to identify commonalities & differences, to isolate where the problem lies.

Is this summary correct?

  • IMC is periodically reporting C7000 OA/VC cards as offline, even though they appear to be OK when manually checked. Other network devices are fine (ie they are only reported offline when they genuinely are offline.)
  • IMC ping configuration is set to default of 3 retries / 2 seconds timeout (System -> System Configuration -> System Settings -> Ping Configuration)
  • Running test continuous pings to the C7000s from other locations around the network does not show any dropped packets. These packets traverse a range of network paths, including both the same & different paths to IMC -> C7000s.

Questions/Comments:

  • How frequently is IMC reporting those devices offline? In the above screenshot, it appears that IMC might be saying that it was only online for 29% of the last hour. Is that correct? This is odd behaviour. The default ping retry/timeout behaviour means that a couple of dropped packets won't mark a device as offline. 3 dropped packets in a row is a lot.
  • If you look at IMC graphs for ping response times, do you see consistent low latency, apart from when it fails? Or do you see latency spikes?
  • Do any other devices show similar behaviour?
  • Do you have any stateful firewalls between IMC & the blade chassis? Do you have multiple possible network paths, or just one? (Things get interesting if packets can take different paths across the network).
  • What happens if you set up a continuous ping from the IMC server, and let it run for an extended period. Do you see lost packets?
  • Have you tried capturing packets at the IMC server for an extended period, to check that replies are definitely not received? Should be easy enough to verify, if you're seeing frequent timeouts. If you do capture evidence that replies are not being received by the IMC server, that gives you an indication of the direction to look (i.e. somewhere upstream of the IMC server).
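To put a rough number on why three drops in a row is a lot, here is a back-of-the-envelope sketch. It assumes that a poll only marks a device offline when every one of the 3 attempts goes unanswered, and that packet losses are independent of each other. Both are simplifying assumptions, not documented IMC behaviour, and the function name is made up for illustration.

```python
def false_down_probability(loss_rate, attempts=3):
    """Chance that every attempt in one poll is lost.

    Assumes independent losses and that a poll fails only when all
    attempts fail (an assumption for illustration, not IMC's documented logic).
    """
    if not 0.0 <= loss_rate <= 1.0:
        raise ValueError("loss_rate must be between 0 and 1")
    return loss_rate ** attempts


# With 1% random packet loss, a single poll would fail about once in a million polls.
print(f"{false_down_probability(0.01):.1e}")
```

If outages show up far more often than this kind of estimate suggests, the losses are probably not random: they are likely bursty or tied to something specific on the path or the target.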
thom_14
Regular Advisor

Re: Polling false negatives on C7000 modules


  • IMC is periodically reporting C7000 OA/VC cards as offline, even though they appear to be OK when manually checked. Other network devices are fine (ie they are only reported offline when they genuinely are offline.)

Yes

  • IMC ping configuration is set to default of 3 retries / 2 seconds timeout (System -> System Configuration -> System Settings -> Ping Configuration)

Yes

  • Running test continuous pings to the C7000s from other locations around the network does not show any dropped packets. These packets traverse a range of network paths, including both the same & different paths to IMC -> C7000s.

Yes

Questions/Comments:

  • How frequently is IMC reporting those devices offline? In the above screenshot, it appears that IMC might be saying that it was only online for 29% of the last hour. Is that correct? This is odd behaviour. The default ping retry/timeout behaviour means that a couple of dropped packets won't mark a device as offline. 3 dropped packets in a row is a lot.

iMC is reporting the device offline quite a bit. I stretched the poll interval to 300s, which has decreased the average device unreachability %, but that's only because it's not checking as frequently.

  • If you look at IMC graphs for ping response times, do you see consistent low latency, apart from when it fails? Or do you see latency spikes?

The latency is pretty low (average ~10ms); there will be a couple of spikes here and there into the 800ms range.

  • Do any other devices show similar behaviour?

This is the fun part: no - everything else is normal, and only reports as being down when I deliberately take things offline/reboot etc.

  • Do you have any stateful firewalls between IMC & the blade chassis? Do you have multiple possible network paths, or just one? (Things get interesting if packets can take different paths across the network).

No firewalls of any kind; again, the network paths to these modules are the same paths as everything else.

  • What happens if you set up a continuous ping from the IMC server, and let it run for an extended period. Do you see lost packets?

I did this on a few devices and let them run for several hours.  One device lost 3 packets, the rest lost zero.

  • Have you tried capturing packets at the IMC server for an extended period, to check that replies are definitely not received? Should be easy enough to verify, if you're seeing frequent timeouts. If you do capture evidence that replies are not being received by the IMC server, that gives you an indication of the direction to look (i.e. somewhere upstream of the IMC server).

I've run Wireshark to watch the ICMP traffic while I'm running a continuous ping, and it looks like the packets are getting there OK, but I haven't done this for an extended period of time. Outside of iMC, everything looks normal; there's something "inside" iMC that's wonky.


 

LindsayHill
Honored Contributor

Re: Polling false negatives on C7000 modules


@thom_14 wrote:

I've run wireshark to watch the ICMP traffic while I'm running a continuous ping, and it looks like the packets are getting there OK, but I haven't done this for an extended period of time.    Outside of iMC, everything looks normal; there's something "inside" iMC that's wonky.



Run Wireshark on the IMC server, and leave it running there for a while - long enough that you see IMC report outages for those nodes. Set it up with capture filters so it's only capturing ICMP traffic to/from the affected nodes.
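As a rough sketch of what such a capture filter could look like: Wireshark capture filters use the libpcap filter syntax, so the string can be assembled from a list of node addresses. The helper name and the addresses below are placeholders for illustration; substitute the addresses of your own OA/VC modules.

```python
def build_capture_filter(hosts):
    """Build a libpcap capture filter matching ICMP to/from the given hosts."""
    if not hosts:
        raise ValueError("at least one host address is required")
    host_terms = " or ".join(f"host {h}" for h in hosts)
    # Parentheses are only needed when several hosts are OR-ed together.
    if len(hosts) > 1:
        host_terms = f"({host_terms})"
    return f"icmp and {host_terms}"


# Placeholder addresses; replace with the affected modules.
print(build_capture_filter(["10.50.2.2", "10.50.2.3"]))
# icmp and (host 10.50.2.2 or host 10.50.2.3)
```

Paste the resulting string into the capture filter field when starting the capture, so that only the relevant ICMP traffic is written and the capture stays small enough to leave running.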

Once IMC has reported an outage, look through the capture for that time window. I *expect* to see multiple echo-request packets, with no matching echo-reply. If you can capture that, then at least you know it's not within IMC. (Doesn't solve the problem, but at least gives you an idea of your next move). If you see echo-reply packets coming back within the timeout window, then the problem is IMC. 
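If the capture is long, checking it by eye gets tedious. Below is a minimal sketch of the matching logic, assuming the ICMP packets have already been exported from the capture into simple records (timestamp, request or reply, ICMP identifier, sequence number). The record type and function name are hypothetical, and the 2-second default mirrors the IMC ping timeout discussed above.

```python
from typing import NamedTuple


class IcmpPacket(NamedTuple):
    time: float   # seconds since the start of the capture
    kind: str     # "request" or "reply"
    ident: int    # ICMP identifier
    seq: int      # ICMP sequence number


def unanswered_requests(packets, timeout=2.0):
    """Return echo requests that got no matching reply within `timeout` seconds."""
    # Index reply arrival times by (identifier, sequence number).
    replies = {}
    for p in packets:
        if p.kind == "reply":
            replies.setdefault((p.ident, p.seq), []).append(p.time)

    missing = []
    for p in packets:
        if p.kind != "request":
            continue
        times = replies.get((p.ident, p.seq), [])
        # A reply counts only if it arrives after the request and inside the window.
        if not any(0 <= t - p.time <= timeout for t in times):
            missing.append(p)
    return missing


capture = [
    IcmpPacket(0.00, "request", 1, 1),
    IcmpPacket(0.01, "reply", 1, 1),    # answered in time
    IcmpPacket(2.00, "request", 1, 2),
    IcmpPacket(4.00, "request", 1, 3),  # never answered
    IcmpPacket(5.00, "reply", 1, 2),    # arrives 3 s later, outside the window
]
print([p.seq for p in unanswered_requests(capture)])
# [2, 3]
```

Several unanswered requests in a row around the time IMC raised the alarm would point away from IMC; replies that arrive inside the window would point back at it.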

thom_14
Regular Advisor

Re: Polling false negatives on C7000 modules

OK, I let Wireshark run for a while, and I found cases of 'no response found'. The device 10.50.2.2 is a Flex10 module; there is nothing special about this module itself, I just dug through until I found a failure case.

Now what do I do about it?

LindsayHill
Honored Contributor
Solution

Re: Polling false negatives on C7000 modules

Well, at least now you know that it's not the IMC server. If it's sending out a request, but no reply comes back, then it's doing the right thing by marking the node as unreachable.

My next move would be to investigate the OA closely. I'd be checking things like logs, firmware versions, etc.

If I had a suspicion that it was a network problem, I might set up a SPAN port immediately upstream of the OA, to verify that the pings were received at the last switchport, but that the OA never sent a response. But I'd probably dig into the OA first.

thom_14
Regular Advisor

Re: Polling false negatives on C7000 modules

I know we have network issues, but this ping issue only seemed to manifest in the C7000 modules. I set up another test against one of our better switches and got the same thing. I did it again on a completely different machine and was able to reproduce it, so it's not iMC. I see "no response found" in Wireshark, but the command-line ping that generated the traffic doesn't show a single lost packet.