05-02-2018 04:06 AM
What does LANCP "mean" by Multicast in the output of LANCP SHOW DEVICE dev /COUNTERS?
I was looking at a problem yesterday where a DECserver appeared to be unreachable. (We run a batch job that does a TSM SHOW SERVER * every 5 minutes, to give us "advance" notice of DECservers that have lost power or network connection, or have crashed/rebooted.)
The DECserver was reachable, and SHOW SERVER STATUS indicated that it (still) had an uptime of 699 days.
SHOW SERVER COUNT indicated that it had a User Buffer Unavailable count of ~18000 (I snapshot the counters across the DECserver estate once a week, and UBU is mostly 0 or still in single digits).
I then noticed that the Multicast Frames Rcv'd count was (in the 4.5 days since the counters were reset) ~2.3 times what it would normally get in a week.
Checking a handful of other DECservers showed the same sort of figures for the UBU and MFR counters.
My conclusion was that something had generated a storm of multicast frames in a short space of time, and the DECserver was too busy dealing with it to respond to TSM's connection to the remote console management port (which I guess is essentially what TSM SHOW SERVER * does).
I guess that DECservers earlier and later in the alphabetical list in the TSM database were not "unreachable" because the storm occurred before or after TSM attempted to connect to their remote console ports.
I then had a look at counters that we snapshot on two of our OpenVMS nodes once per minute and store in a CSV file:
MC NCP SHOW KNOWN LINE COUNTERS
MC NCP SHOW KNOWN CIRCUIT COUNTERS
MC LATCP SHOW LINK /COUNTERS
MC LATCP SHOW NODE /COUNTERS
MC LANCP SHOW DEVICE EZA0 /COUNTERS
UCX SHOW INTERFACE ZE0 /FULL
There didn't appear to be any sizeable jumps in the counters (particularly Multicast Blocks Received or Multicast Bytes Received) in the snapshots for the minutes before and after the time that the batch job reported one DECserver as being unavailable.
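For what it's worth, the comparison described above can be sketched roughly like this (the CSV layout, column names, and threshold are hypothetical, not our actual files): scan consecutive per-minute snapshots and flag any delta in a multicast counter that exceeds a threshold.

```python
# Hypothetical sketch of the per-minute snapshot comparison: the CSV layout,
# column names, and threshold are assumptions, not our actual files.
import csv

THRESHOLD = 1_000  # flag any one-minute delta larger than this

def find_jumps(csv_path, counter="Multicast Blocks Received"):
    """Yield (timestamp, delta) wherever the counter jumped between snapshots."""
    prev = None
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            value = int(row[counter])
            if prev is not None and value - prev > THRESHOLD:
                yield row["Timestamp"], value - prev
            prev = value
```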
Unfortunately (as is always the case) Wireshark (or similar) was not constantly monitoring and capturing network traffic, so I have no idea what the multicast traffic was or where it came from.
The networks team indicated that the switches didn't record any events or high volumes of traffic or CPU usage on the switches at the time of the event.
Two DECservers connected to the same network switch as one of the OpenVMS nodes similarly showed high values for the UBU and MFR counters. (Unfortunately, there are no DECservers connected to the same switch as the other OpenVMS node, but I imagine they would have reported the same; a straw poll of a handful of other DECservers, all on the same VLAN but scattered throughout the site, reported high levels as well.)
It might be that I am misinterpreting the Multicast counter values that we're getting from LANCP (I don't know what LANCP means by a Multicast "Block" - is that essentially an Ethernet frame?), but I was wondering whether or not what LANCP reports when it talks about Multicast traffic is different to what a DECserver reports?
In a copy of the "DECnet Digital Network Architecture Ethernet Data Link Architectural Specification V1.0.0 SEP-1983" that I previously found somewhere on the Internet, it says this:
"The user of the DNA Ethernet Data Link is concerned with two different levels of identification. Users identify either a channel or a portal."
"A portal data base contains the lists of protocol types and multicast addresses that the user has enabled for receipt of incoming frames."
"The user will receive frames only for enabled protocol types and multicast addresses. It also contains the lists of outstanding transmits and receives for the user."
"A portal receives multicast frames only for those multicast addresses it specifically enables. In this context, broadcast is treated the same as other multicast addresses. Multicast addresses enabled or disabled on one portal have no effect on other portals."
"Conceptually, receive filtering is done first by protocol type, then by multicast address. A frame that does not pass filtering is discarded. In actual practice, an implementation may first filter in hardware the union of the multicast addresses for all portals and then do the filtering again on a reduced number of frames."
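As I read it, the two-stage filtering described in that last paragraph amounts to something like the following (an illustrative sketch with made-up structure names, not real driver code):

```python
# Illustrative sketch (made-up structures, not real driver code) of the
# two-stage receive filtering the spec describes: first by protocol type,
# then, for multicast destinations, by enabled multicast address.

def is_multicast(mac: str) -> bool:
    # A destination is multicast if the least-significant bit of the first
    # octet is set; broadcast (FF-FF-FF-FF-FF-FF) is a special case of this.
    return bool(int(mac.split("-")[0], 16) & 0x01)

def portal_accepts(portal: dict, frame: dict) -> bool:
    """Would this portal receive (and hence count) this frame?"""
    # Stage 1: discard frames for protocol types the portal has not enabled.
    if frame["protocol"] not in portal["enabled_protocols"]:
        return False
    # Stage 2: discard multicast frames whose address is not enabled here.
    dest = frame["dest"]
    if is_multicast(dest) and dest not in portal["enabled_multicast"]:
        return False
    return True
```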
I'm not a network wizard by any stretch of the imagination, but my reading of the above (particularly the last quoted paragraph) suggests that when LANCP reports Multicast "Blocks" or bytes, it is only counting frames for protocols that have been enabled for a portal; so if the protocol type specified in the Ethernet frame was not one it was interested in, then even though the Multicast frame was received, it wouldn't actually be counted?
Is it perhaps the case, then, that DECservers (700s, specifically) are less discriminating when they report on Multicast, and count all Multicast traffic regardless of the protocol type specified in the Ethernet frame?
[I realise that without a capture of network frames from the event time, trying to work out who/what/where/why might be difficult, but at least if I could establish what protocols' Multicast frames LANCP is counting compared to the DECserver, I might at least be able to rule out the Multicast traffic as being for particular protocols...
We have checked all of the DECservers and none of them have rebooted, so it doesn't appear to be MOP-related multicast traffic (either for downline load or dumping of the server image).
This is the second network storm in less than a week; unfortunately, their natures differ. Last week, the .COM that runs /DETACHED and snapshots the counters every minute failed because there was no error handling for the UCX SHOW INTERFACE ZE0 /FULL command failing. That command failed because the maximum number of BG devices had been reached (the limit is set to 300, and even on our busiest system with lots of Telnet connections, it peaks at ~70-80 devices), so something somewhere was hammering the node with a lot of TCP/UDP packets. Unfortunately, when that occurs (and if it is persistent), you can't issue a SHOW DEVICE /FULL command from UCX to see what protocols and IP addresses are being used by each of the BG devices, because executing the command itself requires a BG device (as does the command to increase the number of devices).
Unfortunately, there's way too much network traffic to have Wireshark recording 24/7 on the off-chance that something starts flapping on the network, just so we can see what it is; if something starts persistently flapping, obviously, that will force our hand...]
Thoughts on the back of a postcard, please...
05-02-2018 05:34 AM - edited 05-02-2018 05:36 AM
Re: What does LANCP "mean" by Multicast in the output of LANCP SHOW DEVICE dev /COUNTERS ?
'block' = 'packet' = 'PDU' (protocol data unit) - all refer to Ethernet Frames.
Using ANALYZE/SYS and SDA> SHOW LAN/FULL will show you the multicast addresses enabled on the OpenVMS system as well as the Mcast PDU (received and sent) counters per protocol. Maybe this can help to find out which protocol is involved.
If you add up all the Mcast counters from all protocols and subtract it from the overall LAN interface Mcast counters, the remainder might be 'Broadcast' traffic (ARP comes to mind as a usual suspect).
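Roughly, with made-up counter values (the per-protocol figures and the device total below are purely illustrative):

```python
# Made-up counter values to illustrate the arithmetic: the per-protocol
# Mcast PDU counts come from SDA> SHOW LAN/FULL, the device-level total
# from LANCP SHOW DEVICE/COUNTERS.
per_protocol_mcast = {
    "LAT":    1_200,
    "DECnet": 3_400,
    "MOP":      150,
}
device_mcast_total = 12_000

accounted = sum(per_protocol_mcast.values())
remainder = device_mcast_total - accounted
print(f"accounted for by enabled protocols: {accounted}")
print(f"remainder (broadcast/ARP and other unconsumed multicast): {remainder}")
```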
05-16-2018 04:19 PM
Re: What does LANCP "mean" by Multicast in the output of LANCP SHOW DEVICE dev /COUNTERS ?
Volker, thanks for your reply, and apologies for the delay in responding - unfortunately, when one is doing the work of ~8 people, it doesn't leave a lot of free time, especially when other issues occurred last week that needed me to take action...
>'block' = 'packet' = 'PDU' (protocol data unit) - all refer to Ethernet Frames.
Thanks for clearing that up!
>Using ANALYZE/SYS and SDA> SHOW LAN/FULL will show you the multicast addresses enabled on the OpenVMS system as well as the Mcast PDU (received and sent) counters per protocol. Maybe this can help to find out which protocol is involved.
It was certainly useful, and it highlighted another issue: on one node in the pair, (MOP) service was not enabled on the circuit, which meant that only the other node might ever respond to downline-load requests from the DS700s (I had observed this on the test system, although I think it was the other way round as to which node was affected).
Unfortunately, you can't enable service without setting the state to OFF first, so I'll need to wait until we get downtime in 2 weeks' time.
Curiously, on the node which did have service enabled, this appeared to cause two multicast addresses to be declared:
AB-00-00-01-00-00 [DNA Dump/Load Assistance (MOP)]
CF-00-00-00-00-00 [Ethernet Configuration Test Protocol]
[Parenthesised description added by me, after doing a bit of Google research]
Well, either that, or there is some other configuration difference (rather than (MOP) service being enabled) that causes the CF-00-00-00-00-00 address to be declared...
>If you add up all the Mcast counters from all protocols and subtract it from the overall LAN interface Mcast counters, the remainder might be 'Broadcast' traffic (ARP comes to mind as a usual suspect).
That certainly appears to be the case, although unfortunately, I can't definitively pin the cause on ARP multicast traffic...
According to one of our networks guys, as the OpenVMS nodes and the DECservers are on the same VLAN, the ARP multicast traffic must be originating from something on that VLAN (and supposedly there are no switch ports configured for this VLAN that have something else plugged in, or that have been left "open" such that someone randomly plugging in a laptop or desktop PC could cause this).
I checked all of our DECservers to confirm that they all had the same high level of MFR and UBU counts; two had significantly higher values, but on investigation, it was because they were recently added in to the network by someone else in my team, and I had omitted to update a script which grabs server config & counters & resets the counters, to include these two.
One of the DECservers had lower counts, but had rebooted a few days before the event of one DS700 not being reachable (power had been switched off to the cabinet), so the reduced count was because it had not been counting background ARP multicast for as long as the others.
When I compared the MFR counts against the seconds since zeroed on previous snapshots, it averaged out to 1 frame every 75.52 seconds; when comparing it to the time the DECserver was not reachable, it was 1 frame every 66.07s (of course, a lot of those frames may have occurred within a short space of time, so this skews the average when you are simply dividing the seconds since zeroed by the MFR count).
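For reference, the averaging is just seconds-since-zeroed divided by the MFR count; the frame counts below are hypothetical, chosen to reproduce the quoted intervals over one week.

```python
# The averages quoted above are simply "seconds since zeroed" divided by
# the Multicast Frames Rcv'd count. The frame counts here are hypothetical,
# chosen to reproduce the quoted intervals over one week (604,800 s).

def mean_frame_interval(seconds_since_zeroed, mfr_count):
    """Mean seconds between multicast frames since the counters were zeroed."""
    return seconds_since_zeroed / mfr_count

week = 7 * 24 * 3600                       # 604,800 seconds
normal = mean_frame_interval(week, 8_009)  # ~75.52 s per frame
event = mean_frame_interval(week, 9_154)   # ~66.07 s per frame
print(f"normal: {normal:.2f}s  event week: {event:.2f}s")
# A burst of frames packed into 1-2 seconds barely moves this mean, which
# is exactly why the simple average hides the storm.
```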
At this point, I still considered the ARP multicast traffic to be the cause of the problem, but because I wasn't taking snapshots frequently, I couldn't be certain whether the MFR count had jumped excessively over a very small period of time (say 1 or 2 seconds).
Looking at previous weekly snapshots, the User Buffer Unavailable count across the DECserver estate is not very high (either 0, or still single-digit) - but in the case of when the event occurred, shortly after checking the counters, the UBU count had risen to ~18,000 (from - presumably - less than 10).
I then re-read the significance of the UBU value: a frame is discarded before it is even checked to see whether it has a known/acceptable multicast address.
For the number of configured DECservers, it takes ~2 seconds for a TSM SHOW SERVER * to complete normally, so the implication is that ~18,000 frames were discarded in a 2-second period.
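The back-of-envelope sum, assuming the whole UBU jump fell within that ~2-second window:

```python
# Back-of-envelope: if the entire User Buffer Unavailable jump occurred
# within the ~2 s that TSM SHOW SERVER * takes to complete, the discard
# rate works out to:
ubu_jump = 18_000      # observed rise in the UBU counter
window_seconds = 2     # typical TSM SHOW SERVER * duration
rate = ubu_jump / window_seconds
print(f"~{rate:,.0f} discarded frames per second")
```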
As the frames were discarded, there's no way of confirming that they were multicast traffic, much less ARP.
I think that short of having Wireshark constantly logging frames on the VLAN, waiting for the problem to recur, then reviewing the capture to see who/what is generating an absurd level of traffic, I won't get to the bottom of this.
I had thought that this problem had occurred more frequently, but on searching my old email history, I could only find 3 instances of a DECserver being "unreachable" where we could then log on to it (and its uptime had not dropped to 0 00:01:00 or similar).
However, it was the same DECserver on all 3 occasions. It could be that it has a faulty NIC, and *it* is causing the problem itself...
Thanks once again for your post; the SHOW LAN /FULL was very helpful. I've actually now incorporated it into my weekly counter snapshot script, so that I can keep historic records and compare delta values in case of future events.