10-02-2018 03:11 AM - edited 11-27-2018 03:46 AM
DECnet Phase IV Executor Characteristics Parameters, FAL & %SYSTEM-F-UNREACHABLE
Last night, a batch job failed after unexpectedly getting an error attempting to copy a file from one node to another.
The job successfully completed 3x DELETE commands on the remote node, then 2x COPY commands to the remote node, and failed on the third COPY command with the following:
%COPY-E-OPENOUT, error opening filename as output
-RMS-E-CRE, ACP file create failed
-SYSTEM-F-UNREACHABLE, remote node is not currently reachable
The system wasn't (permanently) unreachable, and no subsequent FAL operations in other jobs failed.
I checked ERRLOG.SYS and OPERATOR.LOG on both nodes: no DECnet events were reported at the time. Events 4.1 (node unreachable packet loss) and 4.14 (node reachability change) are not enabled for logging, but 4.15 (adjacency up) and 4.18 (adjacency down) are; even so, no events were recorded.
A separate .COM file that runs as a detached process and snapshots various network counters into a .CSV file once a minute doesn't reveal any surge in network traffic or errors; the accounting records don't show large numbers of processes starting around the time, or terminating afterwards with heavy CPU or I/O usage; and there is nothing in the security audit journals on either node.
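For what it's worth, a quick way to sanity-check those per-minute counter snapshots is to scan the .CSV for deltas that stand out. A minimal sketch (the column names here are hypothetical, not our actual file's layout):

```python
import statistics

def find_surges(rows, counter_col="frames_received", threshold_sigmas=3.0):
    """Flag per-minute snapshots whose counter delta is an outlier.

    `rows` is the snapshot .CSV parsed into dicts (e.g. via csv.DictReader);
    the "timestamp" and "frames_received" column names are assumptions.
    """
    # The snapshots are cumulative counters, so work on the deltas.
    deltas = [int(b[counter_col]) - int(a[counter_col])
              for a, b in zip(rows, rows[1:])]
    mean = statistics.mean(deltas)
    stdev = statistics.pstdev(deltas)
    return [(rows[i + 1]["timestamp"], d)
            for i, d in enumerate(deltas)
            if stdev and abs(d - mean) > threshold_sigmas * stdev]
```

Anything this flags would at least give a minute-granularity window to take to the network guys.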
I suspect that this is "simply" the case of a transient network issue, but the Cisco switches either don't log things at a low enough granularity (to avoid being swamped with logging activity) or any issue only lasted for a single Ethernet frame or similarly was less than the threshold for Cisco logging activity.
I remember, back at my second employer, having a very detailed email discussion with someone at HP (based in the UK, possibly originally from the DEC Worton Grange facility) who I think was called John, about the various NCP parameters and their relationship to %SYSTEM-F-UNREACHABLE (at the time it concerned the DECss7 product, and the rate at which the instance on one cluster node realised that another cluster node's instance was unreachable because that node had died).
I still have that email (in which he counselled against certain changes because of the impact they would have), but unfortunately it's in a .PST file which I don't have any means of readily viewing.
As the batch job only ran for 00:01:07.46 before terminating with the error on the COPY, my guess would be that a one-minute timer expired.
Given that we haven't defined NETSERVER$TIMEOUT, I would expect the NETSERVER process dealing with the FAL activity from this batch job to remain running for the default 5 minutes before terminating due to inactivity, and to have processed the preceding 3x DELETE and 2x COPY commands.
Therefore, out of the NCP EXECUTOR CHARACTERISTICS parameters, I would expect that OUTGOING TIMER (set to 60 seconds) would not be relevant.
I would guess that the INACTIVITY TIMER value of 60 seconds might be relevant, but there's also the DELAY WEIGHT (set to 5), DELAY FACTOR (set to 80) and RETRANSMIT FACTOR (set to 10).
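As a sanity check on that reasoning, the job's elapsed run time can be compared against the candidate timers (the timer values are the ones quoted above; the delta-time parsing is mine):

```python
def vms_elapsed_to_seconds(elapsed):
    """Convert an OpenVMS elapsed time like '00:01:07.46' to seconds."""
    hours, minutes, seconds = elapsed.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)

# Candidate timers from the post, in seconds.
TIMERS = {
    "OUTGOING TIMER": 60,
    "INACTIVITY TIMER": 60,
    "NETSERVER$TIMEOUT (default)": 300,
}

def expired_candidates(elapsed):
    """Which timers could plausibly have expired within the job's run time?"""
    run = vms_elapsed_to_seconds(elapsed)
    return [name for name, t in TIMERS.items() if t <= run]
```

For the 00:01:07.46 run, only the two 60-second timers fit, which is consistent with the one-minute-timer guess.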
I've had a look at the DECnet for OpenVMS Networking Manual (AA-PV60A-TK, May 1993), specifically at the section on NSP Message Retransmission, and I'm a little unclear as to how the parameters relate to each other.
The section does say to look at the Network Services Protocol Functional Specification, which I have, although it doesn't make it much clearer to me.
My understanding is that NSP would underpin "DECnet", and thus FAL, so therefore, the underlying messages from the source node trying to copy the file to the destination node, should have been retransmitted - I'm just a little unclear on how many times the retransmission would have occurred, and what the intra-transmission delay/timeout period might be.
[I guess that DECnet/NSP has weighting factors that depend on historical successful (or otherwise) transmission between the two nodes, and that you can't say for certain that there would be X attempts in total with Y amount of time in between each attempt]
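For what it's worth, my reading of the NSP spec is that DELAY WEIGHT smooths the round-trip estimate and DELAY FACTOR (expressed in sixteenths) scales that estimate into the retransmission timer, but treat both formulas below as my interpretation rather than gospel:

```python
def smoothed_delay(estimate, observed_rtt, delay_weight=5):
    """Update the round-trip delay estimate (my reading of the NSP spec):
    new = (DELAY WEIGHT * old + observed) / (DELAY WEIGHT + 1)."""
    return (delay_weight * estimate + observed_rtt) / (delay_weight + 1)

def retransmit_timer(estimate, delay_factor=80):
    """Retransmission timer as the delay estimate scaled by DELAY FACTOR,
    which I understand to be in units of one sixteenth."""
    return estimate * delay_factor / 16
```

With the defaults quoted above (DELAY WEIGHT 5, DELAY FACTOR 80), the timer works out at five times the estimated round trip, so the intervals between retries would indeed depend on the observed history between the two nodes.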
I suppose what I am really trying to ascertain is how long the remote node appeared to be unreachable for, and whether or not underlying messages between the two nodes for the COPY would actually have been retransmitted.
This might help identify whether or not the overall period might be below whatever default threshold Cisco implement before recording an abnormal event (or what our network guys have configured them to use; for me, networking seems to require incantations and sacrificial altars; I've never really needed to understand what goes on under the hood, or why).
Thoughts on the back of a post...
11-27-2018 03:45 AM - edited 11-27-2018 04:02 AM
Re: DECnet Phase IV Executor Characteristics Parameters, FAL & %SYSTEM-F-UNREACHABLE
After having splashed the cash and bought the "Pro" version of some freeware that would allow me to search my old .PST files, I found the email discussion with the engineer (John Weir) in the Networking group in Reading, UK at Compaq (the email was not as recent as I thought I'd remembered - it was in 2001).
I've since had the opportunity to do some testing and capture the Ethernet frames when one of our network guys enabled port spanning on the port that one of the two test nodes was connected to, so I thought I'd give an update.
I indicated that a batch job successfully completed 3x inter-nodal DELETE commands, then 2x COPY commands before finally getting a remote node unreachable indication on a third COPY command, and that I had guessed that a one-minute timer had expired due to the amount of time that the job ran for.
I said that I was trying to ascertain how long the remote node appeared to be unreachable for, and whether or not underlying messages between the two nodes for the COPY would actually have been retransmitted.
Well, I conducted a simple test where I changed the DECnet address of the target reachable node in the volatile DECnet (Phase IV) database to be another (unused) DECnet node number in the same DECnet area, and attempted an inter-nodal file copy whilst Wireshark was capturing the Ethernet frames.
The source node from which the COPY was initiated had:
NCP EXECUTOR CHARACTERISTICS RETRANSMIT FACTOR set to 10
NCP EXECUTOR CHARACTERISTICS OUTGOING TIMER set to 60
The Wireshark capture revealed a DNA NSP message type %X18 (Connect Initiate) followed by 9x DNA NSP message type %X68 (Retransmitted Connect Initiate), with between 5.737158 and 6.020091 seconds between each message (near enough 6 seconds as makes no difference).
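For anyone repeating the exercise, the two NSP message-type values seen in the capture, plus the spacing check, look like this (the type-to-name mapping covers only the two values I observed, nothing more):

```python
# NSP MSGFLG values observed in the Wireshark capture.
NSP_TYPES = {
    0x18: "Connect Initiate",
    0x68: "Retransmitted Connect Initiate",
}

def decode_msgflg(value):
    """Name the NSP message type, for the values seen in this test."""
    return NSP_TYPES.get(value, f"unknown (0x{value:02X})")

def frame_gaps(timestamps):
    """Gaps (seconds) between successive capture timestamps."""
    return [b - a for a, b in zip(timestamps, timestamps[1:])]
```

Feeding the ten capture timestamps in gives nine gaps, all between roughly 5.74 and 6.02 seconds.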
I'd also collected the OpenVMS system time before and after the COPY attempt, and it was as near 60 seconds as makes no difference.
So, this answers both questions - yes, messages were retransmitted, and it "appeared" that the node was unreachable for 60s.
But a little analysis, if you'll allow me...
The [Phase IV] DECnet for OpenVMS Network Management Utilities manual (AA-PV61A-TK, MAY-1993) on page 3-69 says that the OUTGOING TIMER:
"Specifies the timeout value for the elapsed time between the moment a connection
is requested and the moment that connection is acknowledged by the destination
node. A value in the range of 30 to 60 seconds is recommended. The default is [...]"
It says on page 3-70 that the RETRANSMIT FACTOR:
"Defines the maximum number of times any given message (except a connect
initiate message) will be retransmitted before the logical link is disconnected.
The range is 1 to 65,535. The default is 10."
This "except a connect initiate message" caveat doesn't appear to be true - in my (admittedly basic) test, the Connect Initiate message was retransmitted (albeit with a DNA NSP type of %X68 rather than re-sending a type %X18).
The DNA NSP Functional Specification Phase IV (v4.0.1, JUL-1984) describes the NSP "retransmit threshold" under the "LOCAL-READ-RETRANSMIT" Network Management Interface function thus:
"This function returns the maximum number of times the source
NSP will restart an expired retransmission timer before
deciding that the remote node is probably unreachable (see
CONFIDENCE in Section 3.1). When this number is exceeded, NSP
gives a "probable disconnection" return to a Session Control
CONFIDENCE call. Session Control may then disconnect the link
on behalf of the end user. The retransmit threshold is called
the NODE RETRANSMIT FACTOR in the DNA Network Management [...]"
It makes no mention of this NOT applying to Connect Initiate messages.
It seems, therefore, that this is probably a documentation error in the DECnet for OpenVMS Network Management Utilities manual - it appears that (certainly as regards a Connect Initiate) the RETRANSMIT FACTOR determines the total number of transmission attempts (not the number of retries) made within OUTGOING TIMER seconds.
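On that interpretation the observed spacing falls out directly - the total attempts are bounded by RETRANSMIT FACTOR within the OUTGOING TIMER window (this is my inference from the capture, not anything documented):

```python
OUTGOING_TIMER = 60      # seconds (our executor setting)
RETRANSMIT_FACTOR = 10   # total attempts, on this interpretation

# Expected spacing if the attempts are spread evenly across the timer window.
expected_interval = OUTGOING_TIMER / RETRANSMIT_FACTOR   # seconds

# The capture showed gaps between 5.737158 and 6.020091 seconds,
# which brackets this value.
assert 5.737158 <= expected_interval <= 6.020091
```

Ten attempts at 6-second spacing also accounts for the overall 60 seconds of apparent unreachability.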
So, it appears that in the original batch job problem, 10 consecutive (Retransmitted) Connect Initiate messages over a period of 60 seconds were either not delivered to the target node in an inter-nodal FAL file copy, or the Connect Confirm response wasn't delivered back to the originating node.
Unfortunately, neither the statistics counters on the Cisco access switches that the two nodes are connected to, nor those on the intermediate "Level Two Switches" (our network guys' terminology, not mine), indicate any frame loss. (I did have a bit of a Google and found something about internal queues which, if full, can seemingly drop frames silently - you wouldn't see the loss in the "normal" counters, you'd need to look at the counters for those queues - but even those queue counters didn't appear to show any frame loss.)
So I thought that perhaps the recipient node itself dropped them because of load issues, but checks of the application log files do not reveal certain user commands known to cause high CPU load to have been executed anywhere near the time, and unfortunately the per-minute network counter snapshots that I take had long since been overwritten (after 2 weeks).
The DNA NSP Functional Specification Phase IV indicates under the "REMOTE-READ-MESSAGES RECEIVED" function that:
"This function returns the total number of all types of NSP
messages received from the remote node regardless of their
disposition. This includes detected duplicates."
There doesn't appear to be any separate count of duplicates, so it doesn't appear to be possible to tell whether or not the (Retransmitted) Connect Initiate messages were eventually received (albeit late).
However, even if there were such a count, getting at it programmatically is another matter. When I mentioned to John Weir that I had a recollection of there being an undocumented interface (possibly via the DECnet NML object) to get the equivalent of NCP command output, rather than having to parse the output of NCP commands, and asked whether there was something equivalent for NCL, he said:
"There probably is an undocumented internal interface, but I suspect that getting information about it would take longer than the age of the Universe.
If you don't have that long, you might like to consider alternatives, such as parsing the output of NCL".
[He did then go on to say:
"The official 'programming interface' for control and monitoring of DECnet Phase V is to make a DECnet connection to CML (session control application 19) and exchange DEC CMIP messages. To be able to make any sense of this you will need:
1. To be a good programmer of DECnet logical links
2. Have a copy of the DEC CMIP spec to give you the general structure of messages
3. Have a copy of 3 or more ISO standards on ASN1 coding and rules
4. Have a copy of the 'Digital' Phase V specs for NSP, Session Control and OSI Transport (3 specifications) to tell you the entities and attributes that are encoded in messages
5. Have the quick wit to recognise the entities and attributes that are not documented (because they were 'invented' after the specs were frozen) and work out what they are
6. Be able to run whatever CTF traces you need to find out what is going on -- because whatever you think, the messages that you actually have to exchange will be nothing like you expect, even after following all the above.
Parsing NCL is not elegant, but you will be able to do it in a reasonable time."]
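In that spirit, parsing the human-readable output is the pragmatic route. A rough sketch that pulls parameter values out of NCP SHOW EXECUTOR CHARACTERISTICS text (the sample lines are paraphrased from memory, so treat the exact layout as an assumption):

```python
import re

def parse_ncp_characteristics(text):
    """Pull 'Parameter = value' pairs out of NCP SHOW EXECUTOR
    CHARACTERISTICS output (output layout assumed from memory)."""
    params = {}
    for line in text.splitlines():
        match = re.match(r"\s*([A-Za-z][A-Za-z0-9 $]*?)\s*=\s*(.+?)\s*$", line)
        if match:
            params[match.group(1)] = match.group(2)
    return params

# Hypothetical sample output, not a verbatim transcript.
SAMPLE = """\
Node Volatile Characteristics as of 27-NOV-2018 10:00:00

Executor node = 1.13 (NODEA)

Outgoing timer       = 60
Retransmit factor    = 10
Inactivity timer     = 60
"""
```

Not elegant, as John said, but it gets the parameter values into a program in a reasonable time.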
So, in short, the target node appeared not to respond for over a minute (I didn't check whether other jobs with successful FAL operations ran at the same time as this job's failed COPY), but it's not clear whether it was a network infrastructure issue, an OpenVMS load issue (I don't believe so), or some artefact of it being CharonVAX running on top of Windows (I haven't looked, and I suspect that the Windows Event Logs are not granular enough to show the NIC, memory, CPU or RAID array problems that good old ERRLOG.SYS on genuine hardware would).
I'll have to chalk it up as another problem that I've never got to the bottom of (fortunately, in 29 years, I'm lucky enough to have still been able to count this on the fingers of one hand; I'll now have to use the other hand for any other similar undiagnosed problems).