02-20-2019 07:00 AM - edited 02-20-2019 06:48 PM
We have a number of locations on the plant floor where a per-station process communicates with a PLC, and as a backup, has a VT terminal for data entry if the PLC fails to read information from a bar code or RFID tag.
On the last two consecutive days, we have had problems with one station, where a read queued on the LTA device for the VT terminal has failed (the $QIO returns SS$_NORMAL, but the IOSB status word has indicated either SS$_BADESCAPE or SS$_PARTESCAPE).
In such conditions (IOSB status is neither SS$_NORMAL nor SS$_ABORT), the original developers opted to call $QIOW using the I/O function IO$_TTY_PORT with the IO$M_LT_DISCON function modifier, then call $DASSIGN to deassign the channel that had been allocated to the LTA device for communication with the VT terminal.
After that, a call is made to reconnect the LTA device: call $ASSIGN to assign a channel to it, then call $QIO using the I/O function IO$_TTY_PORT with the function modifier IO$M_LT_CONNECT. The $QIO is passed the address of a completion AST routine which, if the IOSB status is SS$_ABORT, SS$_TIMEOUT or SS$_DEVACTIVE, calls $SETIMR with the address of the connection routine so that the connection is retried after 10 seconds.
When either the SS$_BADESCAPE or SS$_PARTESCAPE status is encountered, the code functions as intended, but the attempts to reconnect the LTA device continually fail with a status of SS$_DEVACTIVE.
The I/O User's Reference Manual explains the meaning of SS$_DEVACTIVE as "The QIO request is for a port already in use. The LAT port driver rejects the request immediately."
I've managed to reproduce the problem easily enough on a test system: for SS$_BADESCAPE, just hit CTRL-[ twice, and for SS$_PARTESCAPE, hit CTRL-[ and then hold down any of the "intermediate" characters permitted in an escape sequence (%X20 - %X2F); it certainly works with '$'.
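To make the two reproduction cases concrete, here is a minimal sketch in portable C of how a buffer could be classified as a complete, partial or bad escape sequence. It assumes only the simplified form ESC, zero or more intermediates (%X20 - %X2F), then one final character (%X30 - %X7E); it ignores the control-sequence and other special forms that the terminal driver also recognizes. The names `esc_result` and `classify_escape` are hypothetical and are not part of any VMS interface.

```c
#include <assert.h>
#include <stddef.h>

/* Outcome of checking a buffer that should start with ESC (hypothetical names). */
typedef enum { ESC_COMPLETE, ESC_PARTIAL, ESC_BAD } esc_result;

/* Classify buf[0..len) against the simplified form:
 * ESC, zero or more intermediates (0x20-0x2F), one final (0x30-0x7E).
 * Anything after the final character is ignored. */
esc_result classify_escape(const unsigned char *buf, size_t len)
{
    size_t i = 1;

    assert(buf != NULL || len == 0);

    if (len == 0 || buf[0] != 0x1B)
        return ESC_BAD;                 /* does not start with ESC */

    while (i < len && buf[i] >= 0x20 && buf[i] <= 0x2F)
        i++;                            /* skip intermediate characters */

    if (i == len)
        return ESC_PARTIAL;             /* ran out before a final character */

    if (buf[i] >= 0x30 && buf[i] <= 0x7E)
        return ESC_COMPLETE;            /* valid final character */

    return ESC_BAD;                     /* e.g. a second ESC */
}
```

Under these assumptions, ESC ESC classifies as bad, and ESC followed only by '$' classifies as partial, which matches the two keyboard sequences described above.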
I had forgotten about the SS$_BADESCAPE status when I looked at the I/O User's Reference Manual for SS$_PARTESCAPE:
"If SS$_PARTESCAPE is returned, the application program must issue enough single-character read requests, without timeout, to read the remaining characters in the escape sequence, while parsing the syntax of the rest of the escape sequence. Use of the TRM$_ESCTRMOVR item code prevents SS$_PARTESCAPE errors."
When the code issued the $QIOW for IO$M_LT_DISCON, it was only testing the return status of the $QIOW call, not the IOSB status word, so I had assumed that perhaps the LAT disconnect had failed, possibly because of the remaining unread characters that the I/O User's Reference Manual said needed to be read.
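The two-level check described here can be sketched in portable C as follows. This is an illustration only: `iosb_t`, `status_ok` and `effective_status` are hypothetical names, the structure is a stand-in for the generic IOSB layout, and real code would use the SS$_ symbols from the system definitions rather than raw numbers. The one VMS fact relied on is that a condition value indicates success when its low bit is set.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a VMS I/O status block (hypothetical, generic layout). */
typedef struct {
    unsigned short status;   /* final completion status of the I/O */
    unsigned short count;    /* transfer count, unused in this sketch */
    unsigned int   devinfo;  /* device-dependent data, unused here */
} iosb_t;

/* A VMS condition value signals success when its low bit is set. */
static int status_ok(unsigned int status)
{
    return (status & 1u) != 0;
}

/* Return the status that should drive error handling after a
 * synchronous request: the service status if the request was never
 * accepted, otherwise the completion status stored in the IOSB. */
unsigned int effective_status(unsigned int service_status, const iosb_t *iosb)
{
    assert(iosb != NULL);

    if (!status_ok(service_status))
        return service_status;   /* request was rejected outright */

    return iosb->status;         /* request ran; report how it ended */
}
```

With a helper like this, a disconnect whose service call returns success but whose IOSB carries a failure status is reported as a failure instead of being silently accepted.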
[The LTA terminal is configured /NOESCAPE, so I'm wondering why the terminal driver is even attempting to check the escape sequence (though the terminal is also configured /NOPASTHRU – not sure if that is altering the terminal driver's behaviour in relation to the /NOESCAPE)]
I made some changes to the code on the test system, to add extra diagnostics, and to check the IOSB status on the LAT disconnect.
The IOSB status on the LAT disconnect was SS$_NORMAL. Enabling a pause in the code, to allow me enough time to examine the process in SDA before it makes the LAT reconnection attempt, confirms that the channel to the LTA device is disconnected and no longer appears in a SHOW PROCESS /ID=pid /CHANNEL.
Issuing a SHOW PORT n STATUS on the DECserver at the time confirms that the "Current Node:" and "Current Port:" values are blank.
Likewise, issuing an MC LATCP SHOW PORT LTAnnnn: shows blank values for the Actual Port Name and Actual Node Name, and a SHOW DEVICE LTAnnnn: /FULL shows a Reference Count of zero.
I thought that perhaps the unread data from an SS$_PARTESCAPE might be causing some kind of internal confusion state within LAT, so changed the code further to call $CANCEL prior to doing the IO$M_LT_DISCON and $DASSIGN (quicker than the suggested method of reading characters one at a time).
Still, the SS$_DEVACTIVE condition persists.
What the code is doing under these SS$_PARTESCAPE and SS$_BADESCAPE conditions is no different to what it does prior to process rundown (leastways, not that I've detected).
If I stop and restart the process, the IO$M_LT_CONNECT completes without error.
[It is interesting to note that in the I/O User's Reference Manual, it says that "IO$M_FLUSH_DATA can be specified in the P2 argument to IO$M_LT_DISCON. The flush flag indicates that any data not delivered to the remote device is to be flushed when the disconnect is issued."
Presumably that would equally apply for any data not fully received from the remote device (such as the I/O User's Reference Manual's description of what you should do when the SS$_PARTESCAPE status is encountered).
Unfortunately, the documentation appears to be wrong – there is no IO$M_FLUSH_DATA in $IODEF (only IO$M_FLUSH_TAB and IO$M_FLUSH_OUTPUT), and specifying an I/O function modifier as the P2 parameter (without making any reference to the P1, when P1 is an itemlist address, and P2 is the itemlist length) appears to be a rather odd/futile exercise]
A Google search on SS$_DEVACTIVE is less illuminating than a V1.0 low-energy lightbulb – does anyone have any suggestions as to what might be going on?
[I found OpenVMS v4.6 release notes at https://archive.org/stream/AA-KN06A-TE_VAX_VMS_Release_Notes_Version_4_6/AA-KN06A-TE_VAX_VMS_Release_Notes_Version_4_6_djvu.txt which had the following to say in the 7-2 LAT Port Driver QIO Interface section:
"After you issue an IO$_TTY_PORT!IO$M_LT_DISCON (disconnect QIO), the applications port's UCB momentarily goes off line. If you issue a connect QIO for a remote device immediately after a disconnect QIO, it is also possible that the connect QIO may return a SS$_DEVACTIVE status. In this situation, retry the connect QIO."
However, that's precisely what the code is doing, and whilst I haven't waited for the length of the Universe (Y10K issues notwithstanding) to see if the IOSB status returned might change, certainly during testing I've allowed the reconnect attempts to occur tens of times (with 10-second intervals in between) with no change in IOSB status.
The only thought I have now (which I didn't check during my testing) was something mentioned in a 27-OCT-1990 comp.os.vms post (subject header "Dedicated LAT-service doesn't break connection on port close") by David Potter:
"One thing to beware of is that if the device connected to the LAT port leaves data waiting for it in the terminal server without reading it, a disconnect from the host will cause the port to go into a disconnecting state, but the connection will not actually be dropped until either the data is read out of the terminal server port buffer, or the port is logged out."
This is something that is documented in the DECserver Network Access Software V1.0 Release Notes (AA-PX0QA-TE, March 1993) which has the following information under chapter 5 (the chapter is entitled "Potentially Confusing Behaviour"):
"5.4 Log out of Sessions on Remote or Dynamic Access Ports
A connected session to a Remote or Dynamic access port with output xoff'd may not completely logout. The session remains in a 'disconnecting' state. The port delays complete logout until the port xoff condition is cleared, presumably by the attached device. This is expected behavior so the DECserver can preserve pending output data. This condition can occur when the user disconnects the session when the port is xoff'd.
If the xoff condition on the remote port is not expected to be cleared by the attached device, then the following methods will log out the port:
- Manually log out the port from a privileged port on the remote port DECserver. Two logouts may be required.
- If the remote port and attached device use DTR/DSR signals and with DSRlogout enabled, the port will be logged out if DSR is toggled by the device.
- Port Inactivity Logout timer expires on the remote port."
My recollection is that under such a condition, the DECserver possibly increments the server counter Solicitations Rejected for any new connection requests when the port is in the Disconnecting state (in which case, I would expect that MC LATCP SHOW PORT LTAnnnn: /COUNT would similarly have its Solicitations Rejected count incremented; I believe that when I looked at this yesterday morning, the LAT port's Solicitations Rejected count hadn't been incremented, though will check when I re-test tonight).
[The port does have XON flow control enabled, so it's feasible that it could get into this Disconnecting state]
02-20-2019 07:02 PM - edited 02-20-2019 07:05 PM (Solution)
Before doing my testing again tonight, I looked at the help for LATCP SHOW, and saw that SHOW NODE /STATUS would report on the number of active circuits and sessions (but not the details of them).
I ran my tests again tonight, issuing the LATCP SHOW PORT /COUNT, DECserver SHOW SERVER COUNT and LATCP SHOW NODE /STATUS commands before and after generating an error on the VT terminal.
I observed that the number of Connected Sessions reported by LATCP SHOW NODE /STATUS before the test (when the detached process was running normally) was 4.
When I forced the error, the number of Connected Sessions dropped to 3 when the process dropped its connection to the LTA device for the VT terminal.
After the enforced 2-minute wait (that I'd added to allow me to examine the system before reconnection attempts began), when the process tried to reconnect, the Solicitations Rejected count in LATCP and on the DECserver didn't increase, nor did the reported Connected Sessions count increase.
So, back to the old Sherlock Holmes quote - once you've rejected the impossible, whatever remains - however improbable - must be the truth...
I thought: "Well, maybe LAT hasn't got confused. Maybe what the code is trying to IO$M_LT_CONNECT is already connected, and what I think is being connected isn't actually what the code is attempting to connect."
I looked through the source code to see how the device name to be connected was specified, then searched through the code to see where it was referenced, then I found the bug...
The process calls a logically high level function the developers wrote, to connect the LTA device for the VT terminal, then it calls another logically high level function they wrote, to connect to the LTA device for the PLC.
The two functions are similar, but it appears that rather than encapsulate it in one function with conditional code depending on whether or not it was the VT terminal or the PLC, they copied & pasted the function for the VT terminal, then made changes to it.
Unfortunately, they left in a strcpy() call that copies the device name passed to the function, to a global variable that stores the device name for the VT terminal.
So, if during the life of the process it disconnects the LAT connection to the VT terminal and attempts to reconnect it, the reconnection request ends up specifying the LAT device name for the PLC which (99.99% likely) hasn't been disconnected and consequently, yes, it really is SS$_DEVACTIVE.
When the process was restarted, the IO$M_LT_CONNECT worked because the request to connect the LAT device for the VT terminal was being done anew before the PLC connection attempt overwrote the VT terminal LAT device name with the LAT device name for the PLC.
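The shape of the bug can be sketched in portable C as below. This is not the original source: the function names, the buffer size, the bounded-copy helper and the device names `LTA101:` and `LTA102:` are all hypothetical, and the real LAT connect calls are replaced by comments. It only illustrates how a copy left over from a copy-and-paste overwrites the global that the VT reconnect path later reads.

```c
#include <assert.h>
#include <string.h>

#define DEV_NAME_MAX 32

/* Globals holding the LAT device names (hypothetical names and sizes). */
static char vt_device_name[DEV_NAME_MAX];
static char plc_device_name[DEV_NAME_MAX];

/* Bounded copy so the sketch cannot overrun the name buffers. */
static void set_name(char *dst, const char *src)
{
    strncpy(dst, src, DEV_NAME_MAX - 1);
    dst[DEV_NAME_MAX - 1] = '\0';
}

/* Connect the VT terminal: remembers the name for later reconnects. */
void connect_vt(const char *device)
{
    set_name(vt_device_name, device);
    /* ... assign channel and issue the LAT connect here ... */
}

/* Buggy copy of connect_vt(): the leftover copy clobbers the VT name. */
void connect_plc_buggy(const char *device)
{
    set_name(vt_device_name, device);   /* leftover line from the paste */
    set_name(plc_device_name, device);
    /* ... assign channel and issue the LAT connect here ... */
}

/* Fixed version: only the PLC's own name is stored. */
void connect_plc_fixed(const char *device)
{
    set_name(plc_device_name, device);
    /* ... assign channel and issue the LAT connect here ... */
}

/* Name that a later VT reconnect attempt would use. */
const char *vt_reconnect_target(void)
{
    return vt_device_name;
}
```

Calling `connect_vt("LTA101:")` followed by `connect_plc_buggy("LTA102:")` leaves `vt_reconnect_target()` pointing at the PLC's device, so any later VT reconnect asks for a port that is already in use. With `connect_plc_fixed()` the VT name is left alone.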
Now that I've found it, I can add the fix to a bunch of other work I need to do for that process.
It's a latent bug that's been around since 1993, has probably reared its head in the past, but was just dismissed as one of those things that you can fix with a process restart.
All it needed was a fat-fingered line-side operator to bash the keyboard in a fit of rage or to leave something heavy lying on the keyboard, and for me to be involved in supporting the system...
04-30-2019 03:18 PM
Re: OpenVMS/VAX V6.2 Persistent SS$_DEVACTIVE when attempting IO$M_LT_CONNECT
"It is interesting to note that in the I/O User's Reference Manual, it says that 'IO$M_FLUSH_DATA can be specified in the P2 argument to IO$M_LT_DISCON. The flush flag indicates that any data not delivered to the remote device is to be flushed when the disconnect is issued.'"
That should be LAT$M_FLUSH_DATA, which is defined in $LATDEF as 1.
05-07-2019 08:39 AM
Re: OpenVMS/VAX V6.2 Persistent SS$_DEVACTIVE when attempting IO$M_LT_CONNECT
Thanks for the update, Jess.
Having just checked, I see LAT$M_FLUSH_DATA is actually mentioned in the VMS V5.5 release notes (https://archive.org/stream/h42_VMS_Version_5.5_Release_Notes/VMS_Version_5.5_Release_Notes_djvu.txt)
A quick check of the V8.4 I/O User's Reference Manual indicates that the same error (IO$M_FLUSH_DATA vs LAT$M_FLUSH_DATA) persists.
Additionally, the description leaves something to be desired: the table of function codes / arguments / modifiers doesn't specify any arguments for the IO$M_LT_CONNECT and IO$M_LT_DISCON modifiers for the IO$_TTY_PORT function code, so even if it were corrected to say that LAT$M_FLUSH_DATA is used in the P2 argument, there would still be no indication of what (if anything) should be in the P1 argument...
I had said that P1 was an itemlist address, and whilst that's true in many cases, it's not true in all cases; even if P1 isn't an itemlist address in the case of IO$M_LT_DISCON, it would seem a bit odd for P1 to be unused, and to simply skip to P2 to set the LAT$M_FLUSH_DATA bit.
I don't expect the documentation to be fixed any time soon, though if someone had a better explanation of the usage of LAT$M_FLUSH_DATA (e.g. a small reproducer), I'd happily incorporate the use of LAT$M_FLUSH_DATA in the code here.