Documentation on I/O Request Queue entries displayed by SDA?

Mark_Corcoran · ‎04-02-2017

I spent most of Thursday trying to do a post-mortem on an issue with an application which wasn't resolved by restarting it.

The ultimate cause appears to be a process which was locking resources (either a Global Section, a shareable image, or an RMS record lock in a transaction file queried by many other processes).

The process was created from an inbound Telnet connection, and had a BG and TNA device allocated to it (amongst other devices & channels).

When I looked at the associated TNA device in SDA, the I/O request queue had the following reported:

STATE IRP PID MODE CHAN FUNC WCB EFN AST IOSB STATUS

C 83A17600 004B0051 E FF20 0381 833891FE 63 8328E688 7FE7E0C0 0201
unload bufio,termio

I appreciate that SDA allows you to view the contents of numerous internal structures used by different parts of OpenVMS, so one can't expect the SDA documentation to detail the format of those structures or what values particular elements may have.

I have tried to find details on the I/O request queue, but can't find anything that provides that same detail as what SDA is showing above (perhaps SDA is reporting a subset of a mixture of structures?).

I am presuming that the MODE value of E indicates Executive mode, and that the FUNC value of 0381 is in some way related to the enumerated "unload bufio,termio"...

Another I/O request queue entry from another process had a FUNC value of 0020 and an enumerated "writelblk bufio" - the 0020 (32 decimal) appears to match the bit used for IO$_WRITELBLK.

0381 would mean 4 bits being set, and IO$_UNLOAD is bit 0, which still leaves bits 7, 8 & 9 to be accounted for.

IODEF has a number of IO$M_* modifiers which share the same value, but the particular modifier being used would likely be dependent on the context (i.e. the type of device - whether it is disk, tape, mailbox, network &etc.)

The TNA device is a Telnet terminal device provided by UCX, and I don't believe that UCX sockets & services API documentation goes into this level of depth (because one shouldn't need to know it).

It looks to me like UCX was attempting to "unload" the TNA device (because it had received a disconnect from the originating PC), and for whatever reason, the unload never completed (meaning that the process was still locking resources used by other processes).

I don't suppose anyone has any more detailed information on the significance of IO$_UNLOAD for a TNA device, and under what circumstances it might not complete (the process was in a LEF state)?

Mark

[Formerly appearing as woeisme]

Volker Halle · ‎04-04-2017

Mark,

SDA itself has no knowledge of TCPIP internals. But there is the TCPIP$SDA SDA extension, which may show a little bit more details. Try SDA> TCPIP HELP

To understand more of the OpenVMS internal data structures, the OpenVMS source listing would be helpful, in addition to the OpenVMS Internals and Data Structure manual (IDSM). But as far as I know, TCPIP source listing are not available.

Volker.

Ian Miller. · ‎04-06-2017

I note that the SDA extension PWAIT$SDA ( from Freeware collections and http://encompasserve.org/~miller/ )

displays information about IRPs but nothing TCPIP specific. The source code of PWAIT$SDA may give you some other hints on decoding IRPs but that's not the same as having the IDSM and the listings.

Did you look at locks held by the process?

I guess you don't have the necessary support contract to be able to ask HPE - if you do then use it.

I see that EXE$GL_ABSTIM_TICS is copied to PCB$L_WAITIME when a process is put into a wait state so it is the start time of the wait. EXE$GL_ABSTIM_TICS contains the number of ticks (10 millisecond intervals) since boot. Could this use

EXE$GQ_BOOTTIME (boot time in VMS binary format) to determine when the wait started as a datetime.

____________________
Purely Personal Opinion

Mark_Corcoran · ‎04-07-2017

Hi Volker, thanks for your reply and apologies for my belated response.

We do have the TCPIP SDA extension (well, actually, UCX$SDA.EXE, because that's how old the stack is) - I had a look, but it doesn't appear to show any more information than UCX SHOW DEVICE_SOCKET BGnnnn:, and in this case, it's the TNA device that was really the issue (SHOW DEVICE_SOCKET will show you that there is an associated TNA device, but gives no other details on it).

I do have the IDSM book, but it is ~200 miles away at home, and I'm not sure it would shed any more light on the problem.

> as far as I know, TCPIP source listing are not available
Even if it was included in the listings, I doubt that employer/customer would pay for them, just to be able to confirm that the cause is what I suspect (old, buggy UCX not tearing down network connections).

[Formerly appearing as woeisme]

Mark_Corcoran · ‎04-07-2017

Hi Ian, thanks for the reply and the PMs...

>Did you look at locks held by the process?
I did, but I don't really know enough about lock management to determine the relevance of it (the process was in a LEF state, as were many other processes, which appeared to be waiting on a "resource" held by this process).

I had presumed that the process was locking something in an application transaction history file.

The source code (Fortran) indicates that it calls the open() routine with a parameter of useropen= where the specified function opens the file in a manner that permits it to read locked records, and that its reading of records does not cause any record locking (it basically sets the RAB$V_NLK and RAB$V_RRL bits in RAB$L_ROP before calling SYS$OPEN then SYS$CONNECT).

However, looking at the snapshot of FCB for the file at the time of the problem, suggests not:

FCB Address: 83B28B00
-----------
FCBFL: 83A52180 SIZE: 0130
FCBBL: 83A6C340 TYPE: 07
EXFCB: 83A63A40 WLFL: 83A26780
REFCNT: 0003 3. ACNT: 0003 3.
WCNT: 0000 0. LCNT: 0000 0.
TCNT: 0000 0. ACCLKMODE: 00
STATUS: 0000
SEGN: 0000 0. STVBN: 00000001
HDLBN: 000CB43F STLBN: 00000000
FILESIZE: 0003A7AC EFBLK: 0003A7AC
VERSIONS: 0000 0. DIRINDX: 00000000
DIRSEQ: 0000 ACCLKID: 00000000
LOCKBASIS: 00000046 TRUNCVBN: 00000000
CACHELKID: 00000000 HIGHWATER: 0003A7AD
HWM_UPDATE: 0000 0. HWM_PARTIAL: 0000 0.
HWM_ERASE: 0000 0. HWM_WAITFL: 83B28B68
FID: [0046,0015,0000]
HWM_WAITBL: 83B28B68

The REFCNT (FCB$W_REFCNT) element has a value of 3, indicating 3 accessors of the file across the node.
The ACCLKMODE (FCB$B_ACCLKMODE) element has a value of 0, indicating that the highest lock mode of any accessor of the file is none.
The LCNT (FCB$W_LCNT) element has a value of 0, indicating no accessors have the file locked against writers.
The STATUS (FCB$W_STATUS) element has a value of 0, indicating that none of the status bits (specifically, FCB$V_RMSLOCK) are set - so no RMS locking is enabled (by this process).
The ACCLKID (FCB$L_ACCLKID) element has a value of 0, indicating that this process has no access lock enabled for the file.
The CACHELKID (FCB$L_CACHELKID) element has a value of 0, indicating that this process has no cache interlock lock enabled for the file.
The HWM_UPDATE, HWM_PARTIAL and HWM_ERASE (FCB$W_HWM_UPDATE, FCB$W_HWM_PARTIAL and FCB$W_HWM_ERASE) elements are all 0, indicating that there are no write operations in progress that affect the highwater mark, no highwater mark erase operations in progress, and no partially validated erase operations).
The HWM_WAITFL and HWM_WAITBL (FCB$L_HWM_WAITFL and FCB$L_HWM_WAITBL) elements have the same value, indicating that there are no pending read or write operations that affect the highwater mark for the file (or at least, not in this process).

There are 14 locks that the process has; there appear to be two "primary" locks (with , one having 7 sub-locks, and one having 5 sub-locks.

The first primary lock appears to be for the transaction history file:

Lock data:

Lock id: 71000D0C PID: 003C0159 Flags: VALBLK SYNCSTS SYSTEM
Par. id: 00000000 SUBLCKs: 7 NODLCKW NODLCKB EXPEDIT
LKB: 839D7240 BLKAST: 00000000
PRIORTY: 0000

Granted at NL 00000000-FFFFFFFF

Resource: 00150046 24534D52 RMS$F... Status:
Length 26 494C5050 41020000 ...APPLI
Exec. mode 00202020 20202043 C .
System 00000000 00000000 ........

Local copy

I don't know the way that RMS creates resource names, but the 150046 is the reverse of the FID (46,0,15) for the file (or is that just coincidence?)

None of the flags in any of the lock entries appears to be different from the others, all of the Status values are blank, all of the Lock Block AST addresses are zero, and they all have a priority of zero.

I'm beginning to think that it wasn't this file after all, but something else.

I did get the output of a SHOW PROCESS /ALL from another process at the time (which I thought was equally responsible for the "hang", as they both had channels open to the transaction file), but I don't know what to look for (and if there's even enough information) to determine what the event flag it was waiting for, was actually related to (if there was a dump of the stack (I don't think SDA does/can give this), then I might be able to work out the function that it was in prior to the (effecive) $WAITFR call).

>I see that EXE$GL_ABSTIM_TICS is copied to PCB$L_WAITIME when a process is put into a wait state so it is the start time of the wait.

I had thought that if I knew when the processes had gotten into a wait state, I might be able to discern some other event occurring on the system (from accounting records, security audit journal, application log files).

Unfortunately, I didn't do a FORMAT of the PCB for both of the suspect processes in SDA before killing them, and whilst I did do a SHOW PROCESS /ALL after having redirected the output to a file, SDA doesn't appear to proactively go through control blocks that the process has, and dump them.

The output does record a "Starting wait time" of "1B001B16", but this doesn't appear to be related to PCB$L_WAITIME, and seems to be common (or very similar values) across many processes (as I don't have source listings, I don't know where SDA is deriving this value from, or its significance).

I think this is problem might have to remain on my (thankfully) small "unsolved" pile.

[Formerly appearing as woeisme]

Volker Halle · ‎04-07-2017

Try the PWAIT$SDA SDA extension next time (if it's an Alpha or I64 system).

Or look for Busy channels (SDA> SHOW PROC/CHAN) or waiting locks (SDA> SHOW PROC/LOCK - waiting locks will always be shown FIRST, so if the first lock is granted, the process is NOT waiting for a lock). It's typically easier, to work your way from a HANGING process instead of from the one, you ASSUME to be causing the problem.

Volker.

Categories

Company

Local Language

Forums

Discussions

Forums

Discussions

Forums

Discussions

Forums

Discussions

Forums

Discussions

Discussions

Forums

Forums

Discussions

Forums

Discussions

Forums

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Community

Resources

Other HPE Sites

Discussions

Forums

Blogs

Documentation on I/O Request Queue entries displayed by SDA?

Documentation on I/O Request Queue entries displayed by SDA?

Re: Documentation on I/O Request Queue entries displayed by SDA?

Re: Documentation on I/O Request Queue entries displayed by SDA?

Re: Documentation on I/O Request Queue entries displayed by SDA?

Re: Documentation on I/O Request Queue entries displayed by SDA?

Re: Documentation on I/O Request Queue entries displayed by SDA?