Process termination mailbox

SOLVED
Malleka Ramachandran
Frequent Advisor

Process termination mailbox

I noticed some apparent synchronization issue in our production code.
The scenario is like this:

A detached process is created to execute a program. After successful creation of this detached process, the application reads the process termination mailbox asynchronously (a QIO with IO$_READVBLK and no function modifiers).
After issuing the QIO, the application (current process) goes to hibernate. The AST routine defined for the QIO read is supposed to wake up the hibernating process.
We received complaints from some of our customers indicating that the application goes into an indefinite wait state occasionally. I investigated this and found that the AST routine does not seem to fire. I am wondering if the process completion never gets written or if the channel gets deassigned so that the read can never complete. Where can I find more information about the details of the process termination mailbox?

Thanks,
Malleka
17 REPLIES
David Jones_21
Trusted Contributor
Solution

Re: Process termination mailbox

Make sure you start the QIO before you create the process. I believe process rundown writes to the mailbox with a no-wait modifier, so you can lose the termination message if you don't have a read pending and the detached process dies quickly.

Make sure the IOSB for the asynchronous read is static, or that it is still valid when the read completes if it is allocated on the stack.
I'm looking for marbles all day long.
Arch_Muthiah
Honored Contributor

Re: Process termination mailbox

Malleka,

Do you mean an indefinite hibernation ("HIB") state? Have you confirmed that the AST is never invoked at all?


Archunan
Regards
Archie
Volker Halle
Honored Contributor

Re: Process termination mailbox

Malleka,

consider checking the subprocess accounting record to see whether, and for how long, it was active.

The process termination mailbox message is written from the DELETE kernel mode AST in the context of the process being deleted, if PCB$L_TMBU (termination mailbox unit number) is non-zero. It's an asynchronous $QIO with the IO$M_NOW modifier, so it won't wait for the reader to read the message.

Your main process might get stuck if the subprocess has been deleted before the read-QIO AST was set up, if the subprocess dies very early (before PCB$L_TMBU has even been set up), or if there is an error sending the termination mailbox message.

When the main process is hung, look at the termination mailbox device (MBAxxx:) and at its operation count with SDA:

$ ANAL/SYS
SDA> SET PROC/ID=
SDA> SHOW PROC/CHAN
SDA> SHOW DEV MBAx:

An operation count of 0 would indicate that no message has been written. An operation count of 2 would indicate that the message was written and read.

Volker.
Ian Miller.
Honored Contributor

Re: Process termination mailbox

Note that SHOW DEVICE/FULL MBAxxx will display
the operation count (no need for SDA :-)

Do start the read on the termination mailbox before creating the process.

Do you know anything more about the state of the process hibernating waiting for the termination message? Are there any outstanding I/O requests?
____________________
Purely Personal Opinion
Malleka Ramachandran
Frequent Advisor

Re: Process termination mailbox

Thanks for your prompt responses, they have been very useful in understanding what was going on in the application.

When the AST does fire, I get the condition in IOSB status as 44, and the count and dev_info fields are both 0s.
What does status 44 (SS$_ABORT) mean?

Thanks,
Malleka
Volker Halle
Honored Contributor

Re: Process termination mailbox

Malleka,

a mailbox read QIO may be terminated with SS$_ABORT (instead of SS$_CANCEL) if the channel to the mailbox is deassigned.

Did you check what happened to the subprocess? Did it get created? Did it terminate with an error?

Volker.
Richard J Maher
Trusted Contributor

Re: Process termination mailbox

Hi Malleka

When does the AST fire? At rundown or do you have any $cancels in the code?

Regards Richard Maher
Willem Grooters
Honored Contributor

Re: Process termination mailbox

One reason I can think of for the connection being aborted: the detached process runs into a fatal condition that causes the OS to intervene, so writing to a termination mailbox is out of the question. An ACCVIO (access violation) might be a reason for the program to be aborted abruptly.
Your detached process should keep its own log (a logfile, for instance) so that you can find out the cause.
Another way to find out is to check accounting for the termination of that process. I think it will show the actual final status, which should (IIRC) not be SS$_ABORT.

BTW: Good VMS programming practice prescribes checking the IOSB, and not just in the case of asynchronous access ;)

Willem
Willem Grooters
OpenVMS Developer & System Manager
Robert Gezelter
Honored Contributor

Re: Process termination mailbox

Malleka,

To amplify what Willem mentioned in his earlier posting.

Proper OpenVMS programming practice requires two checks:

- When doing the SYS$QIO[W] call, the checking of the RETURN code (R0) from the system call
- When IO completion occurs (and not before the completion is indicated by either the AST or the event flag; see my comments about interfaces in my architecture and AST-related speeches available at http://www.rlgsc.com/presentations.html ; to put it simply, until the completion is indicated by the kernel, the contents of the IOSB ARE UNDEFINED).

Although you appear to be getting a completion code in the IOSB, you may actually be seeing pre-existing junk data. The IOSB contents are only valid IFF (If and Only If; as the mathematicians say) you invoked one of the QIO (or QIO-like services) AND got a successful completion code. The Success completion refers to the queueing of the operation, not its ultimate success (which is indicated by the IOSB contents upon signaled completion).
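The two checks can be sketched in portable C, so that the logic compiles on any system. This is only an illustration under stated assumptions: the `iosb_t` layout follows the status, count and dev_info fields mentioned earlier in the thread, while `vms_success`, `check_io` and the `io_result` enum are hypothetical names. On OpenVMS the first value would be the return status of SYS$QIO and the second would come from the real IOSB, read only after the AST or event flag has signalled completion. The sketch relies on the OpenVMS convention that a condition value with its low bit set denotes success.

```c
#include <stdint.h>

/* Assumed IOSB layout: status, transfer count, device-dependent info. */
typedef struct {
    uint16_t status;
    uint16_t count;
    uint32_t dev_info;
} iosb_t;

/* Hypothetical result codes for this sketch. */
enum io_result { IO_OK, IO_NOT_QUEUED, IO_FAILED };

/* OpenVMS convention: odd condition values are successes. */
static int vms_success(uint32_t status)
{
    return (status & 1u) != 0;
}

/*
 * Check 1: the status returned by the queueing call itself.
 * Check 2: the IOSB status, which is meaningful only if check 1
 * passed and completion has already been signalled.
 */
static enum io_result check_io(uint32_t qio_return, const iosb_t *iosb)
{
    if (!vms_success(qio_return))
        return IO_NOT_QUEUED;   /* request never queued; ignore the IOSB */
    if (!vms_success(iosb->status))
        return IO_FAILED;       /* queued, but the I/O itself failed */
    return IO_OK;
}
```

With an IOSB status of 44 (the SS$_ABORT value reported above) and a successful queueing status of 1, `check_io` returns `IO_FAILED`; if the queueing status itself is even, the IOSB is not consulted at all.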

You also haven't mentioned a variety of other factors (e.g., whether this is a multi-processor, the relative process priorities) which could also produce erratic results depending upon system load, among other things.

- Bob Gezelter, http://www.rlgsc.com
Ian Miller.
Honored Contributor

Re: Process termination mailbox

Bob,
I thought the IOSB status field was set to 0 (SS$_PENDING) when the I/O request was started, i.e. when SYS$QIO had returned success.

____________________
Purely Personal Opinion
Robert Gezelter
Honored Contributor

Re: Process termination mailbox

Ian,

Yes, I believe that you are correct about the pending status. However, I am being conservative (I suppose you could ignore event flags AND ASTs and just poll the IOSB, but I would not recommend it).

However, in light of some of the practices that I have seen, it pays to be cautious. I will admit that I do not have access to a source listing where I am (and I do not have my IDSM book handy), but I would suspect that the guarantees about atomic update of the IOSB are only good within the Requesting Process, and (I will defer to somebody who can check the code easily) possibly with some other qualifications (Yes, I have seen some very interesting code over the years).

I can, with certainty, state that when the AST is queued and/or the Event Flag is set, the IOSB contents are completely valid (and thus, I have never relied upon the pending status check).

- Bob Gezelter, http://www.rlgsc.com
Volker Halle
Honored Contributor

Re: Process termination mailbox

Bob, Ian,

the IOSB is being probed and cleared during QIO processing, before the actual IO operation is even being started.

As you are not supposed to specify the same IOSB for multiple concurrent outstanding operations, atomic updates are irrelevant.

Volker.
Robert Gezelter
Honored Contributor

Re: Process termination mailbox

Volker,

Actually, atomic updates ARE an issue, but not in that way.

I have seen far too many cases of code that presumes that a data structure is atomically updated by a different thread, when in reality there is no such guarantee.

In this case, my concern is the potential for misaligned data structures, multiprocessors, and other such situations (the Event Flag and/or AST guarantee that the IOSB is completely valid).

When I taught AST programming, I tried to warn people to expect some things that they might not expect to happen. For example, a common COBOL practice (I said COMMON, not good) is to use character variables as switches (e.g. strings containing "YES" or "NO ") rather than binary integers.

Such use fails to take note of the fact that character string copies are non-atomic on virtually ALL computing architectures (including the three relevant ones for OpenVMS: VAX, Alpha, and IA64). When using such strings for synchronization between AST level code and mainline code, it is possible to encounter string values other than the expected, to wit: "YO ", "NES", and "NOS". Similar behaviors can be seen on multiprocessors with improperly aligned data, and complex data (thus my extremely cautious recommendations for coding practices). When they occur, these problems can be devilish to identify and correct.
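A small portable C sketch of how those intermediate values arise. It assumes, purely for illustration, that the string is moved one byte at a time; real compilers and hardware may move it in other chunk sizes, but any multi-step copy exposes some intermediate state. The names `copy_and_observe` and `SWITCH_LEN` are hypothetical, and the snapshots stand in for what an AST routine could see if it interrupted the copy.

```c
#include <stddef.h>
#include <string.h>

#define SWITCH_LEN 3   /* length of "YES" and "NO " */

/*
 * Copy src over dst one byte at a time and record what an observer
 * would see after each byte has been stored. dst must be a
 * NUL-terminated buffer of SWITCH_LEN characters; seen receives
 * SWITCH_LEN NUL-terminated snapshots.
 */
static void copy_and_observe(char dst[SWITCH_LEN + 1],
                             const char src[SWITCH_LEN + 1],
                             char seen[SWITCH_LEN][SWITCH_LEN + 1])
{
    for (size_t i = 0; i < SWITCH_LEN; i++) {
        dst[i] = src[i];                        /* one step of the copy */
        memcpy(seen[i], dst, SWITCH_LEN + 1);   /* snapshot, including NUL */
    }
}
```

Overwriting "YES" with "NO " passes through "NES" and "NOS" before reaching "NO "; overwriting "NO " with "YES" passes through "YO " and "YE ". A single aligned integer switch avoids these in-between values.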

Hence, my comments.

- Bob Gezelter, http://www.rlgsc.com

Malleka Ramachandran
Frequent Advisor

Re: Process termination mailbox

Hello all,

Thanks for all your help. I'm sorry I could not get back to you earlier.

Robert, the process priority and processor environment are not issues. The application environment creates multiple sessions, each of which creates detached processes at the same priority of 4. I used a 4100 Alphastation with no multiprocessor options to reproduce the problem.

While trying to build a reproducer to post to this forum, I noticed something strange in the code. The mailbox unit number argument to the $CREPRC call, which is obtained from earlier calls to $CREMBX and $GETDVI (termbox_num in the attached code), was declared as an int, while an unsigned short is expected. Our application uses the same sequence. What happens is that two mailboxes are created and an asynchronous read request is issued to the second one. When $CREPRC was invoked, the mailbox unit number contained garbage instead of the actual unit number returned from detach_mbx_create. After I changed it to short, both this reproducer and my application consistently give correct results. In the reproducer I can only verify that the AST fires; in the actual application, I looked at the IOSB values when the AST fired, and everything seems to be OK.
The SS$_ABORT condition in the IOSB is a totally different story. Again I apologize for not being responsive. As an alternative to the termination mailbox read, there is a portion of code which gets executed optionally. What it does is, check the newly created detached process every second and as soon as the status of nonexistent process is received, wake up the current process. Also, (not shown in the reproducer), immediately after the $hiber, there are two calls to $DASSIGN to deassign the channels to the above two mailboxes. I think this was causing the CANCEL wherein I was getting the SS$_ABORT.
I do have another question. I am not sure if the $WAKE with a 0 argument indicating the current process is guaranteed to work. I am working on that now, to pass the PID of the current process to the AST. I see some strange code there, which makes me think that it was originally intended to use the PID in the wake call but was abandoned for some reason. I don't know if they changed their mind because the 0 option is good or because there were other issues.

Thanks,
Malleka


Malleka Ramachandran
Frequent Advisor

Re: Process termination mailbox

I am not sure how to include multiple attachments, the actual c code did not seem to get in my earlier reply, here it is.
Volker Halle
Honored Contributor

Re: Process termination mailbox

Malleka,

termbox_num is declared as int (longword) and is allocated on the stack. It's not being initialized/cleared before usage, so the initial value becomes whatever was on the stack at that address 80(FP) before (for my test case it's 7AE315B0).

detach_mbx_create declares *mbx_unit as short (word) and will therefore only update the contents of the low-order longword on the stack where termbox_num is supposed to be stored (in my case it's 0x381f = unit number of MBA14367)

After the call to detach_mbx_create, the contents of termbox_num are used as an int (longword) again, so you get the correct value in the low-order word, but the previous contents of the high-order word of that longword at 80(FP) - in my test case it becomes 7AE3381F.

Now this value for termbox_num is being passed to $CREPRC by value and it (the WHOLE longword !) will actually be stored in the PCB$L_TMBU field of the subprocess's PCB - I've verified this ! When terminating the sub-process and trying to write the termination mailbox message to that unit number (0x7AE3381F), this operation will - of course - fail, the code in SYSDELPRC is also using a MOVL.
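The arithmetic can be checked with a small portable C sketch. It uses masks rather than a pointer cast so that it does not depend on byte order; the function names `store_low_word` and `as_word` are hypothetical, and the constants are the values from my test case above.

```c
#include <stdint.h>

/*
 * Model of a callee that stores a 16-bit unit number into a 32-bit
 * variable: only the low-order word is replaced, and the high-order
 * word keeps whatever stale value was there before.
 */
static uint32_t store_low_word(uint32_t stale, uint16_t unit)
{
    return (stale & 0xFFFF0000u) | unit;
}

/* What a word-sized (unsigned short) variable would carry instead. */
static uint16_t as_word(uint32_t value)
{
    return (uint16_t)(value & 0xFFFFu);
}
```

Here `store_low_word(0x7AE315B0, 0x381F)` gives 0x7AE3381F, the bogus unit number that ends up in PCB$L_TMBU, whereas `as_word` of that value gives 0x381F, i.e. 14367, the real mailbox unit.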

But you have apparently discovered an OpenVMS bug !!!

$ HELP SYS $CREPRC ARGUMENT clearly states:
...
mbxunt

OpenVMS usage:word_unsigned
type: word (unsigned)
access: read only
mechanism: by value

BUT the code in [SYS]syscreprc handles this parameter as a LONGWORD, thus causing your problem to surface.

The VAX version of SYSCREPRC is 'o.k'.

MOVW MBXUNT(AP),PCB$W_TMBU(R10)

but the Alpha version is 'wrong':

MOVL MBXUNT(AP),PCB$L_TMBU(R10)

or the documentation (HELP) is wrong and it's your fault ;-)

The C prototype in the V8.2 system service reference manual shows this:

C Prototype
int sys$creprc ( ..., unsigned short int mbxunt, ...);

which matches the documented usage of mbxunt as a WORD.


Using $WAKE without a PID should be o.k., if you want to wake yourself.

Volker.
Volker Halle
Honored Contributor

Re: Process termination mailbox

Malleka,

the second paragraph in my previous reply should read:

detach_mbx_create declares *mbx_unit as short (word) and will therefore only update the contents of the low-order WORD on the stack where termbox_num is supposed to be stored (in my case it's 0x381f = unit number of MBA14367)

Volker.