Operating System - OpenVMS

SDIH1
Frequent Advisor

Circumstances f$context will not find an existing process

Hi,

A batch job submits another batch job, which starts running immediately, as seen in the log. f$context is used to get the PID of the submitted batch job.

Lately (after finally applying update 21 on VMS 7.3-2), we have experienced two occurrences where f$context couldn't find the process, and five occurrences where it worked fine.


As this only occurs with a specific process, I suspect some kind of RW.. condition or resource problem causing f$context to not find it.

Workaround ideas are of course retry loops, sync/entry etc, but I'm really more interested in the problem that might cause this.

As the exact scenario is hard to reproduce, although I will definitely try, I was wondering if anyone had similar experiences?
John Gillings
Honored Contributor

Re: Circumstances f$context will not find an existing process

Jose,

What criteria are you using to construct your context? (process name, node, etc...) This may have an influence. Some things can't be read externally if the process is in a resource wait state.

Note that when you submit a batch job, the local symbol $ENTRY is set to the entry number of the job you just created. Since the job belongs to you, F$GETQUI DISPLAY_ENTRY will work without having to set queue context. Therefore, a (probably) cleaner way to get the PID you want is:

$ SUBMIT your-job
$ pid=F$GETQUI("DISPLAY_ENTRY","JOB_PID",$ENTRY)

If the job completes immediately, before the F$GETQUI, or it hasn't started, the symbol pid will be set to a null string.
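
A minimal sketch of how the null-string case could be handled with a bounded poll. The 5-second interval, the limit of 12 attempts, and the label names are arbitrary assumptions, not part of the original suggestion:

```dcl
$ SUBMIT your-job
$ tries = 0
$POLL:
$ ! JOB_PID is a null string while the job has no process yet,
$ ! and also once the job has completed
$ pid = F$GETQUI("DISPLAY_ENTRY","JOB_PID",$ENTRY)
$ IF pid .NES. "" THEN GOTO FOUND
$ tries = tries + 1
$ IF tries .GE. 12 THEN GOTO GIVE_UP      ! arbitrary limit (assumption)
$ WAIT 00:00:05                           ! arbitrary interval (assumption)
$ GOTO POLL
$FOUND:
$ WRITE SYS$OUTPUT "Entry ''$ENTRY' is running as PID ''pid'"
$ EXIT
$GIVE_UP:
$ WRITE SYS$OUTPUT "Entry ''$ENTRY' had no process after ''tries' attempts"
$ EXIT
```

A null string alone does not distinguish "not started yet" from "already finished", so a short-running job may need an additional check.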
A crucible of informative mistakes
Jon Pinkley
Honored Contributor

Re: Circumstances f$context will not find an existing process

Jose,

I have never tried using f$context/f$pid to determine the process pid of a submitted job. For that I would probably use f$getqui("display_entry","job_pid",$entry). Of course you would need to synchronize with the process being created, since you are never guaranteed that the batch job will ever start (queue stopped, busy, etc.). For that I would use Ken Coar's LOCK utility.

My guess would be that the process may have been created, but not yet established all the context. This would be more likely to happen when the system was heavily loaded with processes whose base priority was greater than the /base_priority of the queue the job was submitted to.

Can you think of anything "different" about this specific process that is experiencing the problem?

Jon
it depends
Craig A
Valued Contributor

Re: Circumstances f$context will not find an existing process

Jose

Have there been any changes with regards to privileges or UIC group number on any related accounts? Are you having to make use of the WORLD or GROUP privilege to gain the process info?

Craig
SDIH1
Frequent Advisor

Re: Circumstances f$context will not find an existing process

This is used to establish the context (the PID loop is left out):

$ pname = "BATCH_''$entry'"
$ ctx = ""
$ temp = F$CONTEXT ("PROCESS", ctx, "NODENAME", "*","EQL")
$ temp = F$CONTEXT ("PROCESS", ctx, "PRCNAM", "''pname'","EQL")
...
$ pid = F$PID(ctx)
...

This has worked for the past 6 years. The f$context - f$getjpi construct also makes the subroutine usable for non-batch processes.

The f$getqui suggestion is nice, thanks.
I wouldn't be surprised if this would show the same behaviour, though, but we can see what happens. I seem to remember that you never establish queue context with display_entry, but that's off topic.

There is no danger of the job not starting; all jobs run in a dedicated queue that accommodates more than enough entries.
Introducing lock utilities in a job can be appropriate when strict syncing is required, but that is not the case here. It's just a check to see if the job is running and doing something.

There were no changes to privileges, and the job runs under the same user.

The system is not overly loaded at the time, and I don't think any process would not have established context after 10 seconds, which is the wait time introduced after the first time it went wrong.

As the submitted batchjob spawns a subprocess with a virtual terminal running a screen scraper, the batchjob itself is quite busy managing processes around.

Has nobody ever seen f$context miss a process in such a way? If so, what was done to remedy this?
Hoff
Honored Contributor

Re: Circumstances f$context will not find an existing process

You're basically writing a job scheduler. Piecemeal. In DCL. With no synchronization among processes. (To quote Rocket J Squirrel, "Bullwinkle, that trick never works!")

For this case, this could well be a race condition, and this sort of scheduling-related DCL is often vulnerable to timing changes. Upgrades (for instance) can and variously do change the timing (either slower or even faster), and thus are accordingly notorious for exposing races in existing code. Further, DCL itself doesn't do asynchronous event handling at all well, and process and batch management is inherently asynchronous.

My usual longer-term solution is to bring a scheduler package on-line, or to start treating all the individual pieces of DCL that end up scattered around as part of a home-grown scheduler, and to bring all these chunks of DCL under some sort of control.

As a potential work-around for what you're working with right here, use a (shared) RMS file as a database for watching the activity. DCL can deal with that, as can the (cooperating) batch job itself. That's if you can't bring a full scheduler on-line.

As for figuring out what happened here, check the timing on the accounting or audit logs. My guess is that the DCL missed the window.

It's fully possible to have two processes with the same process name, and there was (is?) a race within VMS where you can end up with two processes with the same process name in the same UIC group. Which means the process name information is of suspect value at best. Put another way, if the batch process involved here is cooperating, then code it to cooperate here.
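
One way such cooperation could look, as a minimal sketch: the submitter passes a file name as P1, and the batch job writes its own PID into that file as its first action. The file name, the use of P1, and the poll limits are assumptions made for illustration, and SYS$SCRATCH is assumed to translate to the same directory in both processes.

In the submitted procedure:

```dcl
$ ! First lines of the submitted procedure; P1 is the flag file name (assumed convention)
$ OPEN/WRITE flag 'P1'
$ WRITE flag F$GETJPI("","PID")
$ CLOSE flag
```

In the submitting procedure:

```dcl
$ flagfile = "SYS$SCRATCH:JOBFLAG_" + F$GETJPI("","PID") + ".TMP"
$ SUBMIT/PARAMETERS=("''flagfile'") your-job
$ tries = 0
$TRY_OPEN:
$ ! OPEN fails while the file is missing or still held open by the writer
$ OPEN/READ/ERROR=NOT_YET flag 'flagfile'
$ READ/END_OF_FILE=EMPTY flag jobpid
$ CLOSE flag
$ DELETE 'flagfile';*
$ WRITE SYS$OUTPUT "Job reports PID ''jobpid'"
$ EXIT
$EMPTY:
$ CLOSE flag
$NOT_YET:
$ tries = tries + 1
$ IF tries .GE. 12 THEN EXIT              ! arbitrary limit (assumption)
$ WAIT 00:00:05                           ! arbitrary interval (assumption)
$ GOTO TRY_OPEN
```

Because the PID comes from the job itself, this does not depend on the process name being unique or visible at a particular moment. A stale flag file left over from an earlier run would be read as if it were current, so the name should be unique per submission.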
SDIH1
Frequent Advisor

Re: Circumstances f$context will not find an existing process


Hi, I personally wouldn't qualify submitting a batchjob and check to see what it is doing as writing a job scheduler.

The problems I have seen with (sometimes expensive) job schedulers far exceed any problems I have seen with submitting batch jobs.

Granted, if there are dozens or hundreds of batchjobs with all kinds of dependencies, you need a robust scheduler, and a nice and really working GUI to manage it.

I know the workaround: just retry, and f$context will find the job. But in effect, that is only a workaround.

I have checked the timing windows, and that does not seem to be the issue. Like I said, the last time I saw f$context not find the process was 10 seconds after submitting the job, which runs for about 3 minutes.
I checked both accounting and audit to verify the process was running at that time.

I have in the past tried to create processes with the same name under the same UIC, and this will happen when creating, say, 100 processes with the same name in a tight loop. Apparently, there is a window of opportunity to do so during process creation.

As the process name in this case is BATCH_4444, this is not an issue here.

What I am curious about is under what circumstances f$context would fail to find
a process by name, and what might be done so that f$context will always find it.

Hoff
Honored Contributor

Re: Circumstances f$context will not find an existing process

>Hi, I personally wouldn't qualify submitting a batchjob and check to see what it is doing as writing a job scheduler.

And what would you consider to be a scheduler?

A scheduler would clearly submit jobs.

It would monitor the jobs for (at least) completion.

And most any scheduler also has to deal with various error conditions.

In this case, this DCL looks to have already met two out of the three, and you're clearly working on the third.

These home-grown schedulers inherently grow organically. Usually dark corners, piecemeal, and without much light. Which means you walk in one day and usually right after Bad has happened, and discover you have dozens or hundreds of lines of task-targeted DCL scattered around the environment like so much kudzu, rather than having a generic solution.

Your scheduler here clearly has a race condition.

You're going to get to figure out what that is, too.

And your options include coordination with the process being monitored (which is probably the easiest, though it involves modifying DCL or jacketing existing procedures with scheduling code), or up-rate your scheduler to tie into the system audits, or look to use termination mailboxes, or your scheduler is going to be polling and walking the $getqui lists. Or some combination. Or a commercial package or a tailored version of the cron or kronos freeware.
Craig A
Valued Contributor

Re: Circumstances f$context will not find an existing process

Jose

What is it that you are trying to achieve by this usage of F$CONTEXT ?

Could it be achieved another way? Logical name? Flag file, etc..?

Craig
SDIH1
Frequent Advisor

Re: Circumstances f$context will not find an existing process

Hi,

Of course Hoff is right, I shouldn't be so stubborn.

I found the race condition.

By comparing the audit log and accounting, I found that audit reports a process start time that is 22 seconds later than the accounting start time. As f$context was run 20 seconds after job start, this explains why it failed to find the process.

In the audit log:
Process Start : 21:56:26.45
Process End : 21:58:06.71

In accounting:
Job Start : 21:56:04.77
Job End : 21:58:08.13

Why the process needs 22 seconds to take off is not (yet) very clear to me, but there you go.

If anyone has any suggestions, feel free!