
Circumstances f$context will not find an existing process

 
SDIH1
Frequent Advisor


Hi,

A batch job submits another batch job, which starts running immediately, as seen in the log. f$context is used to get the PID of the submitted batch job.

Lately (after finally applying update 21 on VMS 7.3-2), we have seen two occurrences where f$context couldn't find the process, and five where it worked fine.


As this only occurs with a specific process, I suspect some kind of RW.. condition or resource problem causing f$context to not find it.

Workaround ideas are of course retry loops, sync/entry etc, but I'm really more interested in the problem that might cause this.

As the exact scenario is hard to reproduce, although I will definitely try, I was wondering if anyone had similar experiences?
21 REPLIES
John Gillings
Honored Contributor

Re: Circumstances f$context will not find an existing process

Jose,

What criteria are you using to construct your context? (process name, node, etc...) This may have an influence. Some things can't be read externally if the process is in a resource wait state.

Note that when you submit a batch job, the local symbol $ENTRY is set to the entry number of the job you just created. Since the job belongs to you, F$GETQUI DISPLAY_ENTRY will work without having to set queue context. Therefore, a (probably) cleaner way to get the PID you want is:

$ SUBMIT your-job
$ pid=F$GETQUI("DISPLAY_ENTRY","JOB_PID",$ENTRY)

If the job completes immediately, before the F$GETQUI, or it hasn't started, the symbol pid will be set to a null string.
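
Because a null string can mean either "not started yet" or "already finished", a bounded poll is one way to handle it. The following is only a sketch: YOUR_JOB.COM, the 2-second interval and the 30-try limit are placeholders, not values taken from this thread.

```dcl
$! Sketch only: YOUR_JOB.COM, the interval and the retry limit are placeholders.
$ SUBMIT YOUR_JOB.COM
$ entry = $ENTRY                  ! save it; a later SUBMIT would overwrite $ENTRY
$ tries = 0
$poll:
$ pid = F$GETQUI("DISPLAY_ENTRY", "JOB_PID", entry)
$ IF pid .NES. "" THEN GOTO have_pid
$ tries = tries + 1
$ IF tries .GE. 30 THEN GOTO no_pid
$ WAIT 00:00:02
$ GOTO poll
$have_pid:
$ WRITE SYS$OUTPUT "Entry ''entry' is running as PID ''pid'"
$ EXIT
$no_pid:
$ WRITE SYS$OUTPUT "Entry ''entry' has no PID (not started yet, or already finished)"
$ EXIT
```

The loop gives up after roughly a minute instead of waiting forever, so a stuck or very short-lived job does not hang the submitter.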
A crucible of informative mistakes
Jon Pinkley
Honored Contributor

Re: Circumstances f$context will not find an existing process

Jose,

I have never tried using f$context/f$pid to determine the process pid of a submitted job. For that I would probably use f$getqui("display_entry","job_pid",$entry). Of course you would need to synchronize with the process being created, since you are never guaranteed that the batch job will ever start (queue stopped, busy, etc.). For that I would use Ken Coar's LOCK utility.

My guess would be that the process may have been created, but not yet established all the context. This would be more likely to happen when the system was heavily loaded with processes whose base priority was greater than the /base_priority of the queue the job was submitted to.

Can you think of anything "different" about this specific process that is experiencing the problem?

Jon
it depends
Craig A
Valued Contributor

Re: Circumstances f$context will not find an existing process

Jose

Have there been any changes with regards to privileges or UIC group number on any related accounts? Are you having to make use of the WORLD or GROUP privilege to gain the process info?

Craig
SDIH1
Frequent Advisor

Re: Circumstances f$context will not find an existing process

This is what I use to establish the context (the PID loop is left out):

$ pname = "BATCH_''$entry'"
$ ctx = ""
$ temp = F$CONTEXT ("PROCESS", ctx, "NODENAME", "*","EQL")
$ temp = F$CONTEXT ("PROCESS", ctx, "PRCNAM", "''pname'","EQL")
...
$ pid = F$PID(ctx)
...

This has worked for the past 6 years. The f$context - f$getjpi construct also makes the subroutine usable for non-batch processes.

The f$getqui suggestion is nice, thanks.
I wouldn't be surprised if it showed the same behaviour, though; we'll see what happens. I seem to remember that you never establish queue context with display_entry, but that's off topic.

There is no danger of the job not starting: all jobs run in a dedicated queue that accommodates more than enough entries.
Introducing lock utilities in a job can be appropriate when strict syncing is required, but that is not the case here; it's just a check to see if the job is running and doing something.

There were no changes to privileges, and the job runs under the same user.

The system is not overly loaded at the time, and I don't think any process would fail to establish its context within 10 seconds, which is the wait time introduced after the first time it went wrong.

As the submitted batchjob spawns a subprocess with a virtual terminal running a screen scraper, the batchjob itself is quite busy managing processes around.

Has nobody ever seen f$context miss a process in such a way? If so, what was done to remedy this?
Hoff
Honored Contributor

Re: Circumstances f$context will not find an existing process

You're basically writing a job scheduler. Piecemeal. In DCL. With no synchronization among processes. (To quote Rocket J Squirrel, "Bullwinkle, that trick never works!")

For this case, this could well be a race condition, and this sort of scheduling-related DCL is often vulnerable to timing changes. Upgrades (for instance) can and variously do change the timing (either slower or even faster), and thus are accordingly notorious for exposing races in existing code. Further, DCL itself doesn't do asynchronous event handling at all well, and process and batch management is inherently asynchronous.

My usual longer-term solution is to bring a scheduler package on-line, or to start treating all the individual pieces of DCL that end up scattered around as part of a home-grown scheduler, and to bring all these chunks of DCL under some sort of control.

As a potential work-around for what you're working with right here, use a (shared) RMS file as a database for watching the activity. DCL can deal with that, as can the (cooperating) batch job itself. That's if you can't bring a full scheduler on-line.

As for figuring out what happened here, check the timing on the accounting or audit logs. My guess is that the DCL missed the window.

It's fully possible to have two processes with the same process name, and there has been (is?) a race within VMS where you can end up with two processes with the same process name in the same UIC group. Which means the process name information is of suspect value at best. Put another way, if the batch process involved here is cooperating, then code it to cooperate here.
SDIH1
Frequent Advisor

Re: Circumstances f$context will not find an existing process


Hi, I personally wouldn't qualify submitting a batchjob and check to see what it is doing as writing a job scheduler.

The problems I have seen with (sometimes expensive) job schedulers far exceed any problems I have seen with submitting batch jobs.

Granted, if there are dozens or hundreds of batchjobs with all kinds of dependencies, you need a robust scheduler, and a nice and really working GUI to manage it.

I know the workaround: just retry, and f$context will find the job. But effectively, that is a workaround.

I have checked the timing windows, and that does not seem to be the issue. Like I said, the last time I saw f$context fail to find the process was 10 seconds after submitting the job, which runs for about 3 minutes.
I checked both accounting and audit to verify the process was running at that time.

I have in the past tried to create processes with the same name under the same UIC, and this will happen when creating, say, 100 processes with the same name in a tight loop. Apparently, there is a window of opportunity to do so during process creation.

As the process name in this case is BATCH_4444, this is not an issue here.

What I am curious about is under what circumstances f$context would fail to find a process by name, and what might be done so that f$context will always find it.

Hoff
Honored Contributor

Re: Circumstances f$context will not find an existing process

>Hi, I personally wouldn't qualify submitting a batchjob and check to see what it is doing as writing a job scheduler.

And what would you consider to be a scheduler?

A scheduler would clearly submit jobs.

It would monitor the jobs for (at least) completion.

And most any scheduler also has to deal with various error conditions.

In this case, this DCL looks to have already met two out of the three, and you're clearly working on the third.

These home-grown schedulers inherently grow organically. Usually in dark corners, piecemeal, and without much light. Which means you walk in one day, usually right after Bad has happened, and discover you have dozens or hundreds of lines of task-targeted DCL scattered around the environment like so much kudzu, rather than having a generic solution.

Your scheduler here clearly has a race condition.

You're going to get to figure out what that is, too.

And your options include coordination with the process being monitored (which is probably the easiest, though it involves modifying DCL or jacketing existing procedures with scheduling code), or up-rate your scheduler to tie into the system audits, or look to use termination mailboxes, or your scheduler is going to be polling and walking the $getqui lists. Or some combination. Or a commercial package or a tailored version of the cron or kronos freeware.
Craig A
Valued Contributor

Re: Circumstances f$context will not find an existing process

Jose

What is it that you are trying to achieve by this usage of F$CONTEXT ?

Could it be achieved another way? Logical name? Flag file, etc..?

Craig
SDIH1
Frequent Advisor

Re: Circumstances f$context will not find an existing process

Hi,

Of course Hoff is right, I shouldn't be so stubborn.

I found the race condition.

By comparing the audit log and accounting, I found that audit reports a process start time that is 22 seconds later than the accounting start time. As f$context was run 20 seconds after job start, this explains why it failed to find the process.

In the audit log:
Process Start : 21:56:26.45
Process End : 21:58:06.71

In accounting:
Job Start : 21:56:04.77
Job End : 21:58:08.13

Why the process needs 22 seconds to take off is not (yet) very clear to me, but there you go.

If anyone has any suggestions, feel free!
Hoff
Honored Contributor

Re: Circumstances f$context will not find an existing process

And your options include coordination with the process being monitored (which is probably the easiest, though it involves modifying DCL or jacketing existing procedures with scheduling code),

or up-rate your existing scheduler to tie into the system audits and receive completion audits,

or look to use termination mailboxes,

or modify your scheduler to do better with polling and with walking the $getqui lists for entries. (Many variations exist, such as setting the queue's retain on completion setting, and your scheduler then gets to clean that stuff out...)

Or some combination of these.

These scheduling tasks have a nasty habit of turning into the continuing Adventures in Wonderland, and you'll drop down a particularly good rabbit hole once or twice a year with the daylight saving time switch-overs, and also once every four years with leap year.

If you can't buy a scheduler, then actually think about the generic task and then design and write one. Or twist kronos or cron or something else to meet your needs. Rather than a point solution, start working on a more general solution that fixes this specific problem to start with. Then add to it.

I've been known to use a shared RMS indexed file to coordinate a whole herd of batch processes. The submitting process writes records into the file, and the batch processes then open and share and update their own records as they reach various checkpoints, and the submitter can then sort out the sequencing. Or resubmit jobs. This based on a periodic scan of the shared file. Ugly in the extreme, but functional.

VMS is extremely weak in this area, unfortunately.
abrsvc
Respected Contributor

Re: Circumstances f$context will not find an existing process

I have successfully used a mailbox for synchronization as follows:

1) Make sure the mailbox exists
2) Submit the batch job.
3) Post a read to the mailbox

In the batch job:
1) Post a write to the mailbox as the trigger for the submitter to continue.

The above is a simplistic description of the function, but hopefully you get the idea. I have used executable programs rather than DCL to implement the mailbox communication, along with calls to lib$spawn to submit the jobs. This way you can have timeouts handled etc.

Dan
Hoff
Honored Contributor

Re: Circumstances f$context will not find an existing process

>The above is a simplistic description of the function, but hopefully you get the idea. I have used executable programs rather than DCL to implement the mailbox communication, along with calls to lib$spawn to submit the jobs. This way you can have timeouts handled etc.

>Further, DCL itself doesn't do asynchronous event handling at all well, and process and batch management is inherently asynchronous.

http://www.eight-cubed.com/examples/framework.php?file=sys_queue.c

Kronos also has (Fortran) code that uses $sndjbcw calls, IIRC.
John Gillings
Honored Contributor

Re: Circumstances f$context will not find an existing process

Jose,

Having dealt with this kind of issue many times, the problem is NOT as simple as it might first appear. Code can get very complex.

Reinforcing Hoff's comments: you need to think very carefully about synchronising processes. Sure, you can get away with simplistic mechanisms most of the time, but eventually something will happen to break your assumptions about the timings.

Try to find the most direct way to get the information you need. If it's PID, use F$GETQUI, rather than searching with F$CONTEXT/F$PID. This makes at least one side more predictable, cheaper, and less likely to fail due to weird things like RW states.

You also need to deal with the timing issues - what happens if the process is delayed starting, and what happens if it completes quicker than you expect? Note that no amount of WAIT delays will ever guarantee synchronisation, you need some kind of handshake. You will probably also need some kind of sanity timeout in case something causes the job to get stuck, fail to start, hang, loop, exit unexpectedly etc...
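
A minimal sketch of a handshake with a sanity timeout, using a flag file that the batch job creates once it is actually running. Everything specific here is an assumption for illustration: the flag file name, YOUR_JOB.COM, the 5-second interval and the 60-try limit. It also assumes that SYS$SCRATCH points to the same directory for the submitter and the batch job (same user, disk visible from both nodes).

```dcl
$! --- Submitter side (sketch; names and limits are placeholders) ---
$ flag = "SYS$SCRATCH:STARTED_" + F$GETJPI("", "PID") + ".FLAG"
$ IF F$SEARCH(flag) .NES. "" THEN DELETE 'flag';*    ! clear any stale flag
$ SUBMIT/PARAMETERS=("''flag'") YOUR_JOB.COM
$ tries = 0
$wait_flag:
$ IF F$SEARCH(flag) .NES. "" THEN GOTO started
$ tries = tries + 1
$ IF tries .GE. 60 THEN GOTO gave_up                 ! sanity timeout
$ WAIT 00:00:05
$ GOTO wait_flag
$started:
$ DELETE 'flag';*
$ WRITE SYS$OUTPUT "Job signalled that it is running"
$ EXIT
$gave_up:
$ WRITE SYS$OUTPUT "No handshake after about 5 minutes; check the queue"
$ EXIT
$!
$! --- Batch job side: first lines of YOUR_JOB.COM ---
$! P1 carries the flag file name chosen by the submitter.
$ OPEN/WRITE flag_file 'P1'
$ CLOSE flag_file
```

The submitter only proceeds once the job itself has said it is alive, so the result no longer depends on how long process creation happens to take; the retry limit covers the case where the job never starts.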

Consider using SUBMIT/RETAIN=ALWAYS. That means you can use SYNCHRONIZE or F$GETQUI, even after the job has completed, to check progress or completion status, at the cost of having to explicitly delete the entry.
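
As a rough sketch of that approach (YOUR_JOB.COM is a placeholder, and note that SYNCHRONIZE on its own has no timeout):

```dcl
$! Sketch only: YOUR_JOB.COM is a placeholder.
$ SUBMIT/RETAIN=ALWAYS YOUR_JOB.COM
$ entry = $ENTRY
$ SET NOON                        ! a failed job must not abort this procedure
$ SYNCHRONIZE/ENTRY='entry'       ! blocks until the entry completes
$ job_status = $STATUS            ! completion status of the synchronized job
$ SET ON
$ WRITE SYS$OUTPUT "Entry ''entry' finished with status ''job_status'"
$! The retained entry stays in the queue until it is deleted explicitly.
$ DELETE/ENTRY='entry'
```

Because the entry is retained, the SYNCHRONIZE still works if the job has already finished by the time it is issued.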

One other point... think very carefully about your assumptions!

>As the process name in this case is
>BATCH_4444, this is not an issue here

Remember that process names are only unique within the same UIC group. Maybe it's unlikely, but it's possible that another user could change their process name to "BATCH_4444" (or whatever the name of your target process), either accidentally or maliciously. If you have sufficient privilege, and their PID happened to precede your target process, you could end up hooking onto the wrong process. Who knows what weirdness might result? These kinds of "it will never happen" things tend to happen at the worst possible time!

The same would be true for any mechanism that relies on identifying a specific process by name, since process names are not necessarily unique. On the other hand, batch entry numbers are guaranteed unique (even beyond job lifetime if you use /RETAIN=ALWAYS). You can therefore positively identify your process, regardless of privilege, or any actions by other users.
A crucible of informative mistakes
SDIH1
Frequent Advisor

Re: Circumstances f$context will not find an existing process

Hi,

Thanks for the hints. I do have a considerable arsenal of goodies that would do a better job, like a utility that shouts a message to a doorbell lock and another one that waits for one or a number of messages for a configurable timeout to all appear on the doorbell lock. As it is old, I have to verify how well it is actually written and how well it works.

I also have an audit message listener lying around somewhere, which has the advantage that modifications to existing jobs are minimal to none.

I was bitten by this particular job as it has run for years without problems, despite the crude syncing/scheduling whatever you want to call it.

I still haven't figured out why the process only starts after 22 seconds, but I saw that the job that SMS'es the error of the first job needed a similar time to actually start.

It could be network problems, but I rather suspect some kind of cluster communication contention; locking would be the first thing to look at in the monitoring logs, I guess.

There were about 20 job completions/starts in the same minute, mostly on another node of the cluster, and I also noticed the queue manager was running on that other node. I failed it over to the node these jobs run on (start/queue/manager was enough, as that node is first in the queue manager's node list) in an attempt to reduce any communication contention, whatever the cause.




Hoff
Honored Contributor

Re: Circumstances f$context will not find an existing process

If it's a cluster, time-skews among the hosts participating in the queue can arise.

These skews are why /AFTER=TOMORROW can lead to some surprises, particularly if the node running the queue executing the job is a minute or two behind the node that's releasing jobs for the queue manager.

I usually submit stuff for some number of minutes after midnight; outside the likely skew within the cluster.

With NTP, I'd not expect a skew beyond a minute or two, or usually much less. Though with a reboot that's not set up to deal with NTP quite right, the skew can be large pending the drifting.

Again, skews happen.

Locks are easy and very useful within executables. They're not so easy for coordinating generic DCL processing.
SDIH1
Frequent Advisor

Re: Circumstances f$context will not find an existing process

>These skews are why /AFTER=TOMORROW can lead to some surprises, particularly if the node running the queue executing the job is a minute or two behind the node that's releasing jobs for the queue manager.

This applies to self submitting batchjobs that are timed released. The solution to that is quite simple: as you usually do the resubmit as first task in the batchjob, add these lines before the submit command:
$ tmp = f$getqui("")
$ submittime = f$getqui("display_job","after_time",,"this_job")
$ wait 'submittime'

With NTP properly set up, time is within a couple of ms the same on all nodes of the cluster, and the problem is hardly likely to occur.

>Though with a reboot that's not set up to
>deal with NTP quite right, the skew can be
>large pending the drifting.

This is done by creating TCPIP$NTP_SYSTARTUP.COM in sys$common:[sysmgr] and calling ntpdate in it, so time gets synchronized to the NTP time server at boot time, like so: ntpdate ntp.my.network

The jobs I have problems with are not time-released, just submitted to run at once, and time on all cluster nodes is actually the same.

The skew here is not inherent to batch jobs, but to some other contention.
Hoff
Honored Contributor

Re: Circumstances f$context will not find an existing process

>This applies to self submitting batchjobs that are timed released. The solution to that is quite simple: as you usually do the resubmit as first task in the batchjob, add these lines before the submit command:
$ tmp = f$getqui("")
$ submittime = f$getqui("display_job","after_time",,"this_job")
$ wait 'submittime'

More of that scheduling kudzu? :-)
SDIH1
Frequent Advisor

Re: Circumstances f$context will not find an existing process

Hey! You brought up the kudzu!
Don't blame me for unkudzuing it :-D
Pietro Agostino
Occasional Visitor

Re: Circumstances f$context will not find an existing process

If you are running this in a cluster, you may want to change f$context from:

$ temp = F$CONTEXT ("PROCESS", ctx, "NODENAME", "*","EQL")
to

$ node = F$GETSYI("NODENAME")
$ temp = F$CONTEXT ("PROCESS", ctx, "NODENAME", node, "EQL")

Have a read of the help on f$context: if you specify "*", it will check all nodes for the process name you specified. For example, if you check for a process called ABC on Node1, where it doesn't exist, but a process with that name exists on Node2, the process is reported as found and its PID is returned by

$ pid = F$PID(ctx)

Not sure if this is happening in your case or not, but it is something to keep in mind if you are using the DCL provided in a clustered environment.
SDIH1
Frequent Advisor

Re: Circumstances f$context will not find an existing process


The check on all nodes is intentional.

The fear of checking the wrong process is definitely real on a lot of systems, but on this system there are about 3 users that have access to the command prompt, who are usually sitting within arm's length and have to pay a lot of snacks if they mess up process names.

SDIH1
Frequent Advisor

Re: Circumstances f$context will not find an existing process

Hi,

I'm closing this thread. To summarize:

The problem was that after a submit a job took 22 seconds to actually start, causing the submitting batch job to fail.

It was pointed out to me that this shouldn't be much of a problem if different tools, or a different approach altogether, were used. Some of the tips were interesting and valuable, thanks.

Although 22 seconds is ridiculous in my view, the underlying problem is that process creation is asynchronous, and any tool that controls and monitors processes and doesn't take this into account will fail one day.

DCL would have to change beyond recognition to be able to accomplish this easily, and writing DCL that tries to do this is difficult and error-prone.