Re: Circumstances f$context will not find an existing process

Hoff · ‎07-20-2010

And your options include coordination with the process being monitored (which is probably the easiest, though it involves modifying DCL or jacketing existing procedures with scheduling code),

or up-rate your existing scheduler to tie into the system audits and receive completion audits,

or look to use termination mailboxes,

or modify your scheduler to do better with polling and with walking the $getqui lists for entries. (Many variations exist, such as setting the queue's retain on completion setting, and your scheduler then gets to clean that stuff out...)

Or some combination of these.

These scheduling tasks have a nasty habit of turning into the continuing Adventures in Wonderland, and you'll drop down a particularly good rabbit hole once or twice a year with the daylight saving time switch-overs, and also once every four years with leap year.

If you can't buy a scheduler, then actually think about the generic task and then design and write one. Or twist kronos or cron or something else to meet your needs. Rather than a point solution, start working on a more general solution that fixes this specific problem to start with. Then add to it.

I've been known to use a shared RMS indexed file to coordinate a whole herd of batch processes. The submitting process writes records into the file, and the batch processes then open and share and update their own records as they reach various checkpoints, and the submitter can then sort out the sequencing. Or resubmit jobs. This is based on a periodic scan of the shared file. Ugly in the extreme, but functional.

VMS is extremely weak in this area, unfortunately.

abrsvc · ‎07-20-2010

I have successfully used a mailbox for synchronization as follows:

1) Make sure the mailbox exists
2) Submit the batch job.
3) Post a read to the mailbox

In the batch job:
1) Post a write to the mailbox as the trigger for the submitter to continue.

The above is a simplistic description of the function, but hopefully you get the idea. I have used executable programs rather than DCL to implement the mailbox communication, along with calls to lib$spawn to submit the jobs. This way you can have timeouts handled, etc.

Dan

Hoff · ‎07-20-2010

>The above is a simplistic description of the function, but hopefully you get the idea. I have used executable programs rather than DCL to implement the mailbox communication, along with calls to lib$spawn to submit the jobs. This way you can have timeouts handled, etc.

>Further, DCL itself doesn't do asynchronous event handling at all well, and process and batch management is inherently asynchronous.

http://www.eight-cubed.com/examples/framework.php?file=sys_queue.c

Kronos also has (Fortran) code that uses $sndjbcw calls, IIRC.

John Gillings · ‎07-20-2010

Jose,

Having dealt with this kind of issue many times, the problem is NOT as simple as it might first appear. Code can get very complex.

Reinforcing Hoff's comments: you need to think very carefully about synchronising processes. Sure, you can get away with simplistic mechanisms most of the time, but eventually something will happen to break your assumptions about the timings.

Try to find the most direct way to get the information you need. If it's PID, use F$GETQUI, rather than searching with F$CONTEXT/F$PID. This makes at least one side more predictable, cheaper, and less likely to fail due to weird things like RW states.
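A minimal sketch of the F$GETQUI route, assuming the job was submitted from the same procedure so that the entry number is available in $ENTRY. WORKER.COM is a hypothetical procedure name, and the check on JOB_EXECUTING is there because the entry may still be pending when it is inspected:

```dcl
$ SUBMIT WORKER.COM                      ! WORKER.COM is a hypothetical procedure
$ entry = $ENTRY                         ! SUBMIT leaves the entry number in $ENTRY
$ IF F$GETQUI("DISPLAY_ENTRY","JOB_EXECUTING",entry)
$ THEN
$   pid = F$GETQUI("DISPLAY_ENTRY","JOB_PID",entry)
$   WRITE SYS$OUTPUT "Entry ''entry' is running as PID ''pid'"
$ ELSE
$   WRITE SYS$OUTPUT "Entry ''entry' is not executing (pending, finished, or gone)"
$ ENDIF
```

Because the lookup is keyed on the entry number, no process-name search is involved at all.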

You also need to deal with the timing issues: what happens if the process is delayed starting, and what happens if it completes quicker than you expect? Note that no amount of WAIT delays will ever guarantee synchronisation; you need some kind of handshake. You will probably also need some kind of sanity timeout in case something causes the job to get stuck, fail to start, hang, loop, exit unexpectedly, etc.

Consider using SUBMIT/RETAIN=ALWAYS. That means you can use SYNCHRONIZE or F$GETQUI, even after the job has completed, to check progress or completion status, at the cost of having to explicitly delete the entry.
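The following is one possible shape for that, not a tested recipe. It polls the retained entry with a sanity timeout and then picks up the completion status. WORKER.COM, the 5-second poll interval, and the 10-minute limit are all assumptions made for the sketch:

```dcl
$ SET NOON                               ! handle statuses ourselves
$ SUBMIT/RETAIN=ALWAYS WORKER.COM        ! WORKER.COM is a hypothetical procedure
$ entry = $ENTRY                         ! SUBMIT leaves the entry number in $ENTRY
$ tries = 0
$poll:
$ IF F$GETQUI("DISPLAY_ENTRY","JOB_RETAINED",entry) THEN GOTO finished
$ tries = tries + 1
$ IF tries .GE. 120 THEN GOTO stuck      ! assumed limit: 120 * 5 s = 10 minutes
$ WAIT 00:00:05
$ GOTO poll
$finished:
$ SYNCHRONIZE/ENTRY='entry'              ! job already done, so this should not block
$ job_status = $STATUS                   ! completion status of the batch job
$ DELETE/ENTRY='entry'                   ! retained entries must be removed explicitly
$ WRITE SYS$OUTPUT "Job ''entry' completed with status ''job_status'"
$ EXIT
$stuck:
$ WRITE SYS$OUTPUT "Job ''entry' not finished after 10 minutes"
$ EXIT
```

If no timeout is needed, SYNCHRONIZE/ENTRY on its own simply blocks until the job completes.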

One other point... think very carefully about your assumptions!

>As the process name in this case is
>BATCH_4444, this is not an issue here

Remember that process names are only unique within the same UIC group. Maybe it's unlikely, but it's possible that another user could change their process name to "BATCH_4444" (or whatever the name of your target process is), either accidentally or maliciously. If you have sufficient privilege, and their PID happened to precede your target process, you could end up hooking onto the wrong process. Who knows what weirdness might result? These kinds of "it will never happen" things tend to happen at the worst possible time!

The same would be true for any mechanism that relies on identifying a specific process by name, since process names are not necessarily unique. On the other hand, batch entry numbers are guaranteed unique (even beyond job lifetime if you use /RETAIN=ALWAYS). You can therefore positively identify your process, regardless of privilege, or any actions by other users.

A crucible of informative mistakes

SDIH1 · ‎07-20-2010

Hi,

Thanks for the hints. I do have a considerable arsenal of goodies that would do a better job, like a utility that shouts a message to a doorbell lock and another one that waits for one or a number of messages for a configurable timeout to all appear on the doorbell lock. As it is old, I have to verify how well it is actually written and how well it works.

I also have an audit message listener lying around somewhere, which has the advantage that modifications to existing jobs are minimal to none.

I was bitten by this particular job as it has run for years without problems, despite the crude syncing/scheduling whatever you want to call it.

I still haven't figured out why the process only starts after 22 seconds, but I saw that the job that SMS'es the error of the first job needed a similar time to actually start.

It could be network problems, but I suspect some kind of cluster communication contention. Locking would be the first thing to look at in the monitoring logs, I guess.

There were about 20 job completions/starts in the same minute, mostly on another node of the cluster, and I also noticed the queue manager was running on that other node. I failed it over to the node these jobs run on (start/queue/manager was enough, as that node is first in the queue manager's node list) in an attempt to reduce any communication contention, whatever the cause.

Hoff · ‎07-20-2010

If it's a cluster, time-skews among the hosts participating in the queue can arise.

These skews are why /AFTER=TOMORROW can lead to some surprises, particularly if the node running the queue executing the job is a minute or two behind the node that's releasing jobs for the queue manager.

I usually submit stuff for some number of minutes after midnight; outside the likely skew within the cluster.
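As a rough illustration of that approach, a self-resubmitting procedure might requeue itself for a few minutes past midnight. The five-minute margin is an assumption chosen for this sketch, not a figure from this thread:

```dcl
$ ! Resubmit this same procedure for tomorrow at 00:05 rather than exactly 00:00,
$ ! so a node whose clock runs slightly behind is already past midnight.
$ SUBMIT/AFTER="TOMORROW+00:05" 'F$ENVIRONMENT("PROCEDURE")'
```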

With NTP, I'd not expect a skew beyond a minute or two, or usually much less. Though with a reboot that's not set up to deal with NTP quite right, the skew can be large pending the drifting.

Again, skews happen.

Locks are easy and very useful within executables. They're not so easy for coordinating generic DCL processing.

SDIH1 · ‎07-21-2010

>These skews are why /AFTER=TOMORROW can lead to some surprises, particularly if the node running the queue executing the job is a minute or two behind the node that's releasing jobs for the queue manager.

This applies to self-submitting batch jobs that are time-released. The solution to that is quite simple: as you usually do the resubmit as the first task in the batch job, add these lines before the submit command:
$ tmp = f$getqui("")
$ submittime = f$getqui("display_job","after_time",,"this_job")
$ wait 'submittime'

With NTP properly set up, the time on all nodes of the cluster agrees to within a couple of ms, and the problem is hardly likely to occur.

>Though with a reboot that's not set up to
>deal with NTP quite right, the skew can be
>large pending the drifting.

This is done by creating TCPIP$NTP_SYSTARTUP.COM in sys$common:[sysmgr] and calling ntpdate in it, so that time gets synchronized to the NTP time server at boot time, like so: ntpdate ntp.my.network

The jobs I have problems with are not time-released, just submitted to run at once, and time on all cluster nodes is actually the same.

The skew here is not inherent to batch jobs, but comes from some other contention.

Hoff · ‎07-21-2010

>This applies to self-submitting batch jobs that are time-released. The solution to that is quite simple: as you usually do the resubmit as the first task in the batch job, add these lines before the submit command:
$ tmp = f$getqui("")
$ submittime = f$getqui("display_job","after_time",,"this_job")
$ wait 'submittime'

More of that scheduling kudzu? :-)

SDIH1 · ‎07-21-2010

He! You brought up the kudzu !
Don't blame me for unkudzuing it :-D

Pietro Agostino · ‎07-21-2010

If you are running this in a cluster you may want to change f$context from:

$ temp = F$CONTEXT ("PROCESS", ctx, "NODENAME", "*","EQL")

to

$ node = F$GETSYI("NODENAME")
$ temp = F$CONTEXT ("PROCESS", ctx, "NODENAME", node, "EQL")

Have a read of the help on f$context: if you specify "*", it will check all nodes for the process name you specified. For example, if you check for a process called ABC from Node1, and no such process exists on Node1 but one does exist on Node2, the search still reports the process as found and returns its PID using

$ pid = F$PID(ctx)

Not sure if this is happening in your case or not, but it is something to keep in mind if you are using the DCL provided in a clustered environment.
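Put together, a node-restricted lookup might look like the sketch below. It assumes the target process name is BATCH_4444, as quoted earlier in the thread, and that the caller has enough privilege to see the process. It only narrows the search to the local node and does not address the name-uniqueness concern raised above:

```dcl
$ ctx = ""                                   ! start with an empty context symbol
$ node = F$GETSYI("NODENAME")                ! name of the node we are running on
$ temp = F$CONTEXT("PROCESS", ctx, "NODENAME", node, "EQL")
$ temp = F$CONTEXT("PROCESS", ctx, "PRCNAM", "BATCH_4444", "EQL")
$ pid = F$PID(ctx)                           ! first match, or "" if none
$ IF pid .EQS. ""
$ THEN
$   WRITE SYS$OUTPUT "No process BATCH_4444 found on node ''node'"
$ ELSE
$   WRITE SYS$OUTPUT "Found BATCH_4444 on node ''node' with PID ''pid'"
$ ENDIF
$ ! Release the context if the search did not run to the end of the list
$ IF F$TYPE(ctx) .EQS. "PROCESS_CONTEXT" THEN temp = F$CONTEXT("PROCESS", ctx, "CANCEL")
```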
