Operating System - OpenVMS

batch jobs starting before they are supposed to

 
TMcB
Super Advisor

Hi all

Our batch jobs are starting 15 seconds before they are supposed to on a node in our cluster.

The queue manager runs on a different node, but the times are the same on all nodes (we run NTP). I've checked, and all times appear to be in sync throughout the cluster.

Has anyone seen this before?

Thanks
Craig A
Valued Contributor
Solution

Re: batch jobs starting before they are supposed to

At the top of your batch job I would put in a call to another bit of DCL that does:

$ SET NOON
$ MC SYSMAN
SET E/C
DO SHOW TIME
CONFIG SHOW TIME
EXIT
$ EXIT

Just to *prove* that this is the case.

Which node does the queue manager run on?

Craig
Hoff
Honored Contributor

Re: batch jobs starting before they are supposed to

The cluster is either skewed by 15 seconds or so, or there's something seriously weird with whatever OpenVMS version and whatever patch level is in use. Patch to current, certainly.

A local rule of thumb: always look under the rocks that "appear" unrelated to the problem, and always look at anything that "appears" correct. Confirm that the area is or is not correct.

And here specifically, confirm the (lack of) skew in the cluster:

SYSMAN> SET ENVIRONMENT /CLUSTER
SYSMAN> DO SHOW TIME

Whether the NTP servers are locked is of less interest, as I've seen many cases of skewed NTP times among pools of servers, too.
abrsvc
Respected Contributor

Re: batch jobs starting before they are supposed to

There is not enough information to provide you any concrete feedback here. Can you post the output from the following:

1) SHO QUE/A (for the scheduled job)

2) Accounting info that shows the time the job actually started.
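
For reference, a rough sketch of commands that would gather this. SYS$BATCH is only a placeholder queue name and /SINCE=YESTERDAY is an assumed time window, so adjust both to suit your site.

```dcl
$ ! Placeholder queue name: substitute the queue the job is held on
$ SHOW QUEUE/ALL_JOBS/FULL SYS$BATCH
$ ! Batch records from the default accounting file (SYS$MANAGER:ACCOUNTNG.DAT)
$ ACCOUNTING/TYPE=BATCH/SINCE=YESTERDAY/FULL
```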

Thanks,
Dan
TMcB
Super Advisor

Re: batch jobs starting before they are supposed to

Thanks, guys.
The "conf show time" showed a time difference on this node.

I need to look into why NTP didn't correct this time difference.

Thanks so much for replying - I really do appreciate it
John Gillings
Honored Contributor

Re: batch jobs starting before they are supposed to

T,
No matter how good your synching, there is always the possibility of a time discrepancy between cluster nodes. Maybe (hopefully!) not as large as 15 seconds, but certainly large enough to potentially fail a test like:

$ IF F$CVTIME(F$TIME()).LTS.F$CVTIME(ExpectedStartTime)

Remember that synchronizing time across nodes is not continuous, and there's always a tolerance threshold, most likely larger than the granularity of display time format (0.01 seconds).

I would usually code some tolerance into a test like the above. Choose the maximum your time synch code will allow and write the test like this (I've assumed 5 seconds):

$ IF F$CVTIME(F$TIME()+"+0-0:0:5.0").LTS.F$CVTIME(ExpectedStartTime)
$ THEN
$ ! job has started too early
$ ENDIF
A crucible of informative mistakes
abrsvc
Respected Contributor

Re: batch jobs starting before they are supposed to

If the starting time is critical, then code as John suggests should be included. If the key is that the start time be after midnight, then checking the "local node" time followed by a $Wait is in order. At most, the time difference will be in seconds. A little math and a short delay should result in the start time you require.
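
A minimal sketch of that kind of guard, assuming a symbol ExpectedStartTime (as in John's test) that holds the intended absolute start time; the sample value in the comment is made up. Rather than doing the arithmetic, it simply polls the local clock once a second until the time has arrived:

```dcl
$ ! Sketch only. ExpectedStartTime is assumed to hold an absolute time
$ ! string, for example "18-JUN-2024 00:00:00.00".
$ loop:
$   ! F$CVTIME returns comparison format by default, so a string compare works
$   IF F$CVTIME(F$TIME()) .GES. F$CVTIME(ExpectedStartTime) THEN GOTO ready
$   WAIT 00:00:01
$   GOTO loop
$ ready:
$ ! ... the real work of the job starts here ...
```

A job released a few seconds early then idles until the clock on the node it is running on agrees that the start time has been reached.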

Dan
Paul Gotkin
New Member

Re: batch jobs starting before they are supposed to

This is really quite simple. The first node in the cluster that reaches your batch job release time will cause the job to start. You have already deduced the 15 second difference on another node as the root cause.

If you'd like to verify this, queue a job, then move the clock past the batch job release time on any other cluster node and watch your batch job start. Try moving the clock ahead on various cluster nodes, one test at a time; your job should kick off regardless.

This is a known VMS cluster thingy (technical term), although I don't remember seeing it documented anywhere. The first node in a cluster that reaches the batch job release time will cause the job to run, regardless of which node has the job queued.

One thing I have not tried is to use multiple queue managers within the cluster, although I'm guessing the behavior will be identical.
John Gillings
Honored Contributor

Re: batch jobs starting before they are supposed to

re: Paul,

>The first node in the cluster that reaches
>your batch job release time will cause the
>job to start.

I don't think this is correct. My understanding is it is the clock on the node running QUEUE_MANAGER which determines when a job will start. The issue is, you can't necessarily predict which node will be the queue manager.

Think about the implementation: do you really think anyone would replicate all the queue timer events on every cluster node and then attempt to deal with all the potential race conditions? Especially when there is a pervasive assumption in OpenVMS that clocks across a cluster will always be synchronized. Such a model is WAY more complex than necessary and would cause significantly more problems than it would resolve (indeed, what problem(s) would it be a solution for?).

The documentation is fairly specific about the possibility of jobs starting early (indeed, it hints that Paul's observation may be correct; if so, it's news to me!).

See $ HELP SUBMIT/AFTER

...

In an OpenVMS Cluster, a batch job submitted to execute at a specific time may begin execution a little before or after the requested time. This occurs when the clocks of the member systems in the OpenVMS Cluster are not synchronized. For example, a job submitted using the DCL command SUBMIT/AFTER=TOMORROW may execute at 11:58 P.M. relative to the host system's clock.

This problem can occur in a cluster even if a job is run on the same machine from which it was submitted, because the redundancy built into the batch/print system allows more than one job controller in the cluster to receive a timer asynchronous system trap (AST) for the job and, thus, to schedule it for execution. Moreover, this behavior is exacerbated if the batch job immediately resubmits itself to run the next day using the same SUBMIT command. This can result in having multiple instances of the job executing simultaneously because TOMORROW (after midnight) might be only a minute or two in the future.

A solution to this problem is to place the SUBMIT command in a command procedure that begins with a WAIT command, where the delta-time specified in the WAIT command is greater than the maximum difference in time between any two systems in the cluster. Use the SHOW TIME command on each system to determine this difference in time. Use the SYSMAN command CONFIGURATION SET TIME to synchronize clocks on the cluster. For complete information on the SYSMAN command CONFIGURATION SET TIME, see the HP OpenVMS System Management Utilities Reference Manual.
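
As a rough sketch of the workaround that HELP text describes (not taken from the documentation itself): the five-minute delta is purely an assumed upper bound on the skew between any two cluster members, and resubmitting via F$ENVIRONMENT("PROCEDURE") is just one common way for a procedure to name itself.

```dcl
$ ! Sketch only. 00:05:00 is an assumed maximum clock difference between
$ ! any two cluster members; measure yours with SHOW TIME on each node.
$ WAIT 00:05:00
$ ! By now every node should agree on the date, so TOMORROW means tomorrow.
$ SUBMIT/AFTER=TOMORROW 'F$ENVIRONMENT("PROCEDURE")'
$ ! ... the real work of the job follows ...
```

Note that F$ENVIRONMENT("PROCEDURE") returns the full file specification including the version number, so the resubmitted job runs that same version of the procedure.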
A crucible of informative mistakes