Operating System - OpenVMS

Batch jobs stuck in "Starting" state

 
SOLVED


An odd thing happened over the weekend that I have never seen before, so I'm hoping you can shed some light. We had a network event over the weekend: power was lost to our network switches. After that event was handled, one of my operators called to tell me that on one node of my 4-node cluster, batch jobs were stuck in the Starting state. The system was basically idle, with plenty of process slots free; no processes were in unusual states (only LEF and HIB), and all disks were normal. I tried to delete the stuck jobs, but they would only go to the Aborting state. Resetting the queue would clear the entry, but resubmitting the jobs returned them to the Starting state. I eventually rebooted the system to clear the problem!??! Any ideas what would cause this? What other steps can I use to discover the problem? I checked the operator.log file, and no errors were logged. I don't know if the preceding network issue was involved. TIA -Jim
Hakan Zanderau ( Anders
Trusted Contributor

Re: Batch jobs stuck in "Starting" state

A SHOW QUEUE/MANAGER/FULL could have shown whether the queue manager was running or was stuck in a failover or other mode.
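
A minimal sketch of that check, run from a suitably privileged account on the affected node. The queue name SYS$BATCH is an assumption, since the thread never names the queue involved:

```dcl
$ ! Show each queue manager in the cluster, the node it is running on,
$ ! and its current state
$ SHOW QUEUE/MANAGERS/FULL
$ !
$ ! List every job in the batch queue (not just your own) with full
$ ! detail, to see which entries are in which state.
$ ! SYS$BATCH is an assumed queue name; substitute the real one.
$ SHOW QUEUE/BATCH/ALL_JOBS/FULL SYS$BATCH
```

If the manager is reported as running normally, as turned out to be the case here, the queue manager itself is probably not where the problem lies.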

regards,

Hakan Zanderau
HA-solutions
Don't make it worse by guessing.........

Re: Batch jobs stuck in "Starting" state

Thanks. I should have mentioned that I did check the queue manager, and it showed as normal and running on another node of the cluster. In addition, I moved the queue manager to the node with the problem, saw no change in status, and then moved the queue manager back.
Hakan Zanderau ( Anders
Trusted Contributor

Re: Batch jobs stuck in "Starting" state

James,

I would also have tried to restart the queue
on another node in the cluster to see if the problem follows the node or the queue.
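
One way that test might look, as a rough sketch only. NODE_BATCH and NODEB are hypothetical queue and node names, and the move only makes sense if the jobs in the queue can actually run on the other node:

```dcl
$ ! Stop the queue abruptly and return control to the queue manager.
$ ! NODE_BATCH is a hypothetical queue name.
$ STOP/QUEUE/RESET NODE_BATCH
$ !
$ ! Restart the same batch queue so that it executes on a different
$ ! cluster member. NODEB is a hypothetical node name.
$ START/QUEUE/ON=NODEB:: NODE_BATCH
$ !
$ ! Check whether newly submitted jobs still hang in the Starting state
$ SHOW QUEUE/BATCH/ALL_JOBS/FULL NODE_BATCH
```

If jobs start normally on the other node, the problem follows the node; if they still hang, it follows the queue.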

regards,

Hakan Zanderau
HA-solutions
Don't make it worse by guessing.........
Hoff
Honored Contributor

Re: Batch jobs stuck in "Starting" state

Given the version and the ECO level were not posted, it's quite possible that the OpenVMS systems involved are down-version, down-revision, or both.

Historically, the "Stuck in Starting" queue job bug reports are fairly common.

Queue job entries stuck in starting state can be an indication of a down-revision ("buggy") queue manager, and particularly one that has experienced a configuration transition. (Cluster reboot, power failure, rolling upgrade, cluster communications hardware failure, etc.)

It's also possible that there are multiple disjoint queue managers in a cluster, and that can easily cause Badness. (I recommend that there be only one queue manager database in the cluster. See SYLOGICALS.TEMPLATE on V7.2 and later for how to set up the logical names related to the queue manager.)
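
As a hedged sketch of what a single shared queue database looks like in practice: every node defines QMAN$MASTER to point at the same cluster-common directory. CLUSTER_COMMON:[QUEUES] below is a hypothetical location, not one taken from this thread; SYLOGICALS.TEMPLATE remains the authoritative guide.

```dcl
$ ! On each cluster member, check where that node expects to find the
$ ! queue manager master file. The translation should be identical
$ ! on all nodes.
$ SHOW LOGICAL QMAN$MASTER
$ !
$ ! In SYS$MANAGER:SYLOGICALS.COM on every node, the same definition.
$ ! CLUSTER_COMMON:[QUEUES] is a hypothetical device and directory.
$ DEFINE/SYSTEM/EXECUTIVE_MODE QMAN$MASTER CLUSTER_COMMON:[QUEUES]
```

If the logical name is undefined, the master file defaults to SYS$COMMON:[SYSEXE], so nodes booting from different system disks could each end up with their own queue database.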

There are almost certainly still legions of queue manager bugs around, but it's been my experience that the numbers of these bugs tend to be reduced as the local queue manager instantiation approaches "current".

Re: Batch jobs stuck in "Starting" state

Another good suggestion. I found that batch jobs on all the other systems were working fine; only this system seemed to have the problem. I could not move the queue to another system in this case because the jobs running in it were node-specific, i.e. the Rdb database is accessible only on this system. It has me stumped why jobs would not run on this one node until it was rebooted. Very unusual for VMS.

Re: Batch jobs stuck in "Starting" state

Whoops, I'm not sure whether my last post went through, so here goes again. I should have stated earlier that the cluster consists of 4 Alphas (three ES40s and one ES80) running OpenVMS V7.3-2 with Update 15 installed. The primary cluster communications path is MC, with two 100 Mb Ethernet channels as backup. The PE driver is loaded with LAVC monitoring enabled. OPCOM recorded the network failure at about 2am, with recovery around 3:30am. Any other ideas why batch jobs entered the Starting state? -Jim

Re: Batch jobs stuck in "Starting" state

Also, for Hoff's question: the queue manager file is located on a cluster-common, non-system disk along with the other cluster-common files (UAF, rightslist, etc.).
-Jim
Jess Goodman
Esteemed Contributor
Solution

Re: Batch jobs stuck in "Starting" state

Once when this happened at my site I fixed it by restarting the JOB_CONTROL process. From the SYSTEM account on the problem node:

$ STOP JOB_CONTROL
$ @SYS$SYSTEM:STARTUP JOBCTL
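
After a restart like this, a quick check along the following lines can confirm that the job controller is back and that jobs leave the Starting state. This is only a sketch: SYS$BATCH and TEST.COM are assumed names, not taken from the thread.

```dcl
$ ! Confirm that a JOB_CONTROL process exists again on this node
$ PIPE SHOW SYSTEM | SEARCH SYS$INPUT "JOB_CONTROL"
$ !
$ ! Submit a trivial job and watch what state it reaches.
$ ! SYS$BATCH and TEST.COM are hypothetical names.
$ SUBMIT/QUEUE=SYS$BATCH TEST.COM
$ SHOW QUEUE/BATCH/ALL_JOBS SYS$BATCH
```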
I have one, but it's personal.

Re: Batch jobs stuck in "Starting" state

Palm hits forehead. Duh! Thanks, Jess, I don't know why I had not thought of that!
Thanks to everyone who took the time to answer. I'll go ahead and close this for now. -Jim

Re: Batch jobs stuck in "Starting" state

Closing thread.