
Multi-threading and Batch Queues

 
Mark C
Occasional Advisor


I have an interesting situation (at least I think so!). I have an executable that has two threads. When I turn on full multi-threading capability, I occasionally get a 295124 (JOBDELETE, job deleted before execution) return code on some jobs that I have submitted to batch queues. The only jobs that have the problem are multiple-file jobs created using 'create_job, add_file, close_job' and then 'synchronize_job' with an AST to set a flag when they complete.

The same jobs never fail if I have linked the executable with /THREADS_ENABLE=UPCALLS or without the /THREADS_ENABLE linker switch at all. They fail only when I specify plain /THREADS_ENABLE, which gives me two kernel threads as I am on a two-CPU Alpha machine (OVMS 7.2-2).

I also run many single file jobs to batch queues using 'enter_file' and 'synchronize_job' and never have a failure on those with full multi-threading.

Any ideas?

Volker Halle
Honored Contributor

Re: Multi-threading and Batch Queues

Mark,

welcome to the OpenVMS ITRC forum.

Do these batch jobs have a batch log file? If so, what kind of error message do you find at the end of the log file?

Volker.
Mark C
Occasional Advisor

Re: Multi-threading and Batch Queues

Thanks for the welcome. I have been a lurker for a while, but haven't had the need to ask a question, till now. This one has me stumped.

Unfortunately, I only get the failures when I run the test repeatedly (kind of a stress test). In this case, the log file that corresponds to the job may be from a previous run. The log file does not have any error data in it, if it is related to this occurrence. I am working on some modifications which will uniquely name the log file so I can tell if it corresponds.

I also get this failure, same scenario, on the same job types, on OVMS 8.2 on an IA64 machine.

I have studied the code and the application logs that are produced and don't see anything 'deleting' the jobs. It also happens very intermittently. If I re-run the same job, it usually completes normally.

I will re-post once I have run with unique log names. If the return code is accurate, the job is not running, so there shouldn't be a log file!

Thanks
Mark
Robert Gezelter
Honored Contributor

Re: Multi-threading and Batch Queues

Mark,

I am not completely clear on your situation.

Your post implies that the batch jobs are created by some form of submission program. Is that submission program the program that you are compiling?

Otherwise, I have similar questions to Volker.

- Bob Gezelter, http://www.rlgsc.com
Mark C
Occasional Advisor

Re: Multi-threading and Batch Queues

Bob,

The executable is part of an Agent that performs tasks on behalf of a scheduler. The agent can be sent interactive tasks (jobs or commands) or tasks that can be submitted to batch queues.

I use sys$sndjbc (create_job, add_file, close_job) to submit the multi-file jobs to the batch queue. I then use sys$sndjbc (synchronize_job) with an AST program to let me know when they are complete and what the job status was.

One thread creates and submits the jobs, the other thread watches for job completion and gathers up the results, logs, etc.

Everything works well (even under my stress testing where I continuously re-run the same jobs) unless I have turned on full multi-threading. I get better throughput with full multi-threading, but I get the occasional failure mentioned above.

Never fails with just UPCALLS, and I get some throughput improvement, but would like to be able to use true full multi-threading!

Cheers,
Mark
Robert Gezelter
Honored Contributor

Re: Multi-threading and Batch Queues

Mark,

While it is a good idea, it is not necessary to create unique log file names.

The version numbering (together with the creation date/time) allows you to sort out which is which.

Make sure that you have limited the version numbering.

- Bob Gezelter, http://www.rlgsc.com
Mark C
Occasional Advisor

Re: Multi-threading and Batch Queues

The version number helps, but the repeats sometimes run pretty fast. Often in the same minute. I have my versioning set to limit it to 5, so usually by the time I have noticed the failure, I no longer have the original log files, just copies that have been sent to the scheduler. There are some time stamps in them, but I want to be absolutely sure I have the correct log file (or none at all if it is failing/being deleted) so I am adding an 'instance number' to the log file name.
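A minimal sketch of the 'instance number' idea, written in Python purely as an illustration. The function name, the zero-padded width, and the underscore separator are all assumptions; the thread does not say how the real agent builds its log file names.

```python
def unique_log_name(job_name, instance):
    """Build a log file name that embeds a per-run instance number.

    Hypothetical naming scheme: JOBNAME_000042.LOG. The point is only that
    two runs of the same job in the same minute get different file names,
    so a version limit of 5 cannot recycle the one you need.
    """
    return f"{job_name}_{instance:06d}.LOG"


# Two back-to-back runs of the same job no longer share a file name.
first_run = unique_log_name("MULTIFILE_JOB", 41)
second_run = unique_log_name("MULTIFILE_JOB", 42)
print(first_run, second_run)
```

With names like these, a missing log file for a given instance is itself evidence that the job never started.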
Mark C
Occasional Advisor

Re: Multi-threading and Batch Queues

Volker, Bob,
There is not a log file created when the status returned is 295124, so the job apparently is being deleted. I still don't understand why these jobs are being deleted intermittently only when they are being submitted by a fully enabled multi-threaded application.

Robert Gezelter
Honored Contributor

Re: Multi-threading and Batch Queues

Mark,

If I am understanding you correctly, then the problem is happening as you submit the job.

You may want to package up the source code, and contact HP support. Be explicit that the problem appears:
- intermittently
- appears to be connected to your use of /THREADS_ENABLE

- Bob Gezelter. http://www.rlgsc.com
Wim Van den Wyngaert
Honored Contributor

Re: Multi-threading and Batch Queues

Did you check accounting to find out why the process failed? Or the audit log? Or the operator log file?

Wim
Volker Halle
Honored Contributor

Re: Multi-threading and Batch Queues

Mark,

to determine who is deleting those jobs prior to execution, you could set an ALARM ACE on the batch queue:

$ set secu/acl=(alarm=security,acc=delete+success)/class=queue batch-queue-name
$ REPLY/ENABLE=SECURITY

You would get an OPCOM security alarm if a job in that batch queue gets deleted.

You can delete the ACE using the same command as above and adding /DELETE.

Volker.
Phil.Howell
Honored Contributor

Re: Multi-threading and Batch Queues

in the synchronize_job function do you use
a) the entry number
b) the queue name + entry number
c) the queue name + job name
If the answer is c), then I suggest that you modify your program to give each job a unique name (this will also give you a unique log file).
help /sync
SYNCHRONIZE

Holds the process issuing the command until the specified job
completes execution.

Requires delete (D) access to the specified job.
(I've always wondered why it needed this)
Phil
David Jones_21
Trusted Contributor

Re: Multi-threading and Batch Queues

I'd look at the accounting file. If a process isn't created, the PID will be zero. It sounds like the queue manager is deleting the job for some reason. Are you sure the close_job is returning successfully?

I'd try queuing the multi-file jobs as held, then have the thread that does the synchronize release them after queuing the synchronize_job AST.

I'm looking for marbles all day long.
Volker Halle
Honored Contributor

Re: Multi-threading and Batch Queues

Mark,

as your program is (intermittently) failing ONLY when it runs as a truly multithreaded process, this seems to imply that there is some synchronization issue in your program OR in the called system services...

Did you try the alarm ACE to determine who actually deletes the batch job entry before its execution?

Volker.
Mark C
Occasional Advisor

Re: Multi-threading and Batch Queues

Thanks everyone for your responses.

Wim, I have checked some logs and found no indication. Perhaps I am not checking all the logs (or the right one).

Volker, I am working on the ACE setup and re-run now. Will definitely let y'all know what that tells me. I have not used this facility before, will I get messages on the terminal where the commands were issued or will information show up in the logs?

Phil, I use queuename + entry number. I have double (and triple!) checked my code to make sure that I am not inadvertently deleting the wrong job at times. My application logs show no indication that that is occurring (and it only happens when full multi-threading is enabled).

David, I check the return code on the close and it is returning a success return code.

Thanks again, I will update later with results.
Volker Halle
Honored Contributor

Re: Multi-threading and Batch Queues

Mark,

to receive the security alarm, you enable your terminal as an operator terminal with $REPLY/ENABLE=SECURITY (needs SECURITY and OPER privilege). You'll then receive security alarms on this terminal/session.

Volker.
Mark C
Occasional Advisor

Re: Multi-threading and Batch Queues

OK folks, I think I'm going to have to give up on this one. I have a reasonable work-around by using /THREADS_ENABLE=UPCALLS and I have convinced myself that it is a bug in OVMS!!!

I appreciate all the suggestions and ideas and have learned a few things, but everything seems to point at a synchronization problem in OVMS itself. I had one job that seemed to be hung for over 2.5 hours and then ended with the 295124 status!

Sometimes my job streams will run for long periods of time with no problems, and sometimes it fails on the first pass.

The output from the Alarm ACE does not give me data that I can correlate with the jobs (I don't have access to the batch PIDs and the ACE alarms don't give me the entry numbers).

%%%%%%%%%%% OPCOM 26-JUL-2006 15:34:31.07 %%%%%%%%%%%
Message from user AUDIT$SERVER on TEST1
Security alarm (SECURITY) on TEST1, system id: 1025
Auditable event: Object access
Event time: 26-JUL-2006 15:34:31.07
PID: 0000040D
Source PID: 0000ACCA
Username: QA
Process owner: [QA]
Object class name: QUEUE
Object name: SYS$BATCH
Access requested: DELETE
Privileges used: BYPASS
Status: %SYSTEM-S-NORMAL, normal successful completion

$
%%%%%%%%%%% OPCOM 26-JUL-2006 15:34:32.75 %%%%%%%%%%%
Message from user AUDIT$SERVER on TEST1
Security alarm (SECURITY) on TEST1, system id: 1025
Auditable event: Object access
Event time: 26-JUL-2006 15:34:32.75
PID: 0000040D
Source PID: 0000A8CA
Username: QA
Process owner: [QA]
Object class name: QUEUE
Object name: SYS$BATCH
Access requested: DELETE
Privileges used: BYPASS
Status: %SYSTEM-S-NORMAL, normal successful completion

The 'Source PID' of A8CA is what I would expect (my program), but I can't find an ACCA in the system. The ACCA seems to be a display problem as it alternates with A8CA and in different runs with different processes for my program, it also reported a bogus PID that was x'400' higher than the one I expected (exactly x'400' on three different runs!).
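The x'400' offset can be checked directly from the two 'Source PID' values in the alarms quoted above. The snippet below is plain arithmetic in Python, not anything VMS-specific, and the variable names are just labels.

```python
# 'Source PID' values copied from the two OPCOM security alarms above.
expected_pid = 0x0000A8CA   # the PID of the submitting program
reported_pid = 0x0000ACCA   # the PID that could not be found on the system

# Hexadecimal distance between the two values.
difference = reported_pid - expected_pid
print(f"difference = x'{difference:X}'")
```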

I have my IT support group looking into reporting this as a problem in sys$sndjbc (create, add, close). The creation of a single file job with 'enter_file' never fails and it does the 'create, add, close' under the covers in one call.

Thanks again for all the help!
Jess Goodman
Esteemed Contributor

Re: Multi-threading and Batch Queues

This may not help, but just to make a few points about the $SNDJBC service clear:

When you do the SJC$_CREATE_JOB call, the job is created and assigned an entry number. It is considered an "open" job at this point. It will not show up with SHOW QUEUE or SHOW ENTRY. However, SYNCHRONIZE/ENTRY= and DELETE/ENTRY= and their equivalent $SNDJBC calls will work.

The SJC$_CLOSE_JOB call places the job on the batch queue so it is eligible for execution. There must first be at least one SJC$_ADD_FILE call.

If a process issues a new SJC$_CREATE_JOB before the SJC$_CLOSE_JOB then the old open job is deleted and its final job status, returned via a SJC$_SYNCHRONIZE_JOB call, will be 295124 (%JBC-F-JOBDELETE, job deleted before execution).

I suspect that this is what VMS believes is happening for some reason, perhaps due to a VMS kernel thread synchronization bug.
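The open-job rule described above can be pictured with a small toy model. This is a sketch in Python, not the queue manager's code. The class `OpenJobModel` and its method names are hypothetical and only mirror the SJC$_CREATE_JOB, SJC$_ADD_FILE and SJC$_CLOSE_JOB steps for a single process.

```python
JOBDELETE = 295124  # %JBC-F-JOBDELETE, job deleted before execution


class OpenJobModel:
    """Toy model of one process's 'open' job (all names are hypothetical)."""

    def __init__(self):
        self.next_entry = 1
        self.open_entry = None    # entry number of the current open job
        self.open_files = []      # files added to the open job so far
        self.queued = []          # entries placed on the batch queue
        self.final_status = {}    # entry -> final status of deleted jobs

    def create_job(self):
        # A second create before the close deletes the previous open job.
        if self.open_entry is not None:
            self.final_status[self.open_entry] = JOBDELETE
        self.open_entry = self.next_entry
        self.next_entry += 1
        self.open_files = []
        return self.open_entry

    def add_file(self, name):
        if self.open_entry is None:
            raise RuntimeError("no open job")
        self.open_files.append(name)

    def close_job(self):
        # The close needs an open job with at least one file added.
        if self.open_entry is None or not self.open_files:
            raise RuntimeError("close_job needs an open job with a file")
        entry = self.open_entry
        self.queued.append(entry)
        self.open_entry = None
        return entry


model = OpenJobModel()
first = model.create_job()
model.add_file("STEP1.COM")
second = model.create_job()      # issued before the first job was closed
model.add_file("STEP1.COM")
model.add_file("STEP2.COM")
closed = model.close_job()
print(model.final_status, model.queued)
```

In this model the first entry ends with 295124 and only the second entry reaches the queue, which is the symptom a synchronize on the first entry would report.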

I would not allow the job's final status to be returned in the I/O status block of the SJC$_SYNCHRONIZE_JOB call. Instead, use the output item code SJC$_JOB_COMPLETION_STATUS so that it is returned in a separate longword. Although that item code is not documented until the VMS 7.3 System Services manual, it works at least as far back as VMS 6.2. One silly problem with letting the batch job's final status be returned in the I/O status block is that if the batch job does an "$EXIT 0", then the $SNDJBCW call with SJC$_SYNCHRONIZE_JOB will never complete.
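A small Python sketch of why a zero exit status is awkward. It assumes that the hang comes from the usual convention that a zero status word means 'request still pending', and that with the separate item code the status block carries the service's own success status; neither detail is spelled out in this thread. All function names are hypothetical.

```python
SS_NORMAL = 1  # generic success status, used here for the service itself


def finish_with_status_in_iosb(job_exit_status):
    """The job's final status lands in the status block itself."""
    return {"iosb": job_exit_status}


def finish_with_status_in_item(job_exit_status):
    """The job's final status lands in a separate longword (assumed layout)."""
    return {"iosb": SS_NORMAL, "job_status": job_exit_status}


def request_looks_complete(result):
    # Assumed convention: a zero status word means "not finished yet".
    return result["iosb"] != 0


# A batch job that ends with "$EXIT 0":
via_iosb = finish_with_status_in_iosb(0)
via_item = finish_with_status_in_item(0)
print(request_looks_complete(via_iosb), request_looks_complete(via_item))
```

Under those assumptions, the first variant can never be told apart from a request that is still waiting, while the second reports completion and still delivers the zero exit status.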
I have one, but it's personal.
Mark C
Occasional Advisor

Re: Multi-threading and Batch Queues

Jess,

thanks for the comments, but I have double checked all those points! I have log entries in the Create, add, close and synch calls and I even use AST in each call to make sure it is complete before I return to the caller. I have pored over the documentation for sndjbc to make sure I am following all the rules.

I had one job that was 'active' for over 2.5 hours with lots of other batch jobs being created, running and deleted in between before it failed with a 295124.

It appears to me that there is a synchronization problem in the create, add_files, close, synch processing that does not exist in the enter_file, synch processing which leads me to believe it is in the create, add, close processing.

But again thanks, that is why I posted the question to the forum to see if someone could come up with something I hadn't checked!!!!
Volker Halle
Honored Contributor

Re: Multi-threading and Batch Queues

Mark,

please note that the different Source PIDs reported will most likely be the PIDs of the 2 threads in your process!

Different threads of a single process have different PIDs, although you'll never find the PID of the 2nd thread in ACCOUNTING or similar. Just have a look at your running multithreaded process with

$ ANAL/SYS
SDA> SHOW PROC/THREADS/ID=
SDA> EXIT

This shows that the explicit (or implicit) DELETE operation appears to be issued from both of your threads. Is this intentional? If so, how do you guarantee that a SJC$_CREATE_JOB does not happen while another SJC$_CREATE_JOB operation is still pending?

Could it be that the SJC$_CREATE_JOB and other operations affecting this 'open' job run in different threads?

Mixing ASTs and multithreading may be problematic in some cases.

Volker.
David Jones_21
Trusted Contributor

Re: Multi-threading and Batch Queues

"Could it be that the SJC$_CREATE_JOB and other operations affecting this 'open' job run in different threads ?"

The PIDs are for the kernel threads. The thread scheduler decides, each time a process thread becomes executable, which kernel thread it will execute on. Therefore the same thread can originate requests from different PIDs at different times. I have programs that use mailbox communication which had to be modified to accommodate a varying sender PID in the IOSB.

I'm not sure if $ICC communication has the same problem, but perhaps the shifting PIDs aren't caught properly in all the needed places in the queue manager.
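As a rough illustration in Python, the model below sends three requests from one application thread while a made-up schedule moves it between two kernel threads. The two PIDs are borrowed from the alarms quoted earlier, on the assumption that they are the two kernel-thread PIDs; the function names and the schedule are hypothetical.

```python
# Assumed kernel-thread PIDs of one process (values from the quoted alarms).
KERNEL_THREAD_PIDS = (0x0000A8CA, 0x0000ACCA)


def send_requests(requests, schedule):
    """Pair each request with the PID of the kernel thread it ran on.

    'schedule' is a made-up list of kernel-thread slots, one per request,
    standing in for the decisions of the thread scheduler.
    """
    return [(req, KERNEL_THREAD_PIDS[slot]) for req, slot in zip(requests, schedule)]


def accept(sender_pid, process_pids=KERNEL_THREAD_PIDS):
    """Receiver-side check that tolerates any kernel-thread PID of the process."""
    return sender_pid in process_pids


log = send_requests(["create_job", "add_file", "close_job"], [0, 1, 0])
for request, pid in log:
    print(f"{request:<10} sent from PID {pid:08X}")

distinct_senders = {pid for _, pid in log}
```

A receiver that compared each sender PID against a single remembered value would reject the middle request here, which is the kind of change the mailbox programs needed.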
I'm looking for marbles all day long.
Mark C
Occasional Advisor

Re: Multi-threading and Batch Queues

Only one thread ever does a create and the sequence of create, add, close and synch are always completed before it can do the next task. Each step checks the return code (status) from its action and returns an error if it did not complete successfully.

The other thread monitors for the completion of the job (the flag set by the AST of the synch), gathers up the job statistics, logs, etc., and deletes the job.

Obviously, the threads themselves can bounce between physical CPUs and KThreads, but they never cross the functional lines described above. I have added __MB() (memory barrier) calls where it seemed appropriate to make sure the memory accesses/updates were synchronized across the physical CPUs.
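The division of labour between the two threads can be sketched as follows. This is a Python model only: a thread-safe queue stands in for the flag plus the __MB() calls, and an ordinary function call stands in for the synchronize AST. The names `completion_ast` and `monitor` are hypothetical.

```python
import queue
import threading

completed = queue.Queue()   # entries whose completion "AST" has fired


def completion_ast(entry, status):
    """Stand-in for the synchronize AST: record that a job has finished."""
    completed.put((entry, status))


def monitor(results, expected):
    """Second thread: wait for completions and gather the results."""
    for _ in range(expected):
        results.append(completed.get())   # blocks until an entry arrives


results = []
watcher = threading.Thread(target=monitor, args=(results, 3))
watcher.start()

# First thread: submit jobs one at a time; each later reports completion.
for entry in (101, 102, 103):
    completion_ast(entry, 1)

watcher.join()
print(sorted(results))
```

The queue handles the cross-thread visibility that the memory barriers are meant to provide in the real program, so the model says nothing about whether those barriers are placed correctly.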
David Sweeney
New Member

Re: Multi-threading and Batch Queues

Here is the issue:

For multiple kernel threaded processes, the queue manager uses the PID of the initial thread to validate the client requests. So there can only be one open job outstanding among all the kernel threads of the process. The documentation for SJC$_CREATE_JOB, as stated previously, mentions "if a process already owns an open job, that job is deleted" which is the problem you are encountering.

There is no $SNDJBC context or input item code for the SJC$_ADD_FILE and SJC$_CLOSE_JOB functions to tell the queue manager to which open job the operation belongs. This is one reason why the restriction exists.
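The restriction can be pictured with a toy model in which open jobs are keyed by the PID of the initial thread, whichever kernel thread sent the request. This is an illustrative Python sketch, not the queue manager's implementation. The class name, the mapping table, and the two PIDs (borrowed from the alarms quoted earlier) are assumptions.

```python
JOBDELETE = 295124  # %JBC-F-JOBDELETE, job deleted before execution


class QueueManagerModel:
    """Toy model: one open job per process, keyed by the initial-thread PID."""

    def __init__(self, kthread_to_initial):
        self.kthread_to_initial = kthread_to_initial
        self.open_job = {}       # initial-thread PID -> entry of the open job
        self.final_status = {}   # entry -> final status of deleted jobs
        self.next_entry = 1

    def create_job(self, sender_pid):
        # Requests are attributed to the process, not to the kernel thread.
        owner = self.kthread_to_initial[sender_pid]
        previous = self.open_job.get(owner)
        if previous is not None:
            # "If a process already owns an open job, that job is deleted."
            self.final_status[previous] = JOBDELETE
        entry = self.next_entry
        self.next_entry += 1
        self.open_job[owner] = entry
        return entry


# Both kernel-thread PIDs map to the same initial-thread PID (assumed values).
qm = QueueManagerModel({0x0000A8CA: 0x0000A8CA, 0x0000ACCA: 0x0000A8CA})
first = qm.create_job(0x0000A8CA)
second = qm.create_job(0x0000ACCA)   # different kernel thread, same process
print(qm.final_status, qm.open_job)
```

Because no job identifier travels with the add and close requests, the model has nothing but the process identity to key on, which is why a second create from any kernel thread displaces the first open job.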

If you really need this functionality, a request can be made to enhance the queue system to support multiple open jobs. If you would prefer to handle the issue in the application code, a request can be made to the documentation group to explicitly state this behavior.

Thank you,

Dave Sweeney
OpenVMS Engineering
Mark C
Occasional Advisor

Re: Multi-threading and Batch Queues

My code never has two 'open' jobs under creation at the same time. The sequence of create, add_files, close and synch all happen in line sequentially before the thread can do the next task. All steps work successfully or the whole task is aborted. Only one thread ever does the above sequence and it is always done in the above stated sequence.
Volker Halle
Honored Contributor

Re: Multi-threading and Batch Queues

Dave,


"For multiple kernel threaded processes, the queue manager uses the PID of the initial thread to validate the client requests. So there can only be one open job outstanding among all the kernel threads of the process."


Could you please elaborate a little more ?

Could it be like this: assuming that the process sends a 'create_job' running on one kernel thread and then the PTHREAD thread gets re-scheduled onto another kernel thread and then sends the 'add_file' or 'close' information, the queue manager sees the different PIDs and will abort the transaction ?

Just speculating about a possible scenario...

Volker.