Is Stopping Processes the Best Approach?

Homer Shoemaker · 02-07-2008

Hope this is the correct area of the forum to post this. I'm also hoping that some of you might take a look at my approach and comment. Thanks.

THE SITUATION:

1) Our A/R Post runs in batch at night. Since we need exclusive access to several files to run it, we check for file availability and then bounce out of the program(s) if they're not available. Not good when this happens, but AFAIK, unavoidable.

2) Users neglect to log out before going home, sometimes leaving programs running that have the files in question open for write access.

3) Users close their terminal emulation program without logging out of their session(s), leaving processes that aren't doing anything but using up telnet sessions.

4) Active processes created by users in other UIC groups must not be terminated. They don't operate on any data files that I'm concerned about. (Logicals are wonderful things.)

5) Some specific processes created by users in the group in question must not be terminated. Specifically, a program which creates an FTP session, downloads and processes data, and then creates orders by writing to an orders file.

6) Some batch reporting processes must be allowed to continue.

7) The main program that all the users run is a menu program that creates new sub-processes to run other programs.

8) DS20, OVMS 7.1-2, Compaq COBOL

MY APPROACH:

1) Prior to running the post I send a message to all users in the same group that they're going to be logged out. (Yes, we have users at 4:00 am)

2) Then I use a DCL procedure including F$PID to identify running processes, and F$GETJPI (with various parameters) to try to figure out which ones get STOP/ID=. And then I stop them.
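
The selection in step 2 could be modelled roughly as follows. This is a Python sketch of the decision logic only, not DCL. The field names mirror the F$GETJPI items GRP, MODE and IMAGNAME, while the group number, the exempt image name FTP_ORDERS and the helper names are hypothetical placeholders. It also simplifies situation item 6 by sparing every batch process, not just some.

```python
from dataclasses import dataclass

# Hypothetical site settings; substitute real values.
TARGET_GROUP = 0o200              # UIC group (octal) whose sessions hold the A/R files
EXEMPT_IMAGES = {"FTP_ORDERS"}    # images that must never be stopped (situation item 5)

@dataclass
class Proc:
    pid: str        # hex PID, as returned while looping with F$PID
    grp: int        # F$GETJPI(pid,"GRP")
    mode: str       # F$GETJPI(pid,"MODE"): INTERACTIVE, BATCH, NETWORK or OTHER
    imagname: str   # F$GETJPI(pid,"IMAGNAME"), a full file specification

def image_basename(imagname: str) -> str:
    """Reduce DSA0:[APP.EXE]MENU.EXE;3 to MENU."""
    name = imagname.rsplit("]", 1)[-1]      # drop device and directory
    return name.split(".", 1)[0].upper()    # drop file type and version

def should_stop(p: Proc, own_pid: str) -> bool:
    if p.pid == own_pid:                    # never stop the procedure itself
        return False
    if p.grp != TARGET_GROUP:               # other UIC groups are left alone (item 4)
        return False
    if p.mode == "BATCH":                   # batch reports continue (item 6, simplified)
        return False
    if image_basename(p.imagname) in EXEMPT_IMAGES:
        return False
    return True
```

Keeping all the "do not stop" rules in one predicate makes it easier to review the exemptions before anything is actually stopped.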

MY PROBLEM:

1) I'm not sure this is the best approach. If someone has an obvious alternative approach (that I'm obviously missing :0)), I'd sure like to hear it.

2) There are some DCL programming issues that have been a little vexing while trying to implement step 2 of my approach. I'll be more specific about the vexing programming issues and appreciative of any help if there's some consensus that my approach is the proper one to take. (This post is getting a little long. Thanks for reading this far.)

Hein van den Heuvel · 02-07-2008

That's all pretty standard: the problem, the solution, the concerns, the difficulties.
All of it.
Knowing that will unfortunately not help much, but might make you feel better :-)

Ideally applications would be written anticipating this problem.
OpenVMS has the tools: take out an application lock for a 'green light' with blocking ASTs. If you get it, continue processing. If the AST fires, dequeue, stop processing, possibly close files, and re-queue, waiting for the next green light.
Or applications could have a mailbox with write-attention ASTs, waiting for graceful shutdown commands. Or they could have timed wait loops on the main menu, polling a global section for an exit flag.
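
The last of those options, a timed wait on the main menu that polls for an exit flag, might look like this in outline. Python stands in for the COBOL menu program here. `read_choice`, `exit_requested` and `run_choice` are hypothetical callables, and the 30-second poll interval is an arbitrary assumption. On VMS the flag would live in a global section or similar.

```python
from typing import Callable, Optional

def menu_loop(read_choice: Callable[[float], Optional[str]],
              exit_requested: Callable[[], bool],
              run_choice: Callable[[str], None],
              poll_seconds: float = 30.0) -> str:
    """Run the menu until the user quits or the exit flag is raised.

    read_choice(timeout) returns the user's input, or None when the
    timed read expires with nothing typed.
    """
    while True:
        if exit_requested():            # checked before every read
            return "shutdown"           # caller closes files and logs out
        choice = read_choice(poll_seconds)
        if choice is None:              # read timed out: go round and poll again
            continue
        if choice.upper() == "Q":
            return "quit"
        run_choice(choice)
```

The flag is only looked at while the user is sitting at the menu, so a program started from the menu is never interrupted mid-transaction.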

You mention COBOL and only COBOL, so may we assume RMS file I/O? COBOL offers access to the RMS FAB and RAB, so conceivably you could build a list of those and attempt to close, re-open, and re-position/relock... all of that triggered by some AST or polling event.

And you could possibly use system service intercepts or 'fake_rtl' intercepts to register and/or re-route file opens and other activity.

But all of that is probably way too complex, and 10 years, if not 20 years, too late to request.

So with what you have today, the biggest risk really is using "DEFERRED WRITE". This is a powerful performance option, but can leave dirty buffers dangling in memory for ever. A similar problem is unshared files.

So for starters, be sure to issue STOP/IMAGE/IDENT...
This uses $FORCEX, which gives RMS a nudge to clean up its act (its buffers); only then use STOP/ID.
If your OpenVMS version does not have that command, then just google for a quick tool that calls $FORCEX.
A quick (working!) starting point:

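#include <stdio.h>
#include <starlet.h>   /* declares sys$forcex() */

/* Usage: pass the target PID in hex; asks that process to run down its image. */
int main(int argc, char *argv[]) {
    unsigned int pid;
    if (argc < 2) return 16;                           /* no PID given */
    if (sscanf(argv[1], "%x", &pid) != 1) return 16;   /* not a hex number */
    return sys$forcex(&pid, 0, 0);   /* no process name, completion code 0 */
}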
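
The order of operations described above (image rundown first, process deletion only afterwards) can be sketched as follows. The three callables are hypothetical stand-ins for $FORCEX, a check that the target is still running an image, and $DELPRC. The grace period and poll interval are arbitrary assumptions.

```python
import time
from typing import Callable

def stop_process(pid: int,
                 force_exit: Callable[[int], None],
                 image_active: Callable[[int], bool],
                 delete_process: Callable[[int], None],
                 grace_seconds: float = 10.0,
                 poll_seconds: float = 0.5) -> bool:
    """Force image exit, wait for rundown, then delete the process.

    Returns True if the image ran down before the process was deleted.
    """
    force_exit(pid)                               # gives RMS the chance to flush and close
    deadline = time.monotonic() + grace_seconds
    clean = not image_active(pid)
    while not clean and time.monotonic() < deadline:
        time.sleep(poll_seconds)
        clean = not image_active(pid)
    delete_process(pid)                           # only now remove the process itself
    return clean
```

A False return is worth logging: it marks a process that was deleted while its image was still active, which is the case where file damage is most likely.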

Next, on the challenge of identifying processes and images.
Using SYS$GETJPI or F$GETJPI is fine, but indirect.
You are interested in particular files, right?
So look for accessors of those files:

1) Just use $SHOW DEV/FILE, parse output and kill.

2) $GETLKI in a loop, looking for the locks and holders for the file in question. Again, google for pre-existing solutions. For example, you could adapt my crude blocking.c:
http://h71000.www7.hp.com/freeware/freeware60/rms_tools/src/blocking.c
or David Froble's submission:
http://h71000.www7.hp.com/freeware/freeware60/rms_locks/
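
For option 1, the parsing step might be sketched like this. The column layout of the SHOW DEVICE/FILES listing (process name, 8-digit hex PID, file specification starting with a bracket) is assumed from memory and should be checked against real output on the target version. The function name and the sample file names used with it are hypothetical.

```python
import re
from typing import Dict, Set

# Assumed line shape: "<process name>  <8 hex digits>  [DIR]NAME.TYP;ver"
LINE = re.compile(r"^(?P<name>.*?)\s+(?P<pid>[0-9A-Fa-f]{8})\s+(?P<file>\[\S+)\s*$")

def accessors(show_output: str, wanted: Set[str]) -> Dict[str, str]:
    """Map PID -> process name for every process holding a wanted file open.

    wanted holds file specs without device or version, e.g. "[DATA]ORDERS.DAT".
    """
    wanted_upper = {w.upper() for w in wanted}
    found: Dict[str, str] = {}
    for line in show_output.splitlines():
        m = LINE.match(line)
        if not m:
            continue                                     # header or blank line
        spec = m.group("file").split(";", 1)[0].upper()  # drop the ;version
        pid = m.group("pid").upper()
        if spec in wanted_upper and pid != "00000000":   # skip system-owned entries
            found[pid] = m.group("name").strip()
    return found
```

The resulting PIDs would still need to pass the "must not be terminated" checks before anything is stopped.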

You may also want to 'hide' files or directories while the critical job is running, to prevent applications from sneaking in behind your back:
$rename [000000]data.dir real_data.dir
$rename [000000]fake_data.dir data.dir
The batch jobs would know to operate on 'real_data'.
The directory fake_data would contain a README file, just to avoid panic attacks.
Or it could hold READONLY copies of certain application files to allow certain lookups to continue.

This was all just a brain dump over a (late) lunch break, but I hope this gives you some ideas!

Hope this helps some,
Hein van den Heuvel (at gmail dot com)
HvdH Performance Consulting

Homer Shoemaker · 02-07-2008

Thanks for the quick response.

Yes, RMS I/O only. Yes, it does make me feel better that my approach isn't completely out of left field.

I'm going to try to use some of what you said and see how it goes. Might take me a day. So I'm going to leave this open in case I have more questions.

Robert Gezelter · 02-07-2008

Homer,

As Hein has mentioned, there are a variety of ways to deal with this. Stopping processes is safe IF AND ONLY IF one can be assured that there are no operations in progress. Idle terminals are one thing, but this can be a dicey proposition.

Personally, if the programs are under your control, I am often more inclined to add a mechanism to allow the idle sessions to be terminated from within the session itself. This is not only of benefit to the batch processes that are the immediate issue, it is also a strong benefit for accountability and security, which comes under the rubric of "accounting controls".

Were I to implement such a mechanism, I would also arrange it so that a lock is taken out on the data files to prevent new sessions from starting in the interim. This can get site specific, but done properly it operates quite smoothly.

- Bob Gezelter, http://www.rlgsc.com

John Gillings · 02-07-2008

Homer,

On V7.1 the DCL STOP command uses $DELPRC. As of V7.2, by default it does a $FORCEX. If you stay on V7.1, you should find or write a $FORCEX program to minimise the chances of data corruption.

The biggest danger with $DELPRC is attempting to kill a process in some kind of resource wait state, which then puts it into a state from which a reboot is the only recovery.

If the processes in question are running code under your control, the best solution to the overall problem is to implement a (long) timeout on all input to the program.

If the input operation times out, have the program exit cleanly. Since you're in control from the "inside" you can make sure everything is clean and tidy before exiting the program. Make the timeout the time between the end of the "normal" day and the start of overnight processing. This should be long enough that it doesn't bother users during normal working hours, but makes sure all processes have gone by the time you need exclusive access. Alternatively, make the timeout variable, ending at an absolute time just before the overnight job is due to start.

This approach avoids any dangers of an external, summary $DELPRC corrupting transactions, and the usual issues of identifying which processes should be, and can be, killed. It also works for all the cases you've mentioned and automatically affects only those users who need to be affected.
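
The variable timeout that ends at an absolute time comes down to one small calculation, sketched here in Python. The 03:45 cutoff in the usage note is a made-up example, and `seconds_until` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, time as clock

def seconds_until(cutoff: clock, now: datetime) -> float:
    """Seconds from now until the next occurrence of cutoff (today or tomorrow)."""
    target = datetime.combine(now.date(), cutoff)
    if target <= now:                   # cutoff already passed today
        target += timedelta(days=1)
    return (target - now).total_seconds()
```

A program would pass `seconds_until(clock(3, 45), datetime.now())` as the timeout on each input operation, so every session times out at the same wall-clock moment no matter when the user last pressed a key.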

A crucible of informative mistakes

Willem Grooters · 02-07-2008

An additional point, and probably the hardest bit, is to educate your users. Not just because of your processing, but for security in the first place. They should NEVER leave a session open, or stay logged in, when leaving the premises...

Willem Grooters
OpenVMS Developer & System Manager

Edwin Gersbach_2 · 02-08-2008

Huh?

I thought it was that way:

You know you work with VMS if you go for a 3 week holiday without closing the editor.

:-)

Willem Grooters · 02-08-2008

It still is. It won't hurt VMS. (Add this to your tagline: "and continue when you return, as if you hadn't left".)
But alas, business requirements are something else.

Willem Grooters
OpenVMS Developer & System Manager

Homer Shoemaker · 02-08-2008

John,
Thanks. Good info. The specifics help.

I do have control over the application source. But there are hundreds of executables in the application. When I finish this and a few other issues, I'll be migrating this server and the AS800 development server to two new Integrity servers with v8.3. It's frustrating to have to fix things now that will be easier to fix after the upgrade.

Willem,
I've stopped beating that dead horse, especially since our company is growing and there are new users every week who get trained by the chronic offenders (if they get trained at all). The owner of the company ALWAYS leaves his session logged in. I'll try to pick the battles I can win.

Willem & Edwin,
Thanks for the levity. It's always fun to remember (gloat about) how stable VMS is!

Richard W Hunt · 02-08-2008

We had a similar issue, in our case because of license credits, not files. Our third-party application charges by simultaneously active users, so we have to limit license usage to limit costs.

For us, we decided after much soul-searching to implement our own semi-intelligent job-killer. I'll spare you the gyrations and skip to what it does now. I will say that at one point we estimated that this approach saved us $80K per month.

We kill processes that have been idle for a certain amount of time as determined by taking successive snapshots of the process list to a file, then comparing snapshots.

We scan the process list to an internal data structure that includes information such as

PID
MASTER_PID
STATE (scheduler state)
BIOCNT
DIOCNT
CPUTIM
CONNECT
IMAGNAME
plus some other stuff that is just bells and whistles.

Our rule is, first to back-link all processes to their master processes. This makes a "tree" for each master process (the first process of the "job" and the one to which accounting info is linked, usually.)
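
The back-linking step amounts to grouping PIDs by master PID. A minimal Python sketch follows. It assumes each snapshot row carries the MASTER_PID value from the list above and that a master process reports itself as its own master. `build_trees` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

def build_trees(procs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group PIDs by master PID.

    procs yields (pid, master_pid) pairs. Each key of the result is a
    master PID and each value lists every process in that job.
    """
    trees: Dict[str, List[str]] = defaultdict(list)
    for pid, master in procs:
        trees[master].append(pid)
    return dict(trees)
```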

Once we have the processes in trees, we check the states. A process can be killed when it is in LEF state with exactly and only one buffered I/O pending and no disk I/O pending... OR if it is a parent process in an HIB state with NO I/O pending and it has exactly one child in the earlier state. If there is no corresponding process in the "previous" snapshot, we skip it because it is a new process. We also have different rules for different UIC-based groups and for processes running specific programs.

If ANY MEMBER of that process tree is not eligible to be killed, the whole tree is not eligible and lives on to the next cycle.

The final test is to compare the current and previous accounting values for a given scan, limited only to eligible members of the list. When there is no change, the process is idle according to our standards. We kill if and only if it passes all of the above tests.
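
Put together, the eligibility rules and the snapshot comparison might be sketched as below. This is an illustrative Python model, not the actual tool. The `Snap` fields, the idea of storing pending I/O counts directly, and the two-level view of a tree (a "child" is any other member of the same tree) are simplifying assumptions. The per-group and per-image rules are left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass(frozen=True)
class Snap:
    pid: str
    master_pid: str
    state: str              # scheduler state, e.g. "LEF", "HIB", "COM"
    pending_bio: int        # buffered I/Os outstanding (assumed derived from BIOCNT)
    pending_dio: int        # disk I/Os outstanding (assumed derived from DIOCNT)
    acct: Tuple[int, ...]   # accounting counters, e.g. CPU time and I/O totals

def waiting_leaf(p: Snap) -> bool:
    """LEF with exactly one buffered I/O pending and no disk I/O pending."""
    return p.state == "LEF" and p.pending_bio == 1 and p.pending_dio == 0

def eligible(p: Snap, members: List[Snap], prev: Dict[str, Snap]) -> bool:
    if p.pid not in prev:                   # new since the last snapshot: skip
        return False
    if waiting_leaf(p):
        return True
    waiting = [c for c in members if c.pid != p.pid and waiting_leaf(c)]
    return (p.state == "HIB" and p.pending_bio == 0
            and p.pending_dio == 0 and len(waiting) == 1)

def idle_trees(prev: Dict[str, Snap], curr: Dict[str, Snap]) -> List[str]:
    """Return master PIDs whose whole tree looks idle across two snapshots."""
    trees: Dict[str, List[Snap]] = {}
    for p in curr.values():
        trees.setdefault(p.master_pid, []).append(p)
    idle = []
    for master, members in trees.items():
        # One ineligible or changed member keeps the whole tree alive.
        if all(eligible(p, members, prev) and p.acct == prev[p.pid].acct
               for p in members):
            idle.append(master)
    return idle
```

Comparing accounting counters only for members that already passed the state test keeps the final check cheap and avoids looking up processes that have no previous snapshot.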

The image that does this isn't particularly privileged - it just does file I/O (for snapshot files), a GETJPI call inside a loop, and the FORCEX/DELPRC sequence for our vict... targets.

As complex as this sounds, we get several kills an hour from it. I don't suggest that you would do exactly this, but it might give you some ideas about how far you can go.

The issue of a limited number of license credits is just another reason to kill idle processes, even though such processes otherwise seem benign. Just like your locked files issue.

Sr. Systems Janitor

Homer Shoemaker · 02-13-2008

Thanks, all.
