Operating System - OpenVMS

SYSMAN startup sequence aborting

 
Thomas A. Williams
Regular Advisor

Has anyone ever heard of the STARTUP$STARTUP_LAYERED startup database not running to completion when one particular command procedure aborts with a failure? We've got a PEEK/SPY startup command procedure that fails once every 3 months or so (i.e. very infrequently) but when it does, the rest of the command scripts in the layered database do not run. I've tried to replicate this with simple dummy command procedures but am unable to.

We're scratching our heads at this point. Just wondering if someone out there has seen this and can point us in the right direction as to where to look.

BTW - we don't have image accounting turned on; it takes up too many resources.
Galen Tackett
Valued Contributor

Re: SYSMAN startup sequence aborting

Thomas,

Could it be that you're somehow getting an exec-mode bugcheck (typically coming from within RMS)? This kind of bugcheck does not cause a crash by default (BUGCHECKFATAL=0) but instead causes the user process to exit immediately.

You might want to check your error log file for the time interval around the most recent case of this abort, and see if any nonfatal bugchecks were logged.

A more drastic alternative, but one that's guaranteed to catch an exec-mode bugcheck in real time, is to set the sysgen parameter BUGCHECKFATAL to 1. Since this is a dynamic parameter you can change its value without rebooting.

There might be other reasons for what you're seeing. This is just the first one that came to my mind.
Robert Gezelter
Honored Contributor

Re: SYSMAN startup sequence aborting

Thomas,

I have not seen this behavior. However, I would start by doing a full listing of the STARTUP$STARTUP_LAYERED database to see what is being requested.

My suspicions would center around things that could cause odd problems, like quotas, particularly in command files that are executed DIRECT (as opposed to SPAWN). I would also enable the various traces for the startup process, as well as route the startup to a file rather than the console (all of which are documented in the HELP text).
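A minimal sketch of the listing and tracing steps described above, using the SYSMAN STARTUP commands. It assumes the default layered database and that the options are wanted for the next boot; with /OUTPUT=FILE the startup output should go to SYS$SPECIFIC:[SYSEXE]STARTUP.LOG rather than the console. Check HELP STARTUP inside SYSMAN for the qualifiers available on your version.

```dcl
$ ! Sketch only: list the layered startup database in full, then turn on
$ ! full verification, file output, and checkpoint messages for the next boot.
$ MCR SYSMAN
SYSMAN> STARTUP SET DATABASE STARTUP$STARTUP_LAYERED
SYSMAN> STARTUP SHOW FILE /FULL
SYSMAN> STARTUP SET OPTIONS /VERIFY=FULL /OUTPUT=FILE /CHECKPOINTING
SYSMAN> STARTUP SHOW OPTIONS
SYSMAN> EXIT
```

The /FULL listing shows the phase and mode (DIRECT, SPAWN, BATCH) of each component, which is the information needed to judge how one failing procedure could affect the rest.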

I might also move PEEK/SPY startup, at least temporarily, to a later phase in the startup sequence, to somewhat mitigate collateral damage.
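Moving a component to a later phase can be sketched as follows. The file name PEEK_STARTUP.COM is a hypothetical placeholder; substitute whatever name the STARTUP SHOW FILE listing reports for the PEEK/SPY entry. END is the last of the layered-product phases (LPBEGIN, LPMAIN, LPBETA, END).

```dcl
$ ! Sketch only: PEEK_STARTUP.COM is a placeholder name, not the real
$ ! component name. Take the actual name from STARTUP SHOW FILE /FULL.
$ MCR SYSMAN
SYSMAN> STARTUP SET DATABASE STARTUP$STARTUP_LAYERED
SYSMAN> STARTUP MODIFY FILE PEEK_STARTUP.COM /PHASE=END
SYSMAN> STARTUP SHOW FILE PEEK_STARTUP.COM /FULL
SYSMAN> EXIT
```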

[Disclosure: My firm does consult on matters of this type, as do several other active contributors to this forum].

- Bob Gezelter, http://www.rlgsc.com
Galen Tackett
Valued Contributor

Re: SYSMAN startup sequence aborting

Oops. I should have made it explicit that once you set BUGCHECKFATAL to 1 and write the active parameters, a normally nonfatal exec-mode bugcheck WILL cause the system to crash.

Also, to set BUGCHECKFATAL you can use either of these command sequences:

$ MCR SYSGEN
SYSGEN>USE ACTIVE
SYSGEN>SHOW BUGCHECKFATAL ! It's worth checking whether it is already set
SYSGEN>SET BUGCHECKFATAL 1
SYSGEN>WRITE ACTIVE

or

$ MC SYSMAN
SYSMAN>PARAMETER USE ACTIVE
SYSMAN>PARAMETER SHOW BUGCHECKFATAL
SYSMAN>PARAMETER SET BUGCHECKFATAL 1
SYSMAN>PARAMETER WRITE ACTIVE


Also, remember that you'll probably want to turn BUGCHECKFATAL off again. To do that just repeat either command sequence above, substituting 0 for 1.

Hope this helps,

Galen
Galen Tackett
Valued Contributor

Re: SYSMAN startup sequence aborting

I have to admit that Robert's line of thinking looks more likely than mine...

That's why people pay him money for this kind of work, whereas since April 1 I only do VMS for pleasure.

(Hmm. That wasn't meant to sound kinky or anything. :-)
Robert Gezelter
Honored Contributor

Re: SYSMAN startup sequence aborting

Galen,

Thank you.

Thomas,

Seriously, my rules for such infrequent problems are similar to the rules for first aid:

1) Contain the damage
2) Mitigate its impact
3) Produce useful data for analysis

With a problem that cannot be reproduced on demand, it is important that we not take a step that would increase the impact.

- Bob Gezelter, http://www.rlgsc.com
Wim Van den Wyngaert
Honored Contributor

Re: SYSMAN startup sequence aborting

Had to do some reboots before I got another possible reason.

If a component runs in mode BATCH and its job terminates with Fatal, and the job is either retained on error (queue has retain=error) or takes long enough that startup reaches the sync, then startup will display "Batch job x terminated with error startus." and the startup procedure exits with Fatal. Thus the startup is aborted.

IMO not logical at all because in direct or spawn mode this is not done.

fwiw

Wim
Wim
Jon Pinkley
Honored Contributor

Re: SYSMAN startup sequence aborting

RE: SYNCHRONIZE exits with exit status of remote batch job.

Wim,

I agree that this behavior is surprising, but it has been this way for quite some time (possibly since the initial design) and it can be very useful.

However, the DCL help says nothing about this behavior, and the only way I know of to work around it is to wrap the synch statement in set noon and set on statements. If you also want to avoid confusing error messages in the log file, save the message state first with message_state = f$environment("MESSAGE"), issue set message/nofac... before the synch, and restore it afterwards with set message 'message_state'
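The workaround described above might look like the following sketch. It assumes the symbol my_entry already holds the entry number of the batch job; sync_status is a name introduced here purely for illustration.

```dcl
$ ! Sketch only: my_entry is assumed to hold the batch entry number.
$ message_state = f$environment("MESSAGE")   ! save current message settings
$ set noon                                   ! do not abort on error or fatal
$ set message /nofacility /noseverity /noidentification /notext
$ synchronize /entry='my_entry'
$ sync_status = $status                      ! capture status before it changes
$ set message 'message_state'                ! restore message settings
$ set on
$ if .not. sync_status then write sys$output "Entry ''my_entry' ended with status ''sync_status'"
```

Capturing $status immediately after the synchronize matters, because the following set message and set on commands overwrite it.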

The other option is to write your own synchronize using $getqui

I can understand why $getqui has the capability of returning the exit status of the synch'ed batch job; it can be very useful information to the job synchronizing on the completion. If the purpose of the synchronize is to allow a pre-requisite operation to complete, it may be necessary for the previous operation to complete successfully. This mechanism provides for the signaling of this status to any process that is waiting for its completion.

--------- begin wish list
I wish it were possible to specify the local DCL symbol for the remote status to be returned in, for example something like

$ synchronize /entry='my_entry'/remote_status=my_status ! this does not exist

This would then set the local symbol my_status with the exit status from the batch job with entry 'my_entry', and the $status symbol (and $severity) would only be set to non-success status if the entry did not exist or the process did not have access to it.

Without /remote_status, the behavior must remain the way it is, or it would break existing software.

The use of /noremote_status should cause synchronize to just throw the remote status away instead of using it to set a specified local symbol with the value. This would be the form you would use if all you wanted was synchronization, and did not care what the exit status of the batch job was.
--------- end wish list

Jon
it depends
Wim Van den Wyngaert
Honored Contributor

Re: SYSMAN startup sequence aborting

In any case, we once had 250 machines using sysman boot and to my knowledge never had the problem. We probably never had a status of F, just E.

That's why I prefer my own boot procedure, unix like. You can correct it when things are not what you need.

Wim
Wim
Robert Gezelter
Honored Contributor

Re: SYSMAN startup sequence aborting

Wim,

WADR, I differ. I have used SYSMAN startup at numerous sites and found it quite well behaved.

In this particular case, if the theory that the -F- error in a batch job is the issue proves correct, the problem is that the behavior is not fully documented. What is most important is that the behavior be documented.

For other reasons, I generally use SPAWN, rather than batch jobs.

- Bob Gezelter, http://www.rlgsc.com
Wim Van den Wyngaert
Honored Contributor

Re: SYSMAN startup sequence aborting

If you want to be 100% sure that each startup item is not disturbed by another one (on the subject of symbols and process logicals), only batch will do. Spawn is second best.

My point is that if you change mode from SPAWN to BATCH, your boot may fail. IMO this should not be the case. Either ALL modes should fail for an F, or none.

In any case, my boot continues whatever the status. A procedure has to explicitly request an abort of the boot (by means of a logical), e.g. when mounting of the disks failed.

And there is no message that the boot is aborted when the batch job ends in F.

I also remember that some errors reported as fatal are really warnings; it all depends on who programmed it. So who cares about the status VMS gives us? You have to test success yourself.

Wim
Wim
Wim Van den Wyngaert
Honored Contributor

Re: SYSMAN startup sequence aborting

Jon,

Wishlist: add this too: submit the batch job with /retain=error. At present it is the system manager who must set /retain on the queue.

Also, how would one modify the sync? It's in startup.com, which you may not modify.

Wim
Wim
Jon Pinkley
Honored Contributor

Re: SYSMAN startup sequence aborting

Wim,

I saw the sync but didn't realize you were referring to sys$system:startup.com; I was thinking that one of the files being executed was using the sync statement.

What I was talking about was just the general behavior of the DCL SYNCHRONIZE verb.

Sorry if I caused anyone confusion.

Jon
it depends
Wim Van den Wyngaert
Honored Contributor

Re: SYSMAN startup sequence aborting

As you and others in the US are funding the global bailout, you are forgiven.

For those with financial hobbies :

http://www.nypost.com/seven/09212008/business/almost_armageddon_130110.htm

Wim
Wim