Operating System - OpenVMS

SYSMAN startup sequence aborting

 
Thomas A. Williams
Regular Advisor

SYSMAN startup sequence aborting

Has anyone ever heard of the STARTUP$STARTUP_LAYERED startup database not running to completion when one particular command procedure aborts with a failure? We've got a PEEK/SPY startup command procedure that fails once every 3 months or so (i.e. very infrequently) but when it does, the rest of the command scripts in the layered database do not run. I've tried to replicate this with simple dummy command procedures but am unable to.

We're scratching our heads at this point. Just wondering if someone out there has seen this and can point us in the right direction as to where to look.

BTW - we don't have image accounting turned on; it takes up too many resources.
13 REPLIES
Galen Tackett
Valued Contributor

Re: SYSMAN startup sequence aborting

Thomas,

Could it be that you're somehow getting an exec-mode bugcheck (typically coming from within RMS)? This kind of bugcheck does not cause a crash by default (BUGCHECKFATAL=0) but instead causes the user process to exit immediately.

You might want to check your error log file for the time interval around the most recent case of this abort, and see if any nonfatal bugchecks were logged.
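One way to do that check is sketched below. It assumes a system recent enough to have the ELV (Error Log Viewer) utility and the default error log file; older releases used different tools for the same job. The dates are placeholders, not values from this thread.

```
$! Sketch only: list error log entries for a window around the failed startup.
$! The SINCE/BEFORE times are hypothetical; substitute the time of the last abort.
$ ANALYZE/ERROR_LOG/ELV TRANSLATE /BRIEF -
        /SINCE=15-JAN-2008:02:00 /BEFORE=15-JAN-2008:03:00
```

Any bugcheck entries in that window would be worth a closer look.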

A more drastic alternative, but one that's guaranteed to catch an exec-mode bugcheck in real time, is to set the sysgen parameter BUGCHECKFATAL to 1. Since this is a dynamic parameter you can change its value without rebooting.

There might be other reasons for what you're seeing. This is just the first one that came to my mind.
Robert Gezelter
Honored Contributor

Re: SYSMAN startup sequence aborting

Thomas,

I have not seen this behavior. However, I would start by doing a full listing of the STARTUP$STARTUP_LAYERED database to see what is being requested.

My suspicions would center around things that could cause odd problems, like quotas, particularly in command files that are executed DIRECT (as opposed to SPAWN). I would also enable the various traces for the startup process, as well as route the startup to a file rather than the console (all of which are documented in the HELP text).
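As a rough sketch of those steps (check the SYSMAN HELP text on your version before relying on the exact qualifiers), the listing and the logging options look something like this:

```
$ MCR SYSMAN
SYSMAN> STARTUP SET DATABASE STARTUP$STARTUP_LAYERED
SYSMAN> STARTUP SHOW FILE /FULL       ! phase, mode, node, and parameters of each entry
SYSMAN> STARTUP SHOW OPTIONS          ! current startup logging settings
SYSMAN> STARTUP SET OPTIONS /VERIFY=FULL /OUTPUT=FILE /CHECKPOINTING
SYSMAN> EXIT
```

The options apply to subsequent boots, so they need to be in place before the next occurrence. The /FULL listing shows the mode (DIRECT, SPAWN, or BATCH) of each file.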

I might also move PEEK/SPY startup, at least temporarily, to a later phase in the startup sequence, to somewhat mitigate collateral damage.
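A sketch of that change follows. The file name PEEK$STARTUP.COM is only a placeholder, since the actual name of the PEEK/SPY startup file was not given; use the name shown by STARTUP SHOW FILE.

```
$ MCR SYSMAN
SYSMAN> STARTUP SET DATABASE STARTUP$STARTUP_LAYERED
SYSMAN> STARTUP MODIFY FILE PEEK$STARTUP.COM /PHASE=END   ! hypothetical file name
SYSMAN> STARTUP SHOW FILE /FULL                           ! confirm the new phase
SYSMAN> EXIT
```

This only helps if nothing else in the layered database depends on PEEK/SPY having been started earlier.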

[Disclosure: My firm does consult on matters of this type, as do several other active contributors to this forum].

- Bob Gezelter, http://www.rlgsc.com
Galen Tackett
Valued Contributor

Re: SYSMAN startup sequence aborting

Oops. I should have made it explicit that once you set BUGCHECKFATAL to 1 and write the active parameters, an exec-mode bugcheck that is normally nonfatal WILL cause the system to crash.

Also, to set BUGCHECKFATAL you can use either of these command sequences:

$ MCR SYSGEN
SYSGEN>USE ACTIVE
SYSGEN>SHOW BUGCHECKFATAL ! It's worth checking whether it is already set
SYSGEN>SET BUGCHECKFATAL 1
SYSGEN>WRITE ACTIVE

or

$ MC SYSMAN
SYSMAN>PARAMETER USE ACTIVE
SYSMAN>PARAMETER SHOW BUGCHECKFATAL
SYSMAN>PARAMETER SET BUGCHECKFATAL 1
SYSMAN>PARAMETER WRITE ACTIVE


Also, remember that you'll probably want to turn BUGCHECKFATAL off again. To do that just repeat either command sequence above, substituting 0 for 1.

Hope this helps,

Galen
Galen Tackett
Valued Contributor

Re: SYSMAN startup sequence aborting

I have to admit that Robert's line of thinking looks more likely than mine...

That's why people pay him money for this kind of work, whereas since April 1 I only do VMS for pleasure.

(Hmm. That wasn't meant to sound kinky or anything. :-)
Robert Gezelter
Honored Contributor

Re: SYSMAN startup sequence aborting

Galen,

Thank you.

Thomas,

Seriously, my rules for such infrequent problems are similar to the rules for first aid:

1) Contain the damage
2) Mitigate its impact
3) Produce useful data for analysis

With a problem that cannot be reproduced on demand, it is important that we not take a step that would increase the impact.

- Bob Gezelter, http://www.rlgsc.com
Wim Van den Wyngaert
Honored Contributor

Re: SYSMAN startup sequence aborting

Had to do some reboots before I got another possible reason.

If you have mode BATCH, the job terminates with Fatal, and the job is either retained on error (the queue has retain=error) or takes long enough that startup reaches the sync,
THEN
startup will display "Batch job x terminated with error status." and the startup procedure does an exit with Fatal. And thus the startup is aborted.

IMO not logical at all because in direct or spawn mode this is not done.

fwiw

Wim
Wim
Jon Pinkley
Honored Contributor

Re: SYSMAN startup sequence aborting

RE: SYNCHRONIZE exits with exit status of remote batch job.

Wim,

I agree that this behavior is surprising, but it has been this way for quite some time (possibly since the initial design) and it can be very useful.

However, the DCL help says nothing about this behavior, and the only way I know of to work around it is to wrap the synch statement with set noon and set on statements. And if you don't want confusing error messages in the log file, also save the message state with message_state = f$environment("MESSAGE"), then use set message/nofac... before the synch and set message 'message_state' after it.

The other option is to write your own synchronize using $getqui.
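Put together, that workaround might look roughly like the sketch below. The symbol my_entry is a hypothetical symbol holding the batch entry number, and the full list of set message qualifiers is my assumption about what to suppress; adjust to taste.

```
$! Sketch only: wait for a batch job without letting its exit status abort this procedure.
$! MY_ENTRY is a hypothetical symbol containing the entry number of the batch job.
$ message_state = f$environment("MESSAGE")   ! save the current message settings
$ set noon                                   ! do not abort this procedure on an error status
$ set message /nofacility /noseverity /noidentification /notext
$ synchronize /entry='my_entry'
$ sync_status = $status                      ! keep the status returned by SYNCHRONIZE
$ set message 'message_state'                ! restore the saved message settings
$ set on
$ if .not. sync_status then write sys$output "Batch job finished with status ", sync_status
```

Saving $status right after the synchronize keeps the batch job's result available in case the caller still wants to act on it.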

I can understand why $getqui has the capability of returning the exit status of the synch'ed batch job; it can be very useful information to the job synchronizing on the completion. If the purpose of the synchronize is to allow a pre-requisite operation to complete, it may be necessary for the previous operation to complete successfully. This mechanism provides for the signaling of this status to any process that is waiting for its completion.

--------- begin wish list
I wish it were possible to specify the local DCL symbol for the remote status to be returned in, for example something like

$ synchronize /entry='my_entry'/remote_status=my_status ! this does not exist

This would then set the local symbol my_status with the exit status from the batch job with entry 'my_entry', and the $status symbol (and $severity) would only be set to non-success status if the entry did not exist or the process did not have access to it.

Without /remote_status, the behavior must remain the way it is, or it would break existing software.

The use of /noremote_status should cause synchronize to just throw the remote status away instead of using it to set a specified local symbol with the value. This would be the form you would use if all you wanted was synchronization, and did not care what the exit status of the batch job was.
--------- end wish list

Jon
it depends
Wim Van den Wyngaert
Honored Contributor

Re: SYSMAN startup sequence aborting

In any case, we once had 250 machines using sysman boot and to my knowledge never had the problem. Probably never had something F, just E.

That's why I prefer my own boot procedure, unix like. You can correct it when things are not what you need.

Wim
Wim
Robert Gezelter
Honored Contributor

Re: SYSMAN startup sequence aborting

Wim,

WADR, I differ. Having used SYSMAN startup at numerous sites, I find it quite well behaved.

In this particular case, if the theory is correct that the -F- error in a batch job is the issue, the problem is that the behavior is not fully documented. What is most important is that the behavior be documented.

For other reasons, I generally use SPAWN, rather than batch jobs.

- Bob Gezelter, http://www.rlgsc.com