Re: Detecting restart in SYSTARTUP_VMS

Jack Trachtman · 02-13-2005

I want to add DCL to SYSTARTUP_VMS.COM
that only runs after a crash-restart.
What can I check for that would indicate
a restart as opposed to a normal boot? TIA

David B Sneddon · 02-13-2005

Jack,

One solution (there are probably many) that springs
to mind:

In SYSHUTDWN, create a marker file somewhere to indicate
a normal shutdown. In a crash, this will not be created.
In SYSTARTUP_VMS, if the file exists then it was a
"normal" shutdown, if not then there was a crash.
DELETE the file in SYSTARTUP_VMS.
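
A minimal sketch of the marker-file idea. The file name SYS$MANAGER:CLEAN_SHUTDOWN.MARKER is hypothetical and chosen only for illustration; any node-specific location would do.

```dcl
$! --- In SYS$MANAGER:SYSHUTDWN.COM ---
$! Record that an orderly shutdown ran (marker file name is hypothetical).
$ OPEN/WRITE marker SYS$MANAGER:CLEAN_SHUTDOWN.MARKER
$ WRITE marker F$GETSYI("NODENAME"), " ", F$TIME(), " normal shutdown"
$ CLOSE marker
$!
$! --- In SYS$MANAGER:SYSTARTUP_VMS.COM ---
$ IF F$SEARCH("SYS$MANAGER:CLEAN_SHUTDOWN.MARKER") .NES. ""
$ THEN
$!   Marker present: the previous shutdown ran SYSHUTDWN.COM.
$    DELETE/NOLOG SYS$MANAGER:CLEAN_SHUTDOWN.MARKER;*
$ ELSE
$!   No marker: treat as a crash, power failure, or halt.
$    WRITE SYS$OUTPUT "Previous shutdown was not orderly"
$ ENDIF
```

Deleting the marker at startup matters: a stale marker left over from an earlier shutdown would otherwise hide the next crash.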

Regards
Dave

Mobeen_1 · 02-13-2005

Jack,
I thought of something like Dave suggested. You could also look at DECevent or something similar.

But I was researching whether there is a lexical function that can tell this. I have been looking through the lexical function arguments and am still doing so. If there is such a lexical, it will probably be the best solution.

rgds
Mobeen

David B Sneddon · 02-13-2005

I probably should have added --

the best solution will likely be determined by
exactly what it is you are trying to achieve.

Dave.

Volker Halle · 02-13-2005

Jack,

if the system has a valid SYSDUMP.DMP file, you can check during startup, whether a new dumpfile has been written, i.e. whether the system is now just booting after a valid system crash:

- OpenVMS Alpha (V6.2 or higher):

$ directory/modify/since="''f$getsyi("boottime")'" -
clue$collect:clue$'F$GETSYI("NODENAME")'_%%%%%%_%%%%.lis
$ IF $STATUS
$ THEN
$! system is booting after a valid crash
$ ENDIF

- OpenVMS VAX (V6.0 or higher)

$ SEARCH CLUE$OUTPUT:CLUE$LAST_'F$GETSYI("NODENAME")'.LIS "Operator Shutdown on Node"
$ IF $STATUS .EQ. %X08D78053 ! %SEARCH-I-NOMATCHES
$ THEN
$! system is booting after a valid crash
$ ENDIF

This logic depends on the fact that a new 'CLUE file' is being created by the OpenVMS system startup procedures. It has been successfully used to automatically log/mail system crash information.

Volker.

Jan van den Ende · 02-13-2005

Jack,

actually, _WE_ want to be triggered earlier.

On our cluster, every node frequently checks that all other known nodes are still present.

If one is missing, immediately a pager call is sent to the stand-by system manager.

Of course, you do not want that on a scheduled shutdown.

Our implementation of this:
create a clusterwide logical name with a list of the active cluster nodes.
In SYSHUTDWN remove the shutting node from that list.
Also, to prevent a continuous stream of pager calls, remove that name after sending a pager call.

Potential caveat: who will signal if the whole site goes down?
For us no heavy worry, because our cluster is multi-site.
A bigger worry is the fact that the pager mechanism depends on the phone being available & functioning!

fwiw,

Proost.

Have one on me.

Jan

Don't rust yours pelled jacker to fine doll missed aches.

Mobeen_1 · 02-13-2005

Jan,
This works on a cluster, but what happens if the VMS system that Jack is working on is a stand-alone server and not part of a VMS cluster?

regards
Mobeen

Jan van den Ende · 02-13-2005

Mobeen,

that's correct, this is a cluster-only solution.

Proost.

Have one on me.

Jan

Mobeen_1 · 02-13-2005

All,
Does anyone know of a lexical function that can tell me whether the server was shut down normally or crashed?

I have searched for a long time without any luck :). If there is such a thing, it should help Jack.

rgds
Mobeen

Volker Halle · 02-13-2005

Mobeen,

there is no lexical function or anything similar and simple designed into OpenVMS to find out, if the system is booting after a crash or a shutdown.

Note that there also is no possibility to find out, if someone has just done an external halt and boot (or hit the RESTART button).

You could also try SEARCH OPERATOR.LOG;-1 for the string "'nodename' shutdown was requested by the operator"

Or you could try to look at the most recent errors in ERRLOG.SYS and look for a bugcheck errlog entry.

Looking for the CLUE file was the best method we could come up with, when we created the AutoCLUE (later called CCAT) startup command procedures for automatic crash call logging.

Volker.

Jack Trachtman · 02-14-2005

I remember reading years ago that when the
ANALYZE/CRASH cmd executes and finds a valid
dump file, the first thing it does is to change something in the first block of the dump file so that a subsequent crash will not cause a reanalysis of the file.

If I'm recalling this correctly, does anyone know what is being changed in the dump file that could serve as a crash-restart indication?

Volker Halle · 02-14-2005

Jack,

sure, the OLDDUMP bit is set, once SDA has initially opened a system dump file.

From SYS$LIBRARY:LIB.REQ:

macro DMP$V_OLDDUMP = 4,0,1,0 %; ! SET IF DUMP ALREADY ANALYZED

This is the BLISS structure definition and maps to bit 0 of the 2nd longword in SYSDUMP.DMP.

On OpenVMS Alpha, CLUE$SDA evaluates this bit and responds with a

%CLUE-I-ALRDYANA, dumpfile has already been analyzed

message (see SYS$MANAGER:CLUE$STARTUP_node.LOG), when executing the CLUE HISTORY command during startup.

Volker.

Jan van den Ende · 02-14-2005

Jack,

Volker's explanation IS correct in one direction:
_IF_ you have a valid new dump, _THEN_ this is the first reboot after a crash.
But you can absolutely _NOT_ conclude the reverse: a previously-read SYSDUMP does _NOT_ imply that the previous shutdown was operator-requested. It might well be, but it can also indicate that for some reason a dump was not, or could not be, written.

^P immediately comes to mind, but consider this scenario: (we were bitten by it).
Two nodes, connected by one SCSI-bus to each other and to HSZ40 controllers.
In hindsight, one of the SCSI connector cables to one node was broken, but normally the broken edges touched. At irregular intervals (by vibration or temperature change probably) connection got interrupted.
That node crashed, but... no connection to the disks. And that constitutes a fairly strong reason for not writing a dump to disk!

As a thought experiment, it is rather easy to construct configurations and/or power issues that somehow break at a point or a moment that _WILL_ prevent writing the dump.

So, _NO_ valid dump does _NOT_ imply operator requested shutdown!

hth

Proost.

Have one on me.

Jan

Jack Trachtman · 02-14-2005

I appreciate everyone reminding me that I can't cover all possible scenarios, but at least catching a crash-restart is helpful.

Volker,

I took your info:

...maps to bit 0 of the 2nd longword in SYSDUMP.DMP

and tried to confirm this by adding the following line to SYSTARTUP_VMS both before and after the ANA/CRASH statements:

$ dump/blocks=end:1 sysdump.dmp

but the output looked exactly the same!
Shouldn't I have been able to see the bit
in the longword being toggled? Am I doing
something simple wrong in my test?

Jan van den Ende · 02-14-2005

Jack,

all-in-all I tend to the idea that David's suggestion comes closest to consistency.

Maybe, there are 2 issues (one, a matter of policy, the other, rather low probability) that I can think of:

1. How about operator requested shutdown requesting _NOT_ to execute site-specific shutdown?
You either equate that to a crash, OR you fiddle around with SHUTDOWN.COM, and be prepared for some consequences at upgrades.

2. The system COULD potentially crash AFTER writing your "shutdown requested" file.
This would present a crash-during-shutdown as "operator requested shutdown". But in this case, the executing operator might well have noted "something" ??? And, here the "fresh dump check" might still help..

Still,
by far the most consistent approach, as far as I can reckon.

hth,

Proost.

Have one on me.

Jan

Lawrence Czlapinski · 02-14-2005

Jack, unfortunately we have seen cases where the system crashed without a crash dump.
Is this for a cluster?
For our clusters, we use a CLUSTER_MONITOR.COM which can be used to trigger a pager through DECtalk. However, this doesn't tell you whether there was a crash or not.

You may have to implement something realizing that it won't work all the time.
Writing a file at the end of the site specific operator shutdowns would filter out a lot of the shutdowns. Occasionally a system could crash after the file is written but that would normally be a low probability. Over time you would try to cover more shutdowns. Depending on a system dump being written is riskier. As others have stated there can be assorted reasons why a dump isn't written after a crash.
Lawrence

Volker Halle · 02-14-2005

Jack,

CLUE$STARTUP.COM is run earlier than SYSTARTUP_VMS.COM and will automatically analyze the crash for you (creating a CLUE$COLLECT:CLUE$node_ddmmyy_hhmm.LIS file).

So if you really want to see the OLDDUMP bit clear, you need to add your DUMP command to SYLOGICALS.COM. In SYSTARTUP_VMS.COM, it will be too late...
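
As an alternative to comparing DUMP output by eye, the OLDDUMP bit could be tested directly in DCL. The following is only a sketch under several assumptions: the dump lives in SYS$SYSTEM:SYSDUMP.DMP (not in the page file or off the system disk), the process can read it, and the first 512-byte block holds the dump header as described above. The label and symbol names are made up for the example.

```dcl
$! Sketch: test DMP$V_OLDDUMP (bit 0 of the 2nd longword of the dump header).
$! Run early enough (e.g. from SYLOGICALS.COM) to see the bit before CLUE sets it.
$ OPEN/READ/ERROR=no_dump dmp SYS$SYSTEM:SYSDUMP.DMP
$ READ/ERROR=close_dump dmp header      ! first 512-byte block of the file
$ olddump = F$CVUI(32, 1, header)       ! bit offset 32 = bit 0 of longword 2
$ IF olddump .EQ. 0
$ THEN
$    WRITE SYS$OUTPUT "Dump has not been analyzed yet"
$ ELSE
$    WRITE SYS$OUTPUT "Dump already analyzed (OLDDUMP set)"
$ ENDIF
$close_dump:
$ CLOSE dmp
$no_dump:
$ EXIT
```

A clear bit only says that SDA has not yet opened this dump; it does not by itself prove that the dump contents are valid.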

May I repeat my suggestion to look at OPERATOR.LOG;-1 for the 'node' shutdown string (to see what I'm referring to, just try a TYPE/TAIL SYS$MANAGER:OPERATOR.LOG;-1 if the system has just been rebooted). If you want to know whether the system had been shut down normally, this is the place to look. You need to substitute the local nodename in the SEARCH command so as not to match shutdown messages from other nodes, i.e.:

$ SEARCH SYS$MANAGER:OPERATOR.LOG;-1 -
" ''F$GETSYI("NODENAME")' shutdown was requested by the operator."

If you include /WINDOW=(2,0) you'll also see when the shutdown happened and which user did it.
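
To act on the result in SYSTARTUP_VMS.COM, the SEARCH status can be tested in the same way as in the VAX example earlier in the thread. This is a sketch, not a tested procedure; the symbol names are illustrative, and it assumes OPERATOR.LOG;-1 is the log from the previous boot and that OPCOM logging of these messages is enabled.

```dcl
$ SET NOON                          ! do not abort startup if SEARCH fails
$ node = F$GETSYI("NODENAME")
$ SEARCH/NOOUTPUT SYS$MANAGER:OPERATOR.LOG;-1 -
        " ''node' shutdown was requested by the operator"
$ status = $STATUS
$ IF status .EQ. %X08D78053         ! %SEARCH-I-NOMATCHES
$ THEN
$    WRITE SYS$OUTPUT "No shutdown message found: possible crash or halt"
$ ELSE
$!   A success status here means at least one match was found.
$!   An error status (e.g. file not found) is left undecided.
$    IF status THEN WRITE SYS$OUTPUT "Previous shutdown was operator-requested"
$ ENDIF
```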

Volker.

Wim Van den Wyngaert · 02-14-2005

Also note that e.g. memory corruption may make your node hang instead of crash.

At first I only checked whether CLUE created an analysis file shortly before the reboot. If yes, this was a crash. But I had multiple cases where the system simply hung. So now I have a site manager node pinging the important parts (node, hubs, SAN switch) and generating alarms when they are not seen. We also have a permanent DECnet link between the node and the site manager node. If that link is dropped we also get an alarm.

But of course these alarms are also given during a normal shutdown, in which case the operational guys are warned.

Wim

Robert_Boyd · 02-15-2005

I checked out your suggestion Volker -- apparently what you suggested won't work on every system. I checked several Alphas running V7.2-1 and V7.3-2 here and the ones I checked do NOT have the "shutdown requested" string in the OPERATOR.LOG file. This seems a bit odd to me, but in any case -- the important thing is that there is the "Logfile was closed by operator" message. It is possible to search for that instead. However, that message does not guarantee that the closing of the file was for the shutdown. If however you're checking the end of the ;-1 version during a boot, I think that would be a reliable check.

Another way that I have handled a similar requirement in the past was to create a detached process that runs all the time in the background. It wakes up every few seconds and does 2 things -- it updates a timestamp in a file, and it checks for the appearance of the logical name SHUTDOWN$TIME. SYS$SYSTEM:SHUTDOWN.COM defines the logical name in the system table when a shutdown is in progress. The detached process was set up to check to see how close to the shutdown time it was and when it got to under 1 minute to do whatever final steps were needed to mark the shutdown and then exit. Then on system startup the code checked to see how much time had elapsed and if a shutdown occurred. It then wrote a log record recording the downtime. If the downtime was not associated with a shutdown then a crash or system hang was assumed and logged appropriately ( with an accompanying email message to the system managers).
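
A simplified sketch of such a watcher, to be run as a detached process. It is not the original procedure: the file name SYS$MANAGER:HEARTBEAT.DAT, the 10-second interval, and the RUNNING/SHUTDOWN tags are assumptions for illustration, and it only records whether SHUTDOWN$TIME is defined instead of counting down to the shutdown time.

```dcl
$! HEARTBEAT.COM - sketch of a detached watcher (all names are hypothetical)
$ SET NOON
$loop:
$ OPEN/WRITE/ERROR=sleep hb SYS$MANAGER:HEARTBEAT.DAT
$ IF F$TRNLNM("SHUTDOWN$TIME", "LNM$SYSTEM_TABLE") .NES. ""
$ THEN
$!   SHUTDOWN.COM has defined the logical: an orderly shutdown is in progress.
$    WRITE hb F$TIME(), " SHUTDOWN"
$ ELSE
$    WRITE hb F$TIME(), " RUNNING"
$ ENDIF
$ CLOSE hb
$! Keep the previous version in case the system dies mid-write.
$ PURGE/NOLOG/KEEP=2 SYS$MANAGER:HEARTBEAT.DAT
$sleep:
$ WAIT 00:00:10
$ GOTO loop
```

At startup, before restarting the watcher, the last record can be read back: a RUNNING tag suggests a crash or hang at about the recorded time, while a SHUTDOWN tag suggests an orderly shutdown.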

Robert

Master you were right about 1 thing -- the negotiations were SHORT!

Dale A. Marcy · 02-15-2005

Robert,

Please check again. I just tried the command on an AlphaServer 4100 5/400 running VMS V7.3-2. At first it didn't work, and then I typed the tail of the log and noticed that I left the word "the" out of the search string. I repeated the command again adding the missing "the" into the sentence and it worked as advertised.

Volker Halle · 02-15-2005

Robert,

did you check on systems that were recently rebooted and had not yet created a new version of OPERATOR.LOG (with REPLY/LOG) ?

I've checked 3 systems here (VAX, Alpha, I64) and they all show the shutdown message at the bottom of OPERATOR.LOG.

Volker.

Robert_Boyd · 02-16-2005

Here's what I found out -- it is possible through the use of OPC$LOGFILE_CLASSES to control which messages do and don't go into the OPERATOR log. It is possible to have a setup on one node in a cluster where not all classes of messages go into the log. I've used this especially on satellite systems to limit the use of disk space for log files -- leaving most of that to the boot servers where most of the activity is logged. Also, sometimes I've set systems up so that network logging is restricted to being handled by the DECnet and TCPIP internal logging facilities in separate logfiles. Also, it doesn't always make sense to be chewing up disk space for audit journals AND sending security events to the plain text operator log. I prefer to have the security events only in the audit journal.

In any case, in the default configuration, it appears that the shutdown requested messages will be at the tail end of the operator log file if there is a normal shutdown sequence.

I wonder what shows up in the log file if you just run OPCCRASH without going through the usual shutdown sequence?

Volker Halle · 02-16-2005

Robert,

the 'shutdown was requested by the operator' message is explicitly sent from SHUTDOWN.COM, so there will be nothing in OPERATOR.LOG if you stop/crash the system using OPCCRASH.

Volker.

Ian Miller. · 02-16-2005

There will be an entry in the error log, recorded as an operator-requested shutdown, for either a normal shutdown or OPCCRASH.

____________________
Purely Personal Opinion

Jack Trachtman · 02-16-2005

I've decided to go with David's suggestion.

At the beginning of SYSHUTDWN, I create a 1-line file with the node name, time-stamp, & short msg. At the end of SYSHUTDWN, I append another similar msg (to indicate that SYSHUTDWN has run to its end).

In SYSTARTUP, if the file exists I display its contents and rename it. If it doesn't exist, I send e-mail/page.

The time stamp in the file will let me show amount of down time (if desired) for a regular shutdown.

Thanks to everyone.

This thread can be closed.
