
Re: Abnormalities

 
Wim Van den Wyngaert
Honored Contributor

Re: Abnormalities

Willem,

I have a different strategy. EVERYTHING must be stopped during the shutdown. I do it sequentially : applications then databases then network ... and all parts that are not reliable are done with a spawn/nowait and get a maximum number of seconds to complete. Otherwise they are ignored and shutdown continues.
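A minimal sketch of this "bounded wait" idea, written in Python rather than DCL purely for illustration. The step names, commands and time limits are hypothetical placeholders, not the actual procedures: each step runs in order, gets a maximum number of seconds, and a step that overruns is abandoned so the shutdown continues.

```python
import subprocess
import sys

def run_step(argv, max_seconds):
    """Run one shutdown step; abandon it if it exceeds max_seconds."""
    try:
        subprocess.run(argv, timeout=max_seconds, check=True)
        return "ok"
    except subprocess.TimeoutExpired:
        # The child is killed by subprocess.run; the shutdown simply moves on.
        return "timeout"
    except (subprocess.CalledProcessError, OSError):
        return "failed"

def shutdown(steps):
    """Run the steps sequentially and report the outcome of each one."""
    return [(name, run_step(argv, max_seconds)) for name, argv, max_seconds in steps]

# Hypothetical steps: the "database" step hangs and is given up after 1 second.
PY = sys.executable
steps = [
    ("applications", [PY, "-c", "pass"], 5),
    ("database", [PY, "-c", "import time; time.sleep(30)"], 1),
    ("network", [PY, "-c", "pass"], 5),
]
results = shutdown(steps)
print(results)
```

The outcome list doubles as a record of what did not stop cleanly, which is useful to have at the next startup.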

Ian, you are right. But this has become very difficult in a multi-tier environment. Some programs need 10 minutes to load the database into memory, databases may have to rollback a lot, etc.

100% checking is very difficult; be glad to arrive at 50% (i.e. the main parts).

Nobody had other problems? Or a REAL DRP situation?

I have a number 10. Is it specific to me?

10. If cluster nodes are unable to talk to each other using DECnet, the mini-copy will not be used to copy the disks between sites. Thus a shadow copy that takes at least 1 full day will be triggered. This must be avoided.





Wim
Willem Grooters
Honored Contributor

Re: Abnormalities

Antonio,

The usefulness of any watchdog program lies in a _running_ system where you require immediate action if something fails.
Moreover, I think it will be one of the FIRST images to stop when shutting a system down, for one quite obvious reason: in whatever way you do it, the shutdown procedure implies actions that trigger a watchdog to react. I'm quite sure that operators and system managers will NOT be happy with the result were a watchdog running during shutdown... It's the system status AT THE POINT OF SYSTEM SHUTDOWN that matters at startup.
As a system manager, you know quite well what should be running, or at least, you should. A watchdog could be helpful, no doubt, to keep track. But I'd rather know what the status was at system stop. A watchdog can't tell me, so it's not a solution.

I do agree you need at least two watchdogs, but I would consider another configuration: one on THIS system to see all is well over here, and one on ANOTHER system to keep an eye on THIS one. In your config, I won't recognize a sudden death of one machine: BOTH watchdogs gone...
If chances are that BOTH go down, I can only be alerted by having a THIRD watchdog - far from both others - to check the sanity of the other two....

Well, you can go on "ad absurdum". Two is feasible enough in most cases, three in over 99% ;-)
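
To make the "who watches the watchdogs" point concrete, here is a rough sketch in Python (not DCL). It assumes each watchdog reports a heartbeat timestamp to a third observer; the names `local` and `remote` and the 30-second limit are made up for the example.

```python
def silent_watchdogs(last_seen, now, max_age):
    """Return the watchdogs whose last heartbeat is older than max_age seconds."""
    return sorted(name for name, ts in last_seen.items() if now - ts > max_age)

# Heartbeat times (in seconds) as recorded by the hypothetical third observer.
last_seen = {"local": 100.0, "remote": 100.0}

print(silent_watchdogs(last_seen, now=105.0, max_age=30))  # both still fresh
print(silent_watchdogs(last_seen, now=200.0, max_age=30))  # both have gone silent
```

Only an observer that survives the failure of both machines can report the second case, which is the argument for placing the third watchdog far from the other two.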

Willem
Willem Grooters
OpenVMS Developer & System Manager
Wim Van den Wyngaert
Honored Contributor

Re: Abnormalities

Willem,

We have triple monitoring. 1 node monitors the monitoring nodes, all 3 monitoring nodes monitor all systems, and all systems monitor themselves and the cluster members.

The alarms of the nodes are kept on 1 of the 3 monitoring nodes in a database, in a flat file (db down) and on the local node on a cluster shared disk. Whenever I log in to the system I do a type/tail of that file.
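
As an illustration only, the "database first, flat file when the db is down" rule could look like the Python sketch below. SQLite stands in for the real database, and the file names are hypothetical; the `tail` helper mimics the type/tail check at login.

```python
import os
import sqlite3
import tempfile

def record_alarm(message, db_path, fallback_path):
    """Store an alarm in the database; fall back to a flat file if the db is down."""
    try:
        conn = sqlite3.connect(db_path)
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute("CREATE TABLE IF NOT EXISTS alarms (message TEXT)")
                conn.execute("INSERT INTO alarms VALUES (?)", (message,))
        finally:
            conn.close()
        return "db"
    except sqlite3.Error:
        with open(fallback_path, "a") as f:
            f.write(message + "\n")
        return "file"

def tail(path, n=10):
    """Return the last n lines of the flat file."""
    with open(path) as f:
        return f.read().splitlines()[-n:]

# Usage: a database path in a missing directory simulates "db down".
with tempfile.TemporaryDirectory() as d:
    log = os.path.join(d, "alarms.log")
    where = record_alarm("disk full", os.path.join(d, "missing", "alarms.db"), log)
    print(where, tail(log))
```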

Antonio is right about one thing: good monitoring should have 2 processes, 1 high priority and 1 low.

Wim
Wim
Antoniov.
Honored Contributor

Re: Abnormalities

Willem,
as you posted, "a system manager knows quite well what should be running", and I agree with you: a watchdog can only track some values.
But Wim's managers are asking him for something very, very hard: a solution for dummies. Everybody here knows it's not possible, and Wim can only find some small solutions.
I guess Wim is asking for ideas about reasonable tests, after he has already activated various monitoring applications.

Antonio Vigliotti
Antonio Maria Vigliotti
Wim Van den Wyngaert
Honored Contributor

Re: Abnormalities

The problem is that management wants dummies to be in charge of DRP (with paper procedures). When I say TYPE they enter TAIP.

Wim
Wim
Willem Grooters
Honored Contributor

Re: Abnormalities

Wim,

As usual - there are more roads leading to Rome. It all depends on the system and what's on it. In my case, separation of application and system shutdown is based on a consideration given the nature of the beast. An orderly, controlled _and_logged_ shutdown of applications BEFORE the system is brought down has big advantages.

The reason I put up this scheme is simple:
I could not use SUBMIT or even SPAWN/LOG in SYSHUTDOWN.COM:
- SUBMIT would be nice to log the shutdown of processes, but queues are stopped....
- SPAWN/LOG would be nice for the same reason, and for the reasons Jan already pointed out. But interactive logins are disabled...
So that's why stopping all user processes, programs, databases et al is done before SHUTDOWN.COM is invoked - logged, in batch.
(If anyone knows a way of doing either of these: feel free to give the hint)

As a side effect - and the way it has been set up - I also have the ability to stop it all and bring it all up again - without a system reboot. In full, or in part. Call it an "application reboot". Great if you have an application update and no time (or reason) to reboot the machine.

One shortcoming, agreed on that: SYSHUTDOWN.COM should also contain at least a (large) part of the preparation procedures as well, to be executed when needed. But run in context of SHUTDOWN: Interactively, and with ^Y and ^C enabled ;-)

Willem
Willem Grooters
OpenVMS Developer & System Manager
Wim Van den Wyngaert
Honored Contributor

Re: Abnormalities

Willem,

We do our shutdown in a detached process.
ALL nodes start/stop things in the same sequence. Every node can only decide whether it wants the component or not.

Batch queues are autostart and are disabled at the start of the shutdown. Procedures that do a submit/user must do so in startup$batch, a queue that is not stopped. I do a sync afterwards.

Wim
Wim
Wim Van den Wyngaert
Honored Contributor

Re: Abnormalities

$! REBOOT.COM
$! ==========
$! Executes a reboot in a detached process so that it can be launched from
$! any terminal, including a set host or telnet connection.
$!
$!==============================================================================
$ nodename = f$getsyi("nodename")
$! Second pass: we are the detached "-< Reboot >-" process, so do the reboot.
$ if f$mode() .eqs. "OTHER" .and. f$getjpi(0,"prcnam") .eqs. "-< Reboot >-"
$ then
$   goto do_it
$ endif
$! First pass: relaunch this procedure as a detached process.
$ inp = f$environment("procedure")
$ if f$mode() .eqs. "BATCH" then inp = "sys$manager:reboot.com" ! bypass bug
$ run /detach sys$system:loginout -
    /uic = [1,4] -
    /process_name = "-< Reboot >-" -
    /authorize -
    /input = 'inp' -
    /output = sys$common:[sysmgr]reboot_'nodename'.log
$ exit
$
$do_it:
$ set output_rate = 00:00:05
$ @sys$system:shutdown 0 "Reboot" N Y "Immediately" Y "REBOOT_CHECK,REMOVE_NODE"
$ exit

Wim
Willem Grooters
Honored Contributor

Re: Abnormalities

Wim,

As said: you can get to Rome along different roads. You can use different means of transport. You can do it fast, or slow.
Eventually, you enter Rome.
That's what counts.

Seriously:
If I'm right, your requirements are:
* detection of problem areas during shutdown
* detection of problem areas during startup
* ways to overcome the problems.

First of all, I would opt for PREVENTION of problems, if possible. And certainly so if that problem would interfere with other processes on the system (for instance: your point 1).

Second, KNOW YOUR SYSTEM. You should know your devices, applications and databases, and their dependencies.
Last, but not least: educate your programmers - and software suppliers (I know, easier said than done. But YOU are responsible for a running system.). Require methods for proper close down of images (instead of just killing the process), require a list of dependencies. Being (half) a developer myself, I know all about it....

Obvious, isn't it?

In all cases: be aware of the dependencies. If something cannot be done because a requirement cannot be fulfilled: bypass it.
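
Once the list of dependencies exists, the start and stop order can be derived from it rather than maintained by hand. A small Python sketch, with made-up component names (a real system will have its own list):

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical components; each one maps to what must be running before it.
depends_on = {
    "network": [],
    "database": ["network"],
    "application": ["database"],
}

# Start components after everything they depend on; stop in the reverse order.
startup_order = list(TopologicalSorter(depends_on).static_order())
shutdown_order = startup_order[::-1]

print(startup_order)
print(shutdown_order)
```

A circular dependency makes `static_order()` raise `graphlib.CycleError`, which is itself a useful check on the list the suppliers hand over.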

In some issues, a (high priority) watchdog can be helpful. But there still is the problem how to inform the startup or shutdown process of the situation - and how to react.

I'll see if I can dig up some examples from the archives - when time permits.

Willem
Willem Grooters
OpenVMS Developer & System Manager
Willem Grooters
Honored Contributor

Re: Abnormalities

To all:

Just read the comments on Wim's situation.
Well, a short play:

HOW TO RUN AN F1 CAR AND WIN THE RACE

Persons:
Mr. Dummy: Slow man, shabby clothes. Barely able to ride a bicycle.
Manager: Sharply dressed man. Boss of Mr. Dummy

Location: Manager's office.
(Manager is sitting behind a huge desk. In front of him a bunch of papers.
Aside, behind a huge window with a door, an F1 car, humming).

Intercom:
"BZZZZ. Mr Dummy has arrived"

Manager:
"Show him in"

(Door opens, Dummy enters. Walks timidly to Manager's desk)

Manager:
(Pointing to F1 car)
"We want this Formula-1 car being driven to its limits. This is how to do it."

(He hands over a pile of paper to Mr. Dummy)

Manager:
"You are expected to assure:
* the car is winning,
* the car will arrive in one piece
* the car will arrive without damage
* the car will be able to run the next race without problems. Time after time.

You're dismissed"

(Opens door to F1 car, Shows Mr. Dummy out)

(Manager retakes his seat behind the desk)
(Mr. Dummy climbs into the F1 car and hits the road)

I won't bet on the outcome. Sorry.

A mission critical system CAN NOT, and SHOULD NOT be run by DUMMIES.
Period.

Willem
Willem Grooters
OpenVMS Developer & System Manager