
Re: Abnormalities

 
Wim Van den Wyngaert
Honored Contributor

Abnormalities

My management asked me to prepare procedures to detect abnormalities that can occur during disasters.

So the question is: what can go wrong during shutdown and startup? If possible, list problems for which you have detection procedures. The procedures must be written for dummies, not for system managers; if allowed, please post them.

My list of things that can go wrong :

1. During a proper shutdown a component is asked to shut down, but it doesn't come down AND it blocks the requestor, i.e. the shutdown procedure itself. Seen : decnet, tcpip, sybase.

2. During startup not all disks are seen. A manual "init" is the only solution. Very rarely a power cycle is needed.

3. During boot startup is blocked by a faulty script. Seen : mounting of disks, Sybase.

4. Startup failed due to changes in quotas that were done without reboot. Seen : Sybase, dsm.

5. Mount verification timeout causes cluster members to block the mounting of the disks. Seen : server blocked by all 10 stations.

6. Licenses that are only checked at startup have terminated. Seen : decevent , dsm.

7. Due to network overload network protocols didn't start correctly. Thus the whole system is unusable. Seen : decnet, tcpip.

8. Shadow sets get corrupted because the boot was interrupted at a bad time. Seen : sybase crashed with corrupt database.

9. Due to a network attack, the system gets 100% saturated because clients try to connect to the system. Seen : internal product WMS.


Wim
labadie_1
Honored Contributor

Re: Abnormalities

For 6, I would add: Shadow licence terminated...

I worked around it, but late in the evening, I was less than pleased :-(

I would add another one, seen in a Cluster of 4 VAX 6440 with a shadowed system disk: entire Cluster shutdown, and no reboot, because both copies of the system disk were missing some critical files. So avoid a complete Cluster shutdown.
Antoniov.
Honored Contributor

Re: Abnormalities

Wim,
no simple answer.
In my software I check for processes; a little procedure reads the process list (like show sys) and checks for the existence of some names.
For 7. you can check NET$ACP and LANACP (decnet plus)
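
A minimal sketch of such a process check in DCL, assuming the two process names mentioned above are the ones to look for. The procedure name CHECK_PROCS.COM and the contents of the list are illustrative only, and seeing other users' processes normally requires WORLD privilege.

```dcl
$! CHECK_PROCS.COM (hypothetical name)
$! Reports which of the expected processes are missing on this node.
$ expected = "NET$ACP,LANACP"            ! assumed list, adjust per node
$ missing  = ""
$ i = 0
$next_name:
$ name = f$element(i, ",", expected)
$ if name .eqs. "," then goto report     ! F$ELEMENT returns the delimiter at end of list
$ ctx = ""
$ dummy = f$context("PROCESS", ctx, "PRCNAM", name, "EQL")
$ pid = f$pid(ctx)                       ! "" when no process has that name
$ if pid .eqs. "" then missing = missing + " " + name
$! Release the context if F$PID did not already exhaust it
$ if f$type(ctx) .eqs. "PROCESS_CONTEXT" then -
     dummy = f$context("PROCESS", ctx, "CANCEL")
$ i = i + 1
$ goto next_name
$report:
$ if missing .nes. "" then write sys$output "Missing processes:" + missing
$ if missing .eqs. "" then write sys$output "All expected processes present"
$ exit
```

This only shows that a process exists, not that it is doing useful work.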

Antonio Vigliotti
Antonio Maria Vigliotti
Wim Van den Wyngaert
Honored Contributor

Re: Abnormalities

Antonio,

yes you (and I) check processes. But being there is not sufficient. How do you know it is working ? 90% of network startup problems create the process but it is simply not working (try a cluster boot of 20 stations without system disk and this in a saturated network). I wonder how long it would take if a global power failure caused ALL machines to boot (must be 10.000+ over here).

I added e.g. a test on the decnet link between a program running on all nodes and a central node. If the link is missing, I get an alarm. But the alarm can only be delivered if there is a network. So I keep the alarms in local files too.

Wim
Wim
Willem Grooters
Honored Contributor

Re: Abnormalities

Like Antonio said: no simple answer.
The only sure thing you can control is system startup. There is no full control over system shutdown: power failure, crashes just happen.

So what you will need at startup is to know how the system came to a halt, and in what state. You may need to prevent actions, or to start them, because of that history.
This implies you should "know the system": what devices, what programs, what queues, what licenses etc.
You should also know the troubles that may arise in case of an improper shutdown, network congestion, missing (or expired) licenses and so on, and what action should be taken if this has been found the case - and what should NOT be done.

It will require quite some thought and decisions to get it all right, and quite some programming (in DCL and native programs). I've done some work on this on a number of systems - though none of them very complex - but it paid off.

A global description of these implementations is attached. I cannot go into details without knowing the environment.

Willem
Willem Grooters
OpenVMS Developer & System Manager
Wim Van den Wyngaert
Honored Contributor

Re: Abnormalities

A global description of these implementations is attached ?

Where ?

I already told management that life is not that simple. But they don't believe me ...

Wim
Wim
Willem Grooters
Honored Contributor

Re: Abnormalities

Wim,

There is something between this machine and the Internet that causes quite some trouble. Like losing attachments in a second attempt after being timed out. (Jan van den Ende knows EXACTLY what I'm talking about)

This should have it.

Willem
Willem Grooters
OpenVMS Developer & System Manager
Antoniov.
Honored Contributor

Re: Abnormalities

Wim,
I guess your management is asking you for the moon :-)

In my mind you can check active processes, then various logical names, and monitor system resources.
You made a good test when you checked the network among the nodes; I guess you can do the same kind of test for tcp/ip by checking, for example, an ftp transfer.
You could run a watchdog process that periodically checks the process list and system quotas, and this can prevent some trouble. For each of your points, the watchdog can:
1. Stop itself when a shutdown is requested
2. Check that disks are mounted and, where relevant, their error counters
3. Same as #2
4. Check system quotas using the F$GETSYI function
5. Not applicable; see #2
6. Check active licenses
7. Check communication among nodes
8. Shadow check as in #2
9. Look for processes in RWAST, MUTEX or COM state

To be sure this works, you need 2 watchdogs:
1 process at very low priority (always 1) that performs all the checks, and 1 at realtime priority (16 or higher) that only checks whether the low-priority process is still running.
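
As an illustration of watchdog items 2 and 8, here is a minimal DCL sketch that checks whether a list of disks exists and is mounted, and reports non-zero error counts. The device names are placeholders, not taken from anyone's configuration.

```dcl
$! Sketch of a disk check for a watchdog; device list is hypothetical
$ disks = "DSA1:,DSA2:"
$ i = 0
$next_disk:
$ dev = f$element(i, ",", disks)
$ if dev .eqs. "," then goto done        ! end of list
$ if .not. f$getdvi(dev, "EXISTS")
$ then
$   write sys$output dev + " does not exist"
$ else
$   if .not. f$getdvi(dev, "MNT") then write sys$output dev + " is not mounted"
$   errs = f$getdvi(dev, "ERRCNT")       ! device error count
$   if errs .gt. 0 then write sys$output dev + " error count: " + f$string(errs)
$ endif
$ i = i + 1
$ goto next_disk
$done:
$ exit
```

A real watchdog would run this in a loop with a WAIT between passes, and would raise an alarm rather than write to SYS$OUTPUT.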

Perhaps it's better to look for existing software that can help you.

Antonio Vigliotti
Antonio Maria Vigliotti
Jan van den Ende
Honored Contributor

Re: Abnormalities

Wim,

Only part of your question, of course.

We have a way to prevent database shutdown hanging.
Since most databases prefer to be started and shut by their own management account anyway, we decided to do it that way for all of them, even if not required.

Create a batch job, to be submitted to an account suitable to shut down the database.

The first thing the job does, is spawn a "suicide procedure".
This procedure determines the ID of its parent, and waits a specified time.
If the batchjob finishes before that time, the subprocess is just terminated.
If the timer expires, the subprocess does STOP/ID for the parent.
The master shutdown routine synchronises for the various batchjobs, inspects the final status and if needed takes action.
That could in your case maybe CRASH the database???
--- your restart must obviously be able to cope with that anyway, because if your machine crashes hard you will also have to recover from your crash.

All you have achieved by this is preventing your database shutdown from hanging for too long.
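
A minimal sketch of the idea in DCL, under assumptions: the file names DB_SHUTDOWN.COM, SUICIDE.COM and DB_STOP.COM and the 10-minute limit are invented for illustration, and the pager notification and master synchronisation are left out.

```dcl
$! ---- DB_SHUTDOWN.COM (hypothetical), submitted to the database account ----
$! Start the timer subprocess first, then do the real work.
$ spawn/nowait @suicide "00:10:00"       ! assumed limit: 10 minutes
$ @db_stop                               ! hypothetical database stop procedure
$ exit                                   ! on exit the subprocess is deleted too
$!
$! ---- SUICIDE.COM (hypothetical), runs in the spawned subprocess ----
$! P1 = delta time to wait before killing the parent
$ parent = f$getjpi("", "OWNER")         ! PID of the process that spawned us
$ wait 'p1'
$ write sys$output "Shutdown job ''parent' exceeded ''p1', stopping it"
$ stop/id='parent'
$ exit
```

If the batch job finishes in time, its process is deleted and takes the subprocess with it, so the STOP/ID is only reached when the shutdown hangs.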

-- we implemented this mainly for our daily backups. We have some databases without online-backup facility, and our user organisation prefers a few minutes downtime each night over a doubtful backup.
But when a shutdown may take so many hours that A. the database allows no new users nor transactions and B. the real backup functionality starts running at daytime, taking away much of the user performance, then you have to find a way around it.

Now our suicide routine also has the option to send a pager message to the system manager on duty, so (s)he can check the offending database, and take whatever action is required.

Bottom line: time-delayed suicide for hanging shutdowns, if needed followed by hard ( = kill ) measures.


hth


Jan
Don't rust yours pelled jacker to fine doll missed aches.
Ian Miller.
Honored Contributor

Re: Abnormalities

You also need application-specific test scripts to be run after system startup to check basic application function. The users are only interested in application availability and correct function.
____________________
Purely Personal Opinion
Wim Van den Wyngaert
Honored Contributor

Re: Abnormalities

Willem,

I have a different strategy. EVERYTHING must be stopped during the shutdown. I do it sequentially : applications then databases then network ... and all parts that are not reliable are done with a spawn/nowait and get a maximum number of seconds to complete. Otherwise they are ignored and shutdown continues.
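
A minimal sketch of one such time-limited step in DCL. The subprocess name STOP_COMP, the procedure STOP_COMPONENT.COM and the 60-second limit are assumptions for illustration, not the actual shutdown code.

```dcl
$! One unreliable shutdown step, given at most max_secs seconds
$ max_secs = 60                          ! assumed limit
$ spawn/nowait/process="STOP_COMP" @stop_component   ! hypothetical procedure
$ waited = 0
$poll:
$ ctx = ""
$ dummy = f$context("PROCESS", ctx, "PRCNAM", "STOP_COMP", "EQL")
$ pid = f$pid(ctx)                       ! "" once the subprocess is gone
$ if f$type(ctx) .eqs. "PROCESS_CONTEXT" then -
     dummy = f$context("PROCESS", ctx, "CANCEL")
$ if pid .eqs. "" then goto next_step
$ if waited .ge. max_secs then goto gave_up
$ wait 00:00:05
$ waited = waited + 5
$ goto poll
$gave_up:
$ write sys$output "STOP_COMP still running after ''max_secs' seconds, ignoring it"
$next_step:
$! ... continue with the next component in the sequence ...
$ exit
```

The hung subprocess is deliberately left alone here, matching the "ignored and shutdown continues" approach; a STOP/ID at gave_up would be the harder alternative.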

Ian, you are right. But this has become very difficult in a multi-tier environment. Some programs need 10 minutes to load the database into memory, databases may have to rollback a lot, etc.

100% checking is very difficult; be glad to arrive at 50% (i.e. the main parts).

Nobody had other problems ? Or a REAL DRP situation ?

I have a number 10. Is it specific to me ?

10. If cluster nodes are unable to talk to each other using decnet, the mini-copy will not be used to copy the disks between sites. Thus a shadow copy that takes at least 1 full day will be triggered. This must be avoided.





Wim
Willem Grooters
Honored Contributor

Re: Abnormalities

Antonio,

The usefulness of any watchdog program lies in a _running_ system where you require immediate action if something fails.
Moreover, I think it will be one of the FIRST images to stop when shutting a system down, for one quite obvious reason: in whatever way you do it, the shutdown procedure implies actions that trigger a watchdog to react. I'm quite sure that operators and system managers will NOT be happy with the result were a watchdog running during shutdown... It's the system status AT THE POINT OF SYSTEM SHUTDOWN that matters at startup.
As a system manager, you know quite well what should be running, or at least, you should. A watchdog could be helpful, no doubt, to keep track. But I'd rather know what the status was at system stop. A watchdog can't tell me, so it's not a solution.

I do agree you need at least two watchdogs, but I would consider another configuration: one on THIS system to see all is well over here, and one on ANOTHER system to keep an eye on THIS one. In your config, I won't recognize a sudden death of one machine: BOTH watchdogs gone...
Chances are BOTH systems go down; I can only be alarmed about that by having a THIRD watchdog, far from both others, to check the sanity of the other two...

Well, you can go on "ad absurdum". Two is feasible enough in most cases, three in over 99% ;-)

Willem
Willem Grooters
OpenVMS Developer & System Manager
Wim Van den Wyngaert
Honored Contributor

Re: Abnormalities

Willem,

We have triple monitoring. 1 node monitors the monitoring nodes, all 3 monitoring nodes monitor all systems, and all systems monitor themselves and their cluster members.

The alarms of the nodes are kept on 1 of the 3 monitoring nodes in a database, in a flat file (db down) and on the local node on a cluster shared disk. Whenever I login to the system I do a type/tail of that file.

Antonio is right about one thing : a good monitoring should have 2 processes. 1 high prio and 1 low.

Wim
Wim
Antoniov.
Honored Contributor

Re: Abnormalities

Willem,
as you posted, "a system manager knows quite well what should be running", and I agree with you: a watchdog can only track some values.
But Wim's managers are asking him for a very, very hard solution for dummies; everybody here knows it's not possible and Wim can only find some partial solution.
I guess Wim is asking for ideas about reasonable tests, since he has already activated various monitoring applications.

Antonio Vigliotti
Antonio Maria Vigliotti
Wim Van den Wyngaert
Honored Contributor

Re: Abnormalities

The problem is that management wants dummies to be in charge of DRP (with paper procedures). When I say TYPE they enter TAIP.

Wim
Wim
Willem Grooters
Honored Contributor

Re: Abnormalities

Wim,

As usual - there are more roads leading to Rome. It all depends on the system and what's on it. In my case, separation of application and system shutdown is based on a consideration given the nature of the beast. An orderly, controlled _and_logged_ shutdown of applications BEFORE the system is brought down has big advantages.

The reason I put up this scheme is simple:
I could not use SUBMIT or even SPAWN/LOG in SYSHUTDWN.COM:
- SUBMIT would be nice to log the shutdown of processes, but queues are stopped....
- SPAWN/LOG would be nice for the same reason, and for the reasons Jan already pointed out. But interactive logins are disabled...
So that's why stopping all user processes, programs, databases et al is done before SHUTDOWN.COM is invoked - logged, in batch.
(If anyone knows a way of doing either of these: feel free to give the hint)

As a side effect - and the way it has been set up - I also have the ability to stop it all and bring it all up again - without a system reboot. In full, or in part. Call it an "application reboot". Great if you have an application update and no time (or reason) to reboot the machine.

One shortcoming, agreed on that: SYSHUTDWN.COM should also contain at least a (large) part of the preparation procedures as well, to be executed when needed. But run in context of SHUTDOWN: Interactively, and with ^Y and ^C enabled ;-)

Willem
Willem Grooters
OpenVMS Developer & System Manager
Wim Van den Wyngaert
Honored Contributor

Re: Abnormalities

Willem,

We do our shutdown in a detached process.
ALL nodes start/stop things in the same sequence. Every node can only decide whether it wants the component or not.

Batch queues are autostart and are disabled at the start of the shutdown. Procedures that do a submit/user must do so in startup$batch, a queue that is not stopped. I do a sync afterwards.
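
A minimal sketch of that submit-and-sync pattern in DCL. The user name, procedure name and log file are placeholders; SUBMIT/USER needs the appropriate privileges (CMKRNL and OPER).

```dcl
$! Run a stop procedure under another account in a queue that stays up
$ submit/queue=startup$batch -
        /user=SYBASE -
        /log_file=sys$manager:stop_db.log -
        sys$manager:stop_db.com          ! hypothetical user, log and procedure
$ synchronize/entry='$entry'             ! SUBMIT stored the entry number in $ENTRY
$ job_status = $status                   ! completion status of the job
$ write sys$output "stop_db finished with status ", job_status
$ exit
```

SYNCHRONIZE waits without a time limit, so a job that hangs will still block the caller unless it is combined with a timeout of some kind.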

Wim
Wim
Wim Van den Wyngaert
Honored Contributor

Re: Abnormalities

$! REBOOT.COM
$! ==========
$! Executes a reboot in a detached process so that it can be launched from
$! any terminal, including a set host or telnet connection.
$!
$!==============================================================================
$ nodename = f$getsyi("NODENAME")
$! Second pass: we are the detached "-< Reboot >-" process, so do the shutdown.
$ if f$mode() .eqs. "OTHER" .and. f$getjpi("","PRCNAM") .eqs. "-< Reboot >-"
$ then
$   goto do_it
$ endif
$! First pass: start a detached copy of this procedure, then exit.
$ inp = f$environment("PROCEDURE")
$ if f$mode() .eqs. "BATCH" then inp = "sys$manager:reboot.com" ! bypass bug
$ run /detach sys$system:loginout -
      /uic = [1,4] -
      /process_name = "-< Reboot >-" -
      /authorize -
      /input = 'inp' -
      /output = sys$common:[sysmgr]reboot_'nodename'.log
$ exit
$
$do_it:
$ set output_rate = 00:00:05      ! write buffered log output every 5 seconds
$! P1=minutes, P2=reason, P3=spin down disks, P4=run site-specific shutdown,
$! P5=when the system comes back, P6=automatic reboot, P7=shutdown options
$ @sys$system:shutdown 0 "Reboot" N Y "Immediately" Y "REBOOT_CHECK,REMOVE_NODE"
$ exit

Wim
Willem Grooters
Honored Contributor

Re: Abnormalities

Wim,

As said: you can get to Rome along different roads. You can use different means of transport. You can do it fast, or slow.
Eventually, you enter Rome.
That's what counts.

Serious:
If I'm right, your requirements are:
* detection of problem areas during shutdown
* detection of problem areas during startup
* ways to overcome the problems.

First of all, I would opt for PREVENTION of problems first, if possible. And certainly, if that problem would interfere with other processes on the system (for instance: your point 1)

Second, KNOW YOUR SYSTEM. You should know your devices, applications and databases, and their dependencies.
Last, but not least: educate your programmers - and software suppliers (I know, easier said than done. But YOU are responsible for a running system.). Require methods for proper close down of images (instead of just killing the process), require a list of dependencies. Being (half) a developer myself, I know all about it....

Obvious, isn't it?

In all cases: be aware of the dependencies. If something cannot be done because a requirement cannot be fulfilled: bypass it.

In some issues, a (high priority) watchdog can be helpful. But there still is the problem how to inform the startup or shutdown process of the situation - and how to react.

I'll see if I can dig up some examples from the archives - when time permits.

Willem
Willem Grooters
OpenVMS Developer & System Manager
Willem Grooters
Honored Contributor

Re: Abnormalities

To all:

Just read the comments on Wim's situation.
Well, a short play:

HOW TO RUN A F1 CAR AND WIN THE RACE

Persons:
Mr. Dummy: Slow man, shabby clothes. Barely able to ride a bicycle.
Manager: Sharply dressed man. Boss of Mr. Dummy

Location: Manager's office.
(Manager is sitting behind a huge desk. In front of him a bunch of papers.
To one side, behind a huge window with a door, an F1 car, humming).

Intercom:
"BZZZZ. Mr Dummy has arrived"

Manager:
"Show him in"

(Door opens, Dummy enters. Walks timidly to Manager's desk)

Manager:
(Pointing to F1 car)
"We want this Formula-1 car being driven to its limits. This is how to do it."

(He hands over a pile of paper to Mr. Dummy)

Manager:
"You are expected to ensure:
* the car wins,
* the car arrives in one piece
* the car arrives without damage
* the car will be able to run the next race without problems. Time after time.

You're dismissed"

(Opens door to F1 car, Shows Mr. Dummy out)

(Manager retakes his seat behind the desk)
(Mr. Dummy climbs into the F1 car and hits the road)

I won't bet on the outcome. Sorry.

A mission critical system CAN NOT, and SHOULD NOT be run by DUMMIES.
Period.

Willem
Willem Grooters
OpenVMS Developer & System Manager
Antoniov.
Honored Contributor

Re: Abnormalities

Uh Willem,
great story, but be more optimistic :-)

This is my best story:
In December 1903 two brothers made the first human flight: about 12 seconds.
Some years later, a man flew a plane across the Atlantic.
Some years later (today) a big airplane can fly without human touch (automatic pilot).

I agree: mission critical can't be for dummies, but it's possible to leave it for some hours without the big expert. Not all the time, but for some hours (even system managers need sleep and holidays).

I'm an optimist.

Antonio Vigliotti
Antonio Maria Vigliotti
Jan van den Ende
Honored Contributor

Re: Abnormalities

Sorry Wim, but Willem's last post forces me to answer him directly.

Willem:
you work in "Politieland" as well. Have you already seen the plans on how to run the new multi-regio data centres that are now being deployed?

If not (and to demonstrate that Wim's position is not unique, and not even too bad):
ALL that operator staff will be able/authorized to do is insert a CD and click Finish. (Yes, also for VMS, Tru64 & AIX). After that they should phone Central with either "Ok" or "Failure"; in the latter case a new CD is to be delivered "soon".

And yes, there are to be NO system managers, sysadmins, administrators, nor SAN managers on site.

And of course the SLA specifies uninterrupted 365*7 availability of all apps, be they VMS, *UX or M$.

btw: the guys at Central who are to make those CDs are the same ones who currently cannot even create a multi-user application. (Oh yes they can, and delivered "tested", but in the test case multi = 3. Our test case of multi = 15 showed incredible lock-waits, and the system it is to replace "because we need better performance" has so far only ever been stressed to 140 simultaneous users...)

Willem, are _WE_ telling Wim to try & educate _HIS_ management? _THEY_ at least _CONSIDER_ worst-case scenarios.

Wim, you will never get there, but you are way much closer than we can ever hope to come..

"May you live in interesting times"

jpe

Don't rust yours pelled jacker to fine doll missed aches.
Willem Grooters
Honored Contributor

Re: Abnormalities

To all:

I feel I owe you all an apology.
I lost my temper over these remarks; be glad I had already walked off my first anger. Yet I have the feeling I hit SUBMIT too fast. In my defence I'd like to say it was the last hour of a week filled with frustration, caused by mere stupidity.
Driving back home, I had a chance to review my sins, and came to the conclusion I'd rather ask the moderator to remove it.

My thoughts mellowed during that hour, and I can see the point.
For _general_ system management tasks, proper command procedures executed by 'dummies' are no problem at all, if supervised (and coached) by 'professionals'.

So Wim's task is to develop this kind of procedures. Of course I'm willing to help.

Willem
Willem Grooters
OpenVMS Developer & System Manager
Jan van den Ende
Honored Contributor

Re: Abnormalities

Willem,

DO NOT have them removed!!
Technically, you are totally and completely right!
Only, if you read more Dilbert, you will know what to expect from management.
After all, most of them are managers because they lack the ability to be good technicians.

:-(

jpe
Don't rust yours pelled jacker to fine doll missed aches.
Antoniov.
Honored Contributor

Re: Abnormalities

Willem,
I agree with Jan: don't remove it, because you are right. In an imperfect world we have to adapt to the scenario our management asks of us.

Antonio Vigliotti
Antonio Maria Vigliotti