
Abnormalities

 
Wim Van den Wyngaert
Honored Contributor

Abnormalities

My management asked me to prepare procedures to detect abnormalities that can occur during disasters.

So the question is: what can go wrong during shutdown and startup? If possible, list the things for which you have detection procedures. The procedures must be written for dummies, not for system managers. If you are allowed to, please post them.

My list of things that can go wrong :

1. During a proper shutdown, a component is asked to shut down but doesn't come down AND blocks the requestor, i.e. the shutdown procedure itself. Seen : decnet, tcpip, sybase.

2. During startup not all disks are seen. A manual "init" is the only solution. Very rarely a power cycle is needed.

3. During boot, startup is blocked by a faulty script. Seen : mounting of disks, Sybase.

4. Startup failed due to changes in quotas that were done without reboot. Seen : Sybase, dsm.

5. Mount verification timeout causes cluster members to block the mounting of the disks. Seen : server blocked by all 10 stations.

6. Licenses that are only checked at startup have terminated. Seen : decevent , dsm.

7. Due to network overload network protocols didn't start correctly. Thus the whole system is unusable. Seen : decnet, tcpip.

8. Shadow sets get corrupted because the boot was interrupted at a bad time. Seen : sybase crashed with corrupt database.

9. Due to a network attack, the system gets 100% saturated because clients try to connect to the system. Seen : internal product WMS.


Wim
28 REPLIES 28
labadie_1
Honored Contributor

Re: Abnormalities

For 6, I would add: Shadow licence terminated...

I worked around it, but late in the evening, so I was less than pleased :-(

I would add another one, seen in a Cluster of 4 VAX 6440 with a shadowed system disk: after an entire Cluster shutdown there was no reboot, because both copies of the system disk were missing some critical files. So avoid a complete Cluster shutdown.
Antoniov.
Honored Contributor

Re: Abnormalities

Wim,
there is no simple answer.
In my software I check for processes: a small procedure reads the process list (like SHOW SYSTEM) and checks that certain names exist.
For 7, you can check for NET$ACP and LANACP (DECnet-Plus).
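
To make the idea concrete, here is a minimal sketch in Python (on VMS itself this would more likely be DCL). It assumes the process listing has already been captured as text; the function name `missing_processes` and the sample listing are made up for illustration and are not taken from Antonio's software.

```python
def missing_processes(listing: str, required: list[str]) -> list[str]:
    """Return the required process names that do not appear in a process listing.

    `listing` is the captured text of a process display. The columns are not
    parsed: the text is split on whitespace, so process names that contain
    spaces would need smarter parsing.
    """
    tokens = set(listing.upper().split())
    return [name for name in required if name.upper() not in tokens]


# Hypothetical sample listing; a real display has more columns.
sample = """
00000401 SWAPPER         HIB     16
00000412 NET$ACP         HIB     10
00000415 TCPIP$INETACP   HIB     10
"""

print(missing_processes(sample, ["NET$ACP", "LANACP", "TCPIP$INETACP"]))
```

A check like this only shows that a process exists, not that it is doing useful work.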

Antonio Vigliotti
Antonio Maria Vigliotti
Wim Van den Wyngaert
Honored Contributor

Re: Abnormalities

Antonio,

Yes, you (and I) check processes. But being there is not sufficient: how do you know the process is working? In 90% of network startup problems the process is created but simply does not work (try a cluster boot of 20 stations without a system disk, on a saturated network). I wonder how long it would take if a global power failure caused ALL machines to boot (there must be 10,000+ over here).

For example, I added a test on the DECnet link between a program running on all nodes and a central node. If the link is missing, I get an alarm. But the alarm can only be delivered if the network is up, so I keep the alarms in local files too.
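
A rough sketch of that pattern in Python, not the actual implementation: the central node tracks when each node last reported, and every alarm is recorded locally before delivery over the network is attempted. The names `stale_nodes` and `raise_alarm`, and the use of a plain list as the local log, are assumptions for illustration.

```python
from typing import Callable


def stale_nodes(last_seen: dict[str, float], now: float, max_age: float) -> list[str]:
    """Return the nodes whose last report is older than max_age seconds."""
    return sorted(node for node, seen in last_seen.items() if now - seen > max_age)


def raise_alarm(message: str, send: Callable[[str], None], local_log: list[str]) -> bool:
    """Record the alarm locally first, then try to deliver it over the network.

    Returns True if delivery succeeded. `send` is a hypothetical transport
    function that raises OSError when the network is unavailable.
    """
    local_log.append(message)  # the local copy survives a network outage
    try:
        send(message)
        return True
    except OSError:
        return False
```

Writing the local record before the network attempt means the alarm survives even if the send call hangs or the network is the thing that failed.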

Wim
Willem Grooters
Honored Contributor

Re: Abnormalities

Like Antonio said: no simple answer.
The only thing you can be sure to control is system startup. There is no full control over system shutdown: power failures and crashes just happen.

So what you need at startup is to know how the system came to a halt, and in what state. Depending on that history you may need to prevent some actions, or start others.
This implies you should "know the system": what devices, what programs, what queues, what licenses etc.
You should also know the troubles that may arise in case of an improper shutdown, network congestion, missing (or expired) licenses and so on, what action should be taken if this is found to be the case, and what should NOT be done.
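
One common way to learn at startup how the system went down is a "clean shutdown" marker. The sketch below, in Python, only illustrates that idea and assumes nothing about Willem's own implementation; the function names and the marker file are hypothetical.

```python
from pathlib import Path


def record_clean_shutdown(marker: Path) -> None:
    """Call as the last step of an orderly shutdown."""
    marker.write_text("clean\n")


def previous_shutdown_was_clean(marker: Path) -> bool:
    """Call early during startup.

    The marker exists only if the last shutdown ran to completion. It is
    removed here so that a later crash or power failure shows up as unclean.
    """
    clean = marker.exists()
    if clean:
        marker.unlink()
    return clean
```

Startup can then branch on the result, for example running recovery steps or holding back the application start after an unclean stop.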

It will require quite some thought and decisions to get it all right, and quite some programming (in DCL and native programs). I've done some work on this on a number of systems - though none of them very complex - but it paid off.

A global description of these implementations is added as an attachment. I cannot go into details without knowing the environment.

Willem
Willem Grooters
OpenVMS Developer & System Manager
Wim Van den Wyngaert
Honored Contributor

Re: Abnormalities

A global description of these implementations is added? Where?

I already told management that life is not that simple. But they don't believe me ...

Wim
Willem Grooters
Honored Contributor

Re: Abnormalities

Wim,

There is something between this machine and the Internet that causes quite some trouble, like losing attachments on a second attempt after the first one timed out. (Jan van den Ende knows EXACTLY what I'm talking about.)

This should have it.

Willem
Willem Grooters
OpenVMS Developer & System Manager
Antoniov.
Honored Contributor

Re: Abnormalities

Wim,
I guess your management is asking you for the moon :-)

As I see it, you can check active processes, then various logical names, and monitor system resources.
Testing the network between the nodes was a good idea; I guess you can test TCP/IP the same way, for example with an FTP transfer.
You could run a watchdog process that periodically checks the process list and system quotas, which can prevent some trouble. For each of your points, the watchdog can:
1. Stop itself when a shutdown is requested
2. Check that disks are mounted, and possibly their error counters
3. Same as #2
4. Check system quotas using the F$GETSYI function
5. Not applicable; see #2
6. Check active licenses
7. Check communication among nodes
8. Shadow set check, as in #2
9. Look for processes in RWAST, MUTEX or COM state

To be sure this works, you need 2 watchdogs:
one process at very low priority (always 1) that performs all the operations, and one at real-time priority (prio > 16) that only checks whether the low-priority process is still running.
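
The two-watchdog arrangement could be sketched like this in Python. This is a simplified illustration, not a VMS implementation: the worker runs a table of named checks, and the supervisor only looks at the age of the worker's last heartbeat. The names `run_checks` and `worker_is_alive` are hypothetical, and process priorities are left out.

```python
from typing import Callable


def run_checks(checks: dict[str, Callable[[], bool]]) -> list[str]:
    """Worker side: run every check and return the names of those that failed.

    A check that raises an exception counts as failed, so one broken check
    cannot stop the remaining ones from running.
    """
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False
        if not ok:
            failed.append(name)
    return failed


def worker_is_alive(last_heartbeat: float, now: float, max_silence: float) -> bool:
    """Supervisor side: the worker is alive if it reported recently enough."""
    return now - last_heartbeat <= max_silence
```

Keeping the supervisor this small is deliberate: the less it does, the less likely it is to hang on the same resource that blocked the worker.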

Perhaps it would be better to look for existing software that can help you.

Antonio Vigliotti
Antonio Maria Vigliotti
Jan van den Ende
Honored Contributor

Re: Abnormalities

Wim,

This covers only part of your question, of course.

We have a way to prevent database shutdown hanging.
Since most databases prefer to be started and shut down by their own management account anyway, we decided to do it that way for all of them, even if not required.

Create a batch job, to be submitted under an account suitable for shutting down the database.

The first thing the job does is spawn a "suicide procedure".
This procedure determines the ID of its parent and waits a specified time.
If the batch job finishes before that time, the subprocess is simply terminated.
If the timer expires, the subprocess does a STOP/ID on the parent.
The master shutdown routine synchronises on the various batch jobs, inspects their final status and takes action if needed.
In your case that could perhaps mean CRASHing the database?
--- your restart must obviously be able to cope with that anyway, because if your machine crashes hard you will also have to recover from the crash.

All you achieve by this is preventing your database shutdown from hanging for too long.

-- We implemented this mainly for our daily backups. We have some databases without an online backup facility, and our user organisation prefers a few minutes of downtime each night over a doubtful backup.
But when a shutdown may take so many hours that (A) the database accepts no new users or transactions and (B) the real backup starts running in the daytime, taking away much of the user performance, then you have to find a way around it.

Now our suicide routine also has the option to send a pager message to the system manager on duty, so (s)he can check the offending database and take whatever action is required.

Bottom line: time-delayed suicide for hanging shutdowns, followed if needed by hard ( = kill ) measures.
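
For readers outside VMS, the same deadline idea can be sketched in Python. Note that the arrangement is inverted compared to the description above: here the supervising routine kills the shutdown job when the deadline passes, instead of a spawned subprocess stopping its parent. The function name and the returned status strings are assumptions for illustration.

```python
import subprocess
import sys


def run_with_deadline(cmd: list[str], timeout_s: float) -> str:
    """Run a shutdown command and kill it if it exceeds the deadline.

    Returns "ok" (exit status 0), "failed" (non-zero exit status) or
    "killed" (deadline expired), so the master routine can decide what
    to do next.
    """
    proc = subprocess.Popen(cmd)
    try:
        status = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        proc.kill()   # the hard measure
        proc.wait()   # reap the process after killing it
        return "killed"
    return "ok" if status == 0 else "failed"


if __name__ == "__main__":
    # Stand-in for a hanging database shutdown: a child that sleeps too long.
    hanging = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_deadline(hanging, timeout_s=1.0))
```

A "killed" result is the natural place to hook in a pager message or an escalation.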


hth


Jan
Don't rust yours pelled jacker to fine doll missed aches.
Ian Miller.
Honored Contributor

Re: Abnormalities

You also need application-specific test scripts, run after system startup, to check basic application function. The users are only interested in application availability and correct functioning.
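
As a small illustration, one basic post-startup test is whether a service actually accepts connections, rather than just having a process present. The Python sketch below assumes the application listens on a TCP port; the function name `port_answers` is hypothetical, and a real test script would go further and exercise an application-level transaction.

```python
import socket


def port_answers(host: str, port: int, timeout_s: float) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```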
____________________
Purely Personal Opinion