Re: Abnormalities

 
Wim Van den Wyngaert
Honored Contributor

Re: Abnormalities

Thanks for all the comments.

In my opinion, everyone who touches the VMS systems should be certified for what they do.

An operator-dummy who just types things must be certified in recognizing when things go wrong. Recognizing that takes a lot of VMS knowledge. Thus operator-dummies are not allowed!

E.g. an operator-dummy has the task of rebooting. If the reboot fails for one of the 10 reasons, the system stays down until the expert arrives. This should be avoided. (I set shutdown$decnet_minutes to -1.)

Wim

Willem,

We have already had system crashes during unattended weekends. The node rebooted, all applications (mainly Sybase) restarted, and on Monday only 1 failed job was found. If the applications had not been started automatically, it would have been a mess.

But isn't your real problem that the VMS shutdown is lousy? DECnet is shut down before you get your hands on the system. I have bypassed all this and do the complete shutdown myself, DECnet included, after the applications have stopped.
Wim
Matt West
Advisor

Re: Abnormalities

After skimming through the responses, I don't recall seeing any mention of monitoring rules. The simplest thing is to script your health checks using a tool such as Robomon. If written to the right level, this should provide a quick and easy system check without spending untold time chasing your tail.
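
To make the idea of scripted health-check rules concrete, here is a minimal sketch in Python. It is not Robomon's rule syntax; the metric names (`disk_free_pct`, `failed_batch_jobs`, `decnet_up`) and the thresholds are hypothetical and only illustrate the shape of "one rule per thing that can go wrong".

```python
from typing import Callable, NamedTuple


class Rule(NamedTuple):
    """One health-check rule: a readable name plus a predicate."""
    name: str
    check: Callable[[dict], bool]  # returns True when the system is healthy


def run_checks(rules, metrics):
    """Return the names of all rules that fail for the given metrics."""
    return [rule.name for rule in rules if not rule.check(metrics)]


# Hypothetical rules and thresholds, for illustration only.
RULES = [
    Rule("disk_free_pct >= 10", lambda m: m["disk_free_pct"] >= 10),
    Rule("failed_batch_jobs == 0", lambda m: m["failed_batch_jobs"] == 0),
    Rule("decnet_up", lambda m: m["decnet_up"]),
]

if __name__ == "__main__":
    sample = {"disk_free_pct": 5, "failed_batch_jobs": 0, "decnet_up": True}
    for failed in run_checks(RULES, sample):
        print("FAILED:", failed)
```

An empty result means every rule passed, so the operator only has to look at the lines that are printed.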
Wim Van den Wyngaert
Honored Contributor

Re: Abnormalities

Matt,

We go one step further in some of our applications. The application "asks" to be monitored. This way you can stop and start the application without getting alarms (notice the alarms on the screens of most monitoring systems: the users have to know which alarms to ignore). As an extra, a restart command can specify multiple nodes. Thus if a node fails, the monitoring system can restart the application on another one.
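
A rough sketch of the "application asks to be monitored" idea, in Python. This is not our actual implementation; the class and method names, the node names and the `@start_sybase` command are all made up for illustration. The point is only that an application that has unregistered itself never raises an alarm, and that a registered one carries its own list of candidate nodes.

```python
from dataclasses import dataclass


@dataclass
class Registration:
    restart_command: str  # hypothetical command the monitor would run
    nodes: list           # candidate nodes, in order of preference


class Monitor:
    """Watches only the applications that asked to be watched."""

    def __init__(self):
        self._watched = {}

    def register(self, app, restart_command, nodes):
        """Called by the application at startup: 'please monitor me'."""
        self._watched[app] = Registration(restart_command, list(nodes))

    def unregister(self, app):
        """Called before a planned stop, so that no alarm is raised."""
        self._watched.pop(app, None)

    def handle_down(self, app, live_nodes):
        """Decide what to do when an application is seen to be down."""
        reg = self._watched.get(app)
        if reg is None:
            return ("ignore", None)      # planned stop or never registered
        for node in reg.nodes:
            if node in live_nodes:
                return ("restart", node)  # first preferred node still alive
        return ("alarm", None)            # no candidate node is available
```

For example, after `register("sybase", "@start_sybase", ["NODEA", "NODEB"])`, a failure of NODEA with only NODEB alive gives `("restart", "NODEB")`, while the same failure after `unregister("sybase")` gives `("ignore", None)`.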

Wim
Wim Van den Wyngaert
Honored Contributor

Re: Abnormalities

Digging up an old thread, but something else can go wrong.

We had a tape drive giving errors. The node crashed and restarted, but then crashed again. This kept repeating and resolved itself without intervention after 2 hours (SCSI timeout?). I have seen this before with CPU problems.

Wim


