
Re: Abnormalities

 
Antoniov.
Honored Contributor

Re: Abnormalities

Uh Willem,
great story, but be more optimistic :-)

This is my best story:
In December 1903 two brothers made the first human flight: about 10 seconds.
Some years later, a man flew a plane across the Atlantic.
Some years later (today), a big airplane can fly without human touch (automatic pilot).

I agree: mission critical can't be for dummies, but it is possible to leave it for some hours without the big expert. Not all the time, only for some hours (even system managers need sleep and holidays).

I'm an optimist.

Antonio Vigliotti
Antonio Maria Vigliotti
Jan van den Ende
Honored Contributor

Re: Abnormalities

Sorry Wim, but Willem's last post forces me to answer him directly.

Willem:
you work in "Politieland" as well. Have you already seen the plans for running the new multi-region data centres that are now being deployed?

If not (and to demonstrate that Wim's position is not unique, and not even too bad):
The ONLY thing operator staff will be able/authorized to do is insert a CD and click Finish. (Yes, also for VMS, Tru64 & AIX). After that they should phone Central with either "Ok" or "Failure"; in the latter case a new CD is to be delivered "soon".

And yes, there are to be NO system managers, sysadmins, administrators, nor SAN managers on site.

And of course the SLA specifies uninterrupted 365*7 availability of all apps, be they VMS, *UX or M$.

btw: the guys at Central who are to make those CDs are the same ones who currently cannot even create an application that is multi-user. (Oh yes they can, and it was delivered "tested", but in their test case multi = 3. Our test case of multi = 15 showed incredible lock-waits, and the system it is to replace "because we need better performance" has so far only ever been stressed to 140 simultaneous users...)

Willem, are _WE_ telling Wim to try & educate _HIS_ management? _THEY_ at least _CONSIDER_ worst-case scenarios.

Wim, you will never get there, but you are much closer than we can ever hope to come.

"May you live in interesting times"

jpe

Don't rust yours pelled jacker to fine doll missed aches.
Willem Grooters
Honored Contributor

Re: Abnormalities

To all:

I feel I owe you all an apology.
I lost my temper over these remarks; be glad I had already got past my first anger. Still, I have the feeling I hit SUBMIT too fast. In my defence I'd like to say it was the last hour of a week filled with frustration, caused by mere stupidity.
Driving back home, I had a chance to review my sins, and came to the conclusion that I'd rather ask the moderator to remove the post.

My thoughts mellowed during that hour, and I can see the point.
For _general_ system management tasks, proper command procedures executed by 'dummies' are no problem at all, if supervised (and coached) by 'professionals'.

So Wim's task is to develop these kinds of procedures. Of course I'm willing to help.

Willem
Willem Grooters
OpenVMS Developer & System Manager
Jan van den Ende
Honored Contributor

Re: Abnormalities

Willem,

DO NOT have them removed!!
Technically, you are totally and completely right!
Only, if you read more Dilbert, you will know what to expect from management.
After all, most of them are managers because they lack the ability to be good technicians.

:-(

jpe
Don't rust yours pelled jacker to fine doll missed aches.
Antoniov.
Honored Contributor

Re: Abnormalities

Willem,
I agree with Jan: don't remove it, because you are right. In an imperfect world we have to adjust the scenario as our management asks us to.

Antonio Vigliotti
Antonio Maria Vigliotti
Wim Van den Wyngaert
Honored Contributor

Re: Abnormalities

Thanks for all the comments.

In my opinion, everyone that touches the VMS systems should be certified for what he does.

An operator-dummy who just types things must be certified in recognizing things that go wrong. To recognize things that go wrong, you need a lot of VMS knowledge. Thus operator-dummies are not allowed!

E.g. an operator-dummy has the task of rebooting. If the reboot fails for 1 of the 10 possible reasons, the system is down until the expert arrives. This should be avoided. (I set shutdown$decnet_minutes to -1).

Wim

Willem,

we have already had system crashes during unattended weekends. The node rebooted and all applications (mainly Sybase) restarted, and on Monday only 1 failed job was found. If the applications had not been started automatically, it would have been a mess.

But isn't your real problem that the VMS shutdown is lousy? DECnet is shut down before you can get your hands on the system. I have bypassed all this and do the shutdown completely myself, DECnet included, and only after the applications have stopped.
Wim
Matt West
Advisor

Re: Abnormalities

After skimming through the responses, I don't recall seeing any mention of monitoring rules. The simplest thing is to script your health checks using a tool such as Robomon. If written at the right level, this should provide a quick and easy system check, without spending untold time chasing your tail.
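
A minimal sketch of what "scripted health checks" could look like, written in Python purely for illustration. It is not Robomon's rule syntax or API; the names `Rule`, `run_rules` and `disk_free_ok` are made up for this example, and the threshold values are placeholders.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    """One monitoring rule: a name plus a check that returns True when healthy."""
    name: str
    check: Callable[[], bool]


def run_rules(rules: List[Rule]) -> List[str]:
    """Run every rule and return the names of the rules that failed.

    A check that raises an exception counts as a failure, so a broken
    probe cannot silently hide a problem.
    """
    failed = []
    for rule in rules:
        try:
            healthy = rule.check()
        except Exception:
            healthy = False
        if not healthy:
            failed.append(rule.name)
    return failed


def disk_free_ok(free_blocks: int, minimum: int) -> bool:
    """Hypothetical threshold check; the caller supplies both numbers."""
    return free_blocks >= minimum
```

An operator then only has to look at the list returned by `run_rules`: empty means nothing to report, anything else is a named item to escalate.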
Wim Van den Wyngaert
Honored Contributor

Re: Abnormalities

Matt,

We go one step further in some of our applications. The application "asks" to be monitored. This way you can stop and start the application without getting alarms (note the alarms on the screens of most monitoring systems: the users have to know which alarms to ignore). As an extra, a restart command can specify multiple nodes. Thus, if a node fails, the monitoring system can restart the application on another one.
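
Roughly, the idea can be sketched like this (Python for illustration only; the class and method names are invented here and are not our actual implementation, and the node names are placeholders):

```python
from typing import Callable, Dict, List, Optional


class Monitor:
    """Watches only the applications that have asked to be monitored."""

    def __init__(self, is_node_up: Callable[[str], bool]):
        # is_node_up is supplied by the caller (assumption: some probe exists).
        self._is_node_up = is_node_up
        self._watched: Dict[str, List[str]] = {}

    def register(self, app: str, nodes: List[str]) -> None:
        """The application asks to be monitored; nodes are in order of preference."""
        self._watched[app] = list(nodes)

    def unregister(self, app: str) -> None:
        """A planned stop: the application withdraws, so no alarm is raised."""
        self._watched.pop(app, None)

    def should_alarm(self, app: str, app_running: bool) -> bool:
        """Alarm only for registered applications that are not running."""
        return app in self._watched and not app_running

    def restart_target(self, app: str) -> Optional[str]:
        """First node in the application's list that is up, or None."""
        for node in self._watched.get(app, []):
            if self._is_node_up(node):
                return node
        return None
```

Because a planned stop calls `unregister` first, the screen shows no alarm that the users would have to learn to ignore.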

Wim
Wim
Wim Van den Wyngaert
Honored Contributor

Re: Abnormalities

An old topic, but something else can go wrong.

We had a tape drive giving errors. The node crashed, restarted and crashed again. This repeated itself and was resolved without intervention after 2 hours (SCSI timeout?). I have seen this before with CPU problems.
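
A monitoring rule for this case could count boots within a time window. A small sketch in Python, assuming the boot timestamps (in seconds) are collected somewhere already; the function name and the thresholds are hypothetical:

```python
from typing import Iterable


def in_reboot_loop(boot_times: Iterable[float],
                   window_seconds: float,
                   max_boots: int) -> bool:
    """Return True if more than max_boots boots fall within any
    window of window_seconds."""
    times = sorted(boot_times)
    start = 0
    for end, t in enumerate(times):
        # Slide the window start forward until it spans at most window_seconds.
        while t - times[start] > window_seconds:
            start += 1
        if end - start + 1 > max_boots:
            return True
    return False
```

With a one-hour window and a limit of 3, four boots ten minutes apart would be flagged, while one boot every two hours would not.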

Wim



Wim