Re: What is allowed ?

 
Wim Van den Wyngaert
Honored Contributor

What is allowed ?

I have a discussion with my management.

The question is "what interventions are allowed during working hours on a critical production system".

To make it more specific: it is an inter-building configuration of 2 GS160s, each with 2 QBBs, forming 2 clusters (each cluster uses 50% of each GS160). Dual HSG80 controllers.

At the application level, 1 node can be brought down in one cluster, but in the other cluster this must be avoided. So an intervention on the node that is down can have an impact on the other node in that GS160, which belongs to the other cluster.

Your opinions please ...

Wim
30 REPLIES
labadie_1
Honored Contributor

Re: What is allowed ?

I would say that, as long as it has NO impact on the production, all is allowed. So speaking simply, very little will be allowed.

I worked on a site (a bank) where new stuff (for example, a simple modified .com procedure submitted several times a day) would be put into production only after at least 3 days of testing on a test system, and not on Friday (as it may impact the weekend batches), not on Thursday (as it may impact Friday), but only Monday to Wednesday.

It is redundant to say that the production is important and must not be impacted, no ?
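
The bank's rule described above is simple enough to write down as a check. This is only a minimal sketch in Python: the function name `may_deploy` and the constants are made up for illustration, and it assumes "3 days test" means three full days on the test system.

```python
from datetime import date

# Assumption: weekday() numbering, Monday = 0 ... Sunday = 6.
ALLOWED_WEEKDAYS = {0, 1, 2}   # Monday, Tuesday, Wednesday only
MIN_TEST_DAYS = 3              # at least 3 days on the test system


def may_deploy(deploy_day: date, days_tested: int) -> bool:
    """Return True if a change may go into production on deploy_day."""
    tested_long_enough = days_tested >= MIN_TEST_DAYS
    safe_weekday = deploy_day.weekday() in ALLOWED_WEEKDAYS
    return tested_long_enough and safe_weekday


if __name__ == "__main__":
    print(may_deploy(date(2024, 1, 1), 3))   # a Monday, tested 3 days
    print(may_deploy(date(2024, 1, 4), 5))   # a Thursday, tested 5 days
```

Both conditions have to hold, so a well-tested change still waits for the next Monday if it is ready on a Thursday.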

Uwe Zessin
Honored Contributor

Re: What is allowed ?

If you want to go safe - nothing is allowed. Even peeking into the system can have a fatal effect (there once was a bug in the SHOW CLUSTER utility that could cause a system crash).

Changing startup-related command procedures can have unwanted side effects, too. The only way to test them is to reboot. If you change them but delay the test, you might experience an unexpected reboot which _will_ test your changes at a time when you are not prepared.
.
Wim Van den Wyngaert
Honored Contributor

Re: What is allowed ?

For those with a GS160 too :
1) Do you allow a battery change in a dual controller (I saw both controllers cycle on a different type)
2) Do you allow hot swap of disks (no problems yet)
3) Do you allow interventions on tape drives or local SCSI controllers without stopping the whole GS160

And what else went wrong that was supposed to be no-risk ?
Wim
Jan van den Ende
Honored Contributor

Re: What is allowed ?

Wim,

Taking Uwe one step further:
If you want to be totally safe, then you should power down the system, and bury it in concrete.
But: somehow I do not think this will give a good return on investment.

Conclusion: what _HAS TO BE ALLOWED_ is the optimum (best try, or best guess perhaps) between risk and work.
Consider this: if you are not allowed anything, how will you notice, let alone repair, ANY kind of runaway resource use?
Remember: ALL software was ultimately human-made, and it runs on (if nothing else: because of quantum mechanics) somewhat imperfect hardware, so there ALWAYS will be a non-zero probability that at some time something WILL behave unplanned.
And THIS is where you need someone (NOT a drone, like the ones you indicated in a previous thread) to BE ABLE to notice, to actually DO take notice, to UNDERSTAND what is noticed, to have sufficient knowledge to JUDGE what is noticed, be able to DETERMINE (the need for?) a corrective action, and be able and ALLOWED to (timely) take the necessary action.

Perhaps we should pick another concept of quantum mechanics:
the very act of OBSERVING a system DOES influence that system, but without observation, most parameters of the system are UNDETERMINED.

THERE SIMPLY IS NO PERFECT SOLUTION!

And only someone who thoroughly knows the applicable business rules can define a configuration that comes close to meeting them.
The price will always be way above budget, and then there will have to be an evaluation of cost vs risk.

It is up to you, and to your professional qualities, to specify WHAT can be done, at WHAT cost. It is up to you to make fully clear to responsible management WHAT the variables in this are, and HOW they influence one another (and I know full well that this is the most difficult part!).
And then, it is up to them to decide WHAT they want to spend, RESULTING in what risks THEY accept.

In your configuration, doing maintenance on one node of one cluster, possibly interfering with the operation of the second cluster, may not be the best solution.
Then again: if there is a sufficient amount of non-"working hours", then you may have to decide that maintenance will have to be done during those hours.
Which of course will influence the cost: back to square one.

But THE REAL ISSUE in our work will be to have management KNOW in advance WHAT they get, and what they WILL NOT get. And there, as in any insurance issue, it will always be too expensive, until it is too late, then there will be too little coverage.

The whole issue boils down to finding the right balance, but only very few people are true equilibrists....


"May you live in interesting times"


fwiw,


Jan
Don't rust yours pelled jacker to fine doll missed aches.
Jan van den Ende
Honored Contributor

Re: What is allowed ?

Wim,



1) Do you allow a battery change in a dual controller (I saw both controller cycle on a different type)
2) Do you allow hot swap of disks (no problems yet)
3) Do you allow interventions on tape drives or local scsi controllers without stopping the whole


We obviously are running different environments:
Your questions imply that you CAN also do these actions with systems down.

In our 365.25 * 24 configuration, we CAN take down some system, but only AFTER we make doubly sure that the redundant component(s) ARE able to take the full load. That includes reconfiguring in advance around the system/controller/switch/... that will temporarily leave.

For instance, changing batteries in a dual controller HAS to be done during operating hours, if only because the batteries do not last 7+ years.

In our case a major GOAL of dual (HSG), or triple (shadow sets) or even quadruple (nodes) redundancy IS to be able to do such hardware maintenance while remaining available (with sufficient capacity).
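
The "make doubly sure that the redundant component(s) ARE able to take the full load" step can be pictured as a simple capacity check before anything is taken down. A minimal sketch in Python follows. The node names, the capacity units, and the function `survives_removal` are hypothetical and only illustrate the idea; real capacity planning involves far more than one number per node.

```python
def survives_removal(capacities: dict, peak_load: float, leaving: str) -> bool:
    """Return True if the members that stay can carry peak_load alone.

    capacities: member name -> capacity, in any consistent unit (assumed)
    leaving:    the member about to be taken down for maintenance
    """
    if leaving not in capacities:
        raise KeyError(f"unknown member: {leaving}")
    remaining = sum(cap for name, cap in capacities.items() if name != leaving)
    return remaining >= peak_load


if __name__ == "__main__":
    # Hypothetical three-node cluster; the numbers are illustrative only.
    nodes = {"NODE_A": 100, "NODE_B": 100, "NODE_C": 60}
    for name in nodes:
        print(name, survives_removal(nodes, peak_load=170, leaving=name))
```

With these made-up numbers only NODE_C can leave at a peak load of 170, which is exactly the kind of answer you want before the intervention rather than during it.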


Jan
Don't rust yours pelled jacker to fine doll missed aches.
Willem Grooters
Honored Contributor

Re: What is allowed ?

Agreed with Jan - he's the true expert here...
But I would take it even further, though I don't think it will help you immediately: back to the point where it is decided to use IT (to assist in) running your business. It's THEN that a risk estimation should be made, to find out WHAT IF... Not only for unplanned downtime (AKA crashes, which, as Jan scientifically explained, simply can NOT be prevented) but for planned downtime as well, which can partly be determined from statistical evidence of the (non-)robustness and wear of hardware (MTBF, to name one), multiple usage (tape!) and software (OS, LP, applications), but also from (quite regular) monitoring of all components. Not to prevent all crashes, but most of them.
That will lead to the whole picture of the configuration, maintenance windows, risks, prevention - all the components of disaster tolerance and prevention.

That being said, it is obvious that monitoring the systems is simply a requirement - and MUST be allowed. System updates (OS, LP, applications) must indeed be tested, and indeed on a separate system to prevent interaction with the production system, but that will only prove that no problem exists ON THAT TEST ENVIRONMENT. It won't be the first time that installing updates on a production system causes a problem there - even if they have been tested thoroughly on another system (isn't that so, Jan?)

Luckily, I would say, it's a VMS environment.
Hopefully, fully redundant - in which case interference can be prevented or at least minimized. As Jan already pointed out (while I was typing this...)

Willem
Willem Grooters
OpenVMS Developer & System Manager
Uwe Zessin
Honored Contributor

Re: What is allowed ?

Jan, don't be silly...
we're talking about doing maintenance on a working system. A system that is powered down does not do any work. In that case you can do all maintenance that you want ;-)
.
Willem Grooters
Honored Contributor

Re: What is allowed ?

Uwe,

In a non-redundant environment you have to be aware that exchanging elements (whatever it may be) simply requires shutdown.
The more redundancy you build into your environment (based on your risk analysis, or just because you like it (and have the money)), the likelier it becomes that you can carry on uninterrupted while exchanging hardware and software. That's what VMS clustering (and NSK) is all about.
In Jan's case, it IS a running, 24x365.25 system. The way it is set up makes it possible to have at least ONE machine running all, or most, required applications while the other node(s) are being maintained. That's why ALL hardware and software could be upgraded without shutting down the cluster for over 7 years. I guess that's what he meant to say: have ONE machine down and let the other continue - in Wim's case: the halves on one side of the two clusters. The question is - and that's a matter of risk analysis - whether the resulting single-node cluster is safe enough.
Willem Grooters
OpenVMS Developer & System Manager
Wim Van den Wyngaert
Honored Contributor

Re: What is allowed ?

Ok, this is all theory. But what do YOU allow on your systems ?

I have seen "no risk" interventions end badly :

1) controller battery change : disks thrown out of the shadow set, controller bugcheck, controller power cycle (I know, shadow_mbr_tmo was 2 minutes and the controller gives the technician 3 minutes to do the intervention)

2) controller boot : disk thrown out of the shadow set

3) fiber change : the new one didn't work, and when the old one was put back, it didn't work either

4) power cycle and disks get broken

etc
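
The battery-change case in 1) comes down to comparing two durations. Here is a minimal sketch in Python of that comparison, assuming (as the post implies) that a shadow set member which stays unreachable longer than shadow_mbr_tmo is dropped from the set, and assuming the worst case where the technician uses the whole window. The function name `member_survives` and the `margin_s` parameter are made up for illustration.

```python
def member_survives(shadow_mbr_tmo_s: int, outage_s: int, margin_s: int = 0) -> bool:
    """Return True if the member timeout covers the outage plus a safety margin."""
    if min(shadow_mbr_tmo_s, outage_s, margin_s) < 0:
        raise ValueError("durations must be non-negative")
    return shadow_mbr_tmo_s >= outage_s + margin_s


TIMEOUT_S = 2 * 60          # shadow_mbr_tmo from the post: 2 minutes
BATTERY_WINDOW_S = 3 * 60   # time the controller gives the technician: 3 minutes

if __name__ == "__main__":
    print(member_survives(TIMEOUT_S, BATTERY_WINDOW_S))   # prints False
```

With the values from the post the check fails, which matches the disks being thrown out of the shadow set. A longer timeout only helps if nothing else goes wrong during the intervention.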
Wim