Re: What is allowed ?

 
Uwe Zessin
Honored Contributor

Re: What is allowed ?

Willem,

I have been managing VMS clusters since 1986 - I know what redundancy is!

Talking about powering a system down and burying it in concrete as a response to my comment - I don't feel I am being taken seriously.
Uwe Zessin
Honored Contributor

Re: What is allowed ?

Well, Wim, doesn't that prove my point? Even interventions that seem without risk can go wrong for various reasons.

I have never worked in an environment that was as demanding as yours, but we didn't have any detailed rules about what was allowed and what was not. It was decided case by case, and the user base was much smaller. You have a highly complex system, and I don't think you will ever be able to make a final list.
labadie_1
Honored Contributor

Re: What is allowed ?

Wim

At one site, we did two controller battery changes on an HSD. The first one went fine, but the Compaq guy who did the second was not as good, and needed more than 2 minutes (in fact about 20 minutes :-( ).

So a controller battery change must be seen as a dangerous operation.

I think the simplest attitude is to plan any change for a time when there is no activity and we have some free time to repair. Yes, this often means coming in on Saturday afternoon or evening, working a good part of the night, and having a complete Sunday if something goes wrong.
Martin P.J. Zinser
Honored Contributor

Re: What is allowed ?

Hello Wim,

we have a simple rule: You do not have any planned maintenance on the HW during production hours.

OTOH we are not yet trading 24h a day, so we have a window to do what we need to do.

Greetings, Martin
Jan van den Ende
Honored Contributor

Re: What is allowed ?

Uwe,

"Talking about powering a system down and burying it in concrete as a response to my comment - I don't feel I am being taken seriously."

- I did not in the least intend not to take you seriously, far from it.

Actually, it was just a blind repeat of the opening line of various (training & symposium) sessions on the broad subject, both from my former life in chemistry and from IT.

Actually, it is more or less the description of what you WOULD need if you ever intended to specify a system for Orange Book "A" certification.

Maybe you should take it as a description of the utter limit, which immediately shows its implied irrelevance.



Wim.

on changing HSx batteries:
We had dual-redundant HSZ40's, and now have dual-redundant HSG80's.... on redundant sites.
HBVS over both.

_IF_ on changing the batteries of one HS on one site, SOMEHOW things get messed up, and both leave service (has happened to us also, yes), THEN we fall back to reduced shadow sets from one site only, and the penalty will be shadow merge (finally, Engineering, THANKS for HBMM... as soon as we hear enough from others to dare).
BUT: bottom line: we do NOT LOSE service. (And next time we ask for a more experienced engineer!)


Jan
Don't rust yours pelled jacker to fine doll missed aches.
Lawrence Czlapinski
Trusted Contributor

Re: What is allowed ?

Wim, we run 24X7.
First rule is to keep the systems and applications running.
Second rule is, if something goes wrong, to notify the appropriate persons and get things working as soon as feasible. A lot depends on the system admins and management.
Our team of 2 system admins has a lot of leeway. Also our applications people have a lot of leeway.
1. What is not allowed?
A. Doing most hardware maintenance. We don't change live network connections connected to production systems unless it is needed to attempt fixing a critical problem.
On some sites, developers aren't allowed to make changes on a production system except during approved maintenance.
Exceptions:
We can swap an external tape drive. (This normally doesn't make things any worse. The tape drive is down already. This is also an advantage of external tape drives.)
We can swap a failed member of a non-raid shadowed disk drive. (Admin still has to ensure that the right disk is swapped properly.)
B. Software changes to a running system including procedures can be disallowed or restricted depending on system and amount of risk. (See what is allowed below for some possibilities.)

2. What is allowed? That depends a lot on system admin and management. It can vary from system to system. At our 2 sites, we have a lot of leeway in making changes as long as we don't impact production.
A. At a minimum, monitoring of the clusters/systems. You need to discover if something has gone wrong with the nodes or with the network and report it to the proper person(s). I call the main person(s) and send an email out to others.
I have AVAIL_MAN monitoring all 8 of our critical production standalone VMS systems and 1 critical production cluster. They are at 2 sites. I have it enabled to be able to fix problems on the nodes. I also use the MONITOR utility.
B. Use of console manager if a problem occurs.
C. System procedures can be changed. (Admin must assess the amount of risk. If there is too much risk of impacting a production system, don't do it.) Design procedures modularly so that they can be tested, preferably on a non-production system and preferably without having to reboot. This can depend on the amount of confidence or prior experience management has with the system admin. At some sites, the system manager may need to get approval.
D. Approved personnel (this can be software configuration management or a developer or someone else, depending on your company) can make changes on production systems. At one site, I had to get signed approvals prior to making changes to a live production system. Developers may or may not need prior approval to work on a live production system. Rules for this can be stringent or more relaxed. (Developers must assess the amount of risk. If desired, developers must log changes to production systems or go through a configuration management procedure.)
Lawrence
Willem Grooters
Honored Contributor

Re: What is allowed ?


Rules for this can be stringent or more relaxed. (Developers must assess the amount of risk. If desired, developers must log changes to production systems or go through a configuration management procedure.)


If mission critical, there is NO option other than going through a change management procedure. NO RELAXED RULES.

I'm working on both sides of the fence, so I _know_ that developers can best be kept miles away from production systems. For very experienced maintenance programmers - which is, IMHO, a separate discipline in programming and system development - you may once in a while need a one-time exception for those cases where analysis of a problem requires access because it cannot be done otherwise (and that DOES happen). But be VERY reluctant. It's up to them to prove they need access - and to prove they (and their tools) can be trusted in a production environment.

Willem
Willem Grooters
OpenVMS Developer & System Manager
Zahid Ghani
Frequent Advisor

Re: What is allowed ?

Wim
Another very interesting Topic!
I am sure most people, if not all, have Change Management of some sort. There are some golden rules that I follow.
1. Carry out a risk analysis - is the change included in the list of actions that can be carried out with full cluster/site down/node down?
2. Spell out the risks (for the benefit of management and clients).
3. What are the regression paths?
4. Plan the work and have cut-off points.
5. Make sure management is aware of the risks and signs off the work. We use a change management form.

Like many of you, I have had bad experiences where a change that was deemed perfectly safe turned out not to be. Like moving a monitor from the top of a server, which caused the node to crash.

I would be interested to know if people categorise their risks like us, and what they have in each category.
Category A- Full cluster shutdown
B- Site Shutdown
C- Node shutdown
D- Cluster up but no users (applications down)

Zahid
Ian Miller.
Honored Contributor

Re: What is allowed ?

change management so at least everyone knows who is doing what and when even if the approvers don't understand what you are doing :-)

It's all about management of risk. What's the risk if you don't do X, and what's the risk if you do? Pre-change testing, careful procedures, pre-determined procedures to back out the change and to deal with things that go wrong (because sometimes they will). Exactly what is allowed and what is not is very dependent on the system setup, applications, required availability and local politics.
____________________
Purely Personal Opinion
Zahid Ghani
Frequent Advisor

Re: What is allowed ?

Ian
"change management so at least everyone knows who is doing what and when even if the approvers don't understand what you are doing :-)"

I see you've dealt with similar managers. But seriously, if change approvers put their mark on a piece of paper and don't understand it, then it's their lookout. If things go wrong, you have someone to point to.
My attitude is that if I have carried out the risk analysis to the best of my ability, then that's all one can do. If there are lessons to be learnt, then fair enough - amend the procedures for the future.

I agree generalising the risk categories is site dependent, BUT there are some things that are common. Also, someone might have had a different experience with something that is expected to be safe. Sharing those experiences might save someone from a 'bad day'.
Zahid