Re: What is allowed ?

Wim Van den Wyngaert · ‎09-28-2004

I have a discussion with my management.

The question is "what interventions are allowed during working hours on a critical production system".

To make it more specific: it is an inter-building configuration of 2 GS160s, each with 2 QBBs, that form 2 clusters (each cluster uses 50% of each GS160). Dual HSG80.

At the application level, 1 node can be brought down in one cluster, but in the other cluster this must be avoided. Since each GS160 hosts one node of each cluster, an intervention on the node that is down can have an impact on the other node in that GS160.

Your opinions please ...

Wim


labadie_1 · ‎09-28-2004

I would say that, as long as it has NO impact on the production, all is allowed. So speaking simply, very little will be allowed.

I worked at a site (a bank) where anything new (a simple modified .com that is submitted several times a day, for example) would be put into production only after at least 3 days of testing on a test system, and not on Friday (as it may impact the weekend batches), not on Thursday (as it may impact Friday), but only Monday to Wednesday.
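To make that rule concrete, here is a minimal sketch in Python. It is not the bank's actual tooling; the function and constant names are invented for illustration, and it assumes "3 days test" means calendar days between the start of testing and the deployment date.

```python
from datetime import date

# Assumption: "at least 3 days test" is counted in calendar days.
MIN_TEST_DAYS = 3
# date.weekday(): Monday=0, Tuesday=1, Wednesday=2
ALLOWED_WEEKDAYS = {0, 1, 2}


def may_deploy(test_started: date, deploy_on: date) -> bool:
    """Return True if a change may go to production on deploy_on."""
    tested_days = (deploy_on - test_started).days
    tested_long_enough = tested_days >= MIN_TEST_DAYS
    allowed_day = deploy_on.weekday() in ALLOWED_WEEKDAYS
    return tested_long_enough and allowed_day
```

With this sketch, a change tested since a Friday may go in on the following Tuesday, but never on a Thursday or Friday, however long it was tested.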

It is redundant to say that the production is important and must not be impacted, no ?

Uwe Zessin · ‎09-28-2004

If you want to go safe - nothing is allowed. Even peeking into the system can have a fatal effect (there once was a bug in the SHOW CLUSTER utility that could cause a system crash).

Changing startup-related command procedures can have unwanted side effects, too. The only way to test them is to reboot. If you make the change but delay the test, you might experience an unexpected reboot which _will_ test your changes at a time when you are not prepared.

.

Wim Van den Wyngaert · ‎09-28-2004

For those with a GS160 too :
1) Do you allow a battery change in a dual controller? (I saw both controllers cycle on a different type.)
2) Do you allow hot swap of disks? (No problems yet.)
3) Do you allow interventions on tape drives or local SCSI controllers without stopping the whole GS160?

And what else went wrong that was supposed to be no-risk?

Wim

Jan van den Ende · ‎09-28-2004

Wim,

Taking Uwe one step further:
If you want to be totally safe, then you should power down the system, and bury it in concrete.
But: somehow I do not think this will give a good return on investment.

Conclusion: what _HAS TO BE ALLOWED_ is the optimum (best try, or best guess perhaps) between risk and work.
Consider this: if you are not allowed anything, how will you notice, let alone repair, ANY kind of runaway resource use?
Remember: ALL software was ultimately human-made, and it runs on (if nothing else: because of quantum mechanics) somewhat imperfect hardware, so there ALWAYS will be a non-zero probability that at some time something WILL behave unplanned.
And THIS is where you need someone (NOT a drone, like the ones you indicated in a previous thread) to BE ABLE to notice, to actually TAKE notice, to UNDERSTAND what is noticed, to have sufficient knowledge to JUDGE what is noticed, to be able to DETERMINE (the need for?) a corrective action, and to be able and ALLOWED to take the necessary action in time.

Perhaps we should pick another concept of quantum mechanics:
the very act of OBSERVING a system DOES influence that system, but without observation, most parameters of the system are UNDETERMINED.

THERE SIMPLY IS NO PERFECT SOLUTION!

And only someone who thoroughly knows the applicable business rules can define a configuration that comes close to meeting them.
The price will always be way above budget, and then there will have to be an evaluation of cost vs risk.

It is up to you, and to your professional qualities, to specify WHAT can be done, at WHAT cost. It is up to you to make fully clear to responsible management WHAT the variables in this are, and HOW they influence one another (and I know full well that this is the most difficult part!).
And then, it is up to them to decide WHAT they want to spend, RESULTING in what risks THEY accept.

In your configuration, doing maintenance on one node of one cluster, possibly interfering with the operation of the second cluster, may not be the best solution.
Then again: if there is a sufficient amount of non-"working hours", then you may have to decide that maintenance will have to be done during those hours.
Which of course will influence the cost: back to square one.

But THE REAL ISSUE in our work will be to have management KNOW in advance WHAT they get, and what they WILL NOT get. And there, as in any insurance issue, it will always be too expensive, until it is too late, then there will be too little coverage.

The whole issue boils down to finding the right balance, but only very few people are true equilibrists....

"May you live in interesting times"

fwiw,

Jan

Don't rust yours pelled jacker to fine doll missed aches.

Jan van den Ende · ‎09-28-2004

Wim,

1) Do you allow a battery change in a dual controller (I saw both controller cycle on a different type)
2) Do you allow hot swap of disks (no problems yet)
3) Do you allow interventions on tape drives or local scsi controllers without stopping the whole

We obviously are running different environments:
Your questions imply that you CAN also do these actions with systems down.

In our 365.25 * 24 configuration, we CAN take down some system, but only AFTER we make doubly sure that the redundant component(s) ARE able to take the full load. That includes reconfiguring in advance around the system/controller/switch/.... that is temporarily leaving.
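The "make doubly sure" check boils down to simple arithmetic. A minimal sketch in Python, with made-up component names and capacity units (nothing here comes from a real monitoring tool):

```python
def can_take_down(capacity_by_component: dict[str, float],
                  current_load: float,
                  leaving: str) -> bool:
    """Return True if the remaining components can carry the full load.

    capacity_by_component: hypothetical capacity figure per redundant
    component (node, controller, switch, ...), all in the same unit.
    current_load: total load that must stay serviced, same unit.
    leaving: the component that will temporarily leave service.
    """
    if leaving not in capacity_by_component:
        raise KeyError(f"unknown component: {leaving}")
    remaining = sum(cap for name, cap in capacity_by_component.items()
                    if name != leaving)
    return remaining >= current_load
```

For example, with two components of capacity 100 each and a load of 90, either one may leave; with a load of 120, neither may.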

For instance, changing batteries in a dual controller HAS to be done during operating hours, if only because the batteries do not last 7+ years.

In our case a major GOAL of dual (HSG), or triple (shadow sets) or even quadruple (nodes) redundancy IS to be able to do such hardware maintenance while remaining available (with sufficient capacity).

Jan

Don't rust yours pelled jacker to fine doll missed aches.

Willem Grooters · ‎09-28-2004

Agreed with Jan - he's the true expert here...
But I would take it even further - not that it would help you immediately, I think - back to the point where it is decided to use IT (to assist in) running your business. It's THEN that a risk estimation should be made, to find out WHAT IF... Not only for unplanned downtime (AKA crashes - which, as Jan scientifically explained, simply can NOT be prevented) but for planned downtime as well, which can, partly, be determined based on statistical evidence of (non-)robustness and wear of hardware (MTBF, to name one), multiple usage (tape!) and software (OS, LP, applications), but also on (quite regular) monitoring of all components. Not to prevent all crashes, but most of them.
That will lead to the whole picture of the configuration, maintenance windows, risks, prevention - all the components of disaster tolerance and disaster prevention.

That being said, it is obvious that monitoring the systems is simply a requirement - and MUST be allowed. System updates (OS, LP, applications) must indeed be tested, and indeed on a separate system to prevent interaction with a production system, but that will only prove that no problem exists ON THAT TEST ENVIRONMENT. It won't be the first time that installing updates on a production system causes a problem there - even if they have been tested thoroughly on another system (right, Jan?)

Luckily, I would say, it's a VMS environment.
Hopefully, fully redundant - in which case interference can be prevented or at least minimized. As Jan already pointed out (while I was typing this...)

Willem

Willem Grooters
OpenVMS Developer & System Manager

Uwe Zessin · ‎09-28-2004

Jan, don't be silly...
we're talking about doing maintenance on a working system. A system that is powered down does not do any work. In that case you can do all maintenance that you want ;-)

.

Willem Grooters · ‎09-29-2004

Uwe,

In a non-redundant environment you have to be aware that exchanging elements (whatever it may be) simply requires shutdown.
The more redundancy you build into your environment (based on your risk analysis, or just because you like it (and have money)), the likelier it becomes that you can carry on uninterrupted while exchanging hardware and software. That's what VMS clustering (and NSK) is all about.
In Jan's case, it IS a running, 24x365.25 system. The way it is set up makes it possible to have at least ONE machine running all, or most, required applications while other node(s) are being maintained. That's why ALL hardware and software could be upgraded without closing down the cluster for over 7 years. I guess that's what he meant to say: have ONE machine down and let the other continue - in Wim's case: the one-side halves of two clusters. The question is - and that's a matter of risk analysis - whether that then single-node cluster is safe enough.

Willem Grooters
OpenVMS Developer & System Manager

Wim Van den Wyngaert · ‎09-29-2004

OK, this is all theory. But what do YOU allow on your systems?

I have seen "no-risk" interventions end badly:

1) controller battery change: disks thrown out of the shadow set, controller bugcheck, controller power cycle (I know, shadow_mbr_tmo was 2 minutes and the controller gives the technician 3 minutes to do the intervention)

2) controller boot : disk thrown out of the shadow set

3) fiber change: the new one didn't work, and when the old one was put back, it didn't work either

4) power cycle and disks get broken

etc
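The mismatch in case 1 is easy to state as a check. A minimal sketch in Python, assuming the shadow member timeout is expressed in seconds and that a member which stays unreachable for longer than that timeout is removed from the shadow set; the function name is invented for illustration:

```python
def member_survives(outage_seconds: int, shadow_mbr_tmo_seconds: int) -> bool:
    """Return True if the member is back before the shadowing timeout.

    Assumption: a member unreachable for longer than the timeout is
    thrown out of the shadow set and needs a copy to rejoin.
    """
    return outage_seconds <= shadow_mbr_tmo_seconds


# The case above: the controller allows the technician 3 minutes,
# while the shadowing timeout was 2 minutes.
timeout = 2 * 60
allowed_window = 3 * 60
full_window_is_safe = member_survives(allowed_window, timeout)
```

Here `full_window_is_safe` comes out False: a technician who uses the whole 3-minute window will outlast the 2-minute timeout, so either the timeout has to be raised beforehand or the job has to be finished well inside it.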

Wim

Uwe Zessin · ‎09-29-2004

Willem,

I have been managing VMS clusters since 1986 - I know what redundancy is!

Talking about powering a system down and burying it in concrete as a response to my comment - I don't feel like treated serious.

.

Uwe Zessin · ‎09-29-2004

Well, Wim, doesn't that prove my point? Even interventions that seem without risk can go wrong for various reasons.

I have never worked in an environment as demanding as yours, but we didn't have any detailed rules about what was allowed and what was not. It was decided on a case-by-case basis, and the user base was much smaller. You have a highly complex system and I don't think you will ever be able to make a final list.

.

labadie_1 · ‎09-29-2004

Wim

At one site, we did 2 controller battery changes on an HSD. The first one went fine, but the Compaq guy who did the second was not as good, and needed more than 2 minutes (in fact about 20 minutes :-( ).

So a controller battery change must be seen as a dangerous operation.

I think the simplest attitude is to plan any change for a time when there is no activity and we have some free time to repair. Yes, this often means coming in on Saturday afternoon or evening, working a good part of the night, and having a complete Sunday left if something goes wrong.

Martin P.J. Zinser · ‎09-29-2004

Hello Wim,

we have a simple rule: you do not have any planned maintenance on the HW during production hours.

OTOH we are not yet trading 24h a day, so we have a window to do what we need to do.

Greetings, Martin

Jan van den Ende · ‎09-29-2004

Uwe,

" Talking about powering a system down and burying it in concrete as a response to my comment - I don't feel like treated serious. "

- I did not in the least mean to not take you seriously, far from it.

Actually, it was just a blind repeat of the opening line of various (training & symposium) sessions on the broad subject, both from my former life in chemistry and from IT.

Actually, it is more or less the description of what you WOULD need if you ever intended to specify a system for Orange Book "A" certification.

Maybe you should take it as description of the utter limit, immediately showing the implied irrelevance.

Wim,

on changing HSx batteries:
We had dual-redundant HSZ40's, and now have dual-redundant HSG80's.... on redundant sites.
HBVS over both.

_IF_, on changing the batteries of one HS on one site, things SOMEHOW get messed up and both controllers leave service (has happened to us also, yes), THEN we fall back to reduced shadow sets from one site only, and the penalty will be a shadow merge (finally, Engineering, THANKS for HBMM... as soon as we hear enough from others to dare)
BUT, bottom line: we do NOT LOSE service. (And next time we ask for a more experienced engineer!)

Jan

Don't rust yours pelled jacker to fine doll missed aches.

Lawrence Czlapinski · ‎09-29-2004

Wim, we run 24X7.
First rule is to keep the systems and applications running.
Second rule is if something goes wrong to notify the appropriate persons and get things working as soon as feasible. A lot depends on the system admins and management.
Our team of 2 system admins has a lot of leeway. Also our applications people have a lot of leeway.
1. What is not allowed?
A. Doing most hardware maintenance. We don't change live network connections connected to production systems unless it is needed to attempt fixing a critical problem.
On some sites, developers aren't allowed to make changes on a production system except during approved maintenance.
Exceptions:
We can swap an external tape drive. (This normally doesn't make things any worse. The tape drive is down already. This is also an advantage of external tape drives.)
We can swap a failed member of a non-raid shadowed disk drive. (Admin still has to ensure that the right disk is swapped properly.)
B. Software changes to a running system including procedures can be disallowed or restricted depending on system and amount of risk. (See what is allowed below for some possibilities.)

2. What is allowed? That depends a lot on system admin and management. It can vary from system to system. At our 2 sites, we have a lot of leeway in making changes as long as we don't impact production.
A. At a minimum, monitoring of the clusters/systems. You need to discover if something has gone wrong with the nodes or with the network and report it to the proper person(s). I call the main person(s) and send an email out to others.
I have AVAIL_MAN monitoring all 8 of our critical production standalone VMS systems and 1 critical production cluster. They are at 2 sites. I have it enabled to be able to fix problems on the nodes. I also use the MONITOR utility.
B. Use of console manager if a problem occurs.
C. System procedures can be changed. (Admin must assess the amount of risk. If there is too much risk of impacting a production system, don't do it.) Design procedures modularly so that they can be tested, preferably on a non-production system and preferably without having to reboot. This can depend on the amount of confidence or prior experience management has with the system admin. At some sites, the system manager may need to get approval.
D. Approved personnel (this can be software configuration management or a developer or someone else, depending on your company) can make changes on production systems. At one site, I had to get signed approvals prior to making changes to a live production system. A developer may or may not need prior approval to work on a live production system. Rules for this can be stringent or more relaxed. (Developers must assess the amount of risk. If desired, developers must log changes to production systems or go through a configuration management procedure.)
Lawrence

Willem Grooters · ‎09-29-2004

Rules for this can be stringent or more relaxed. (Developers must assess the amount of risk. If desired, developers must log changes to production systems or go through a configuration management procedure.)

If mission critical, there is NO option other than to go through a change management procedure. NO RELAXED RULES.

I work on both sides of the fence, so I _know_ that developers are best kept miles away from production systems. For very experienced maintenance programmers - which is, IMHO, a separate discipline in programming and system development - you may once in a while need a one-time exception for those cases where analysis of a problem requires access because it cannot be done otherwise (and that DOES happen). But be VERY reluctant. It's up to them to prove they need access - and to prove they (and their tools) can be trusted in a production environment.

Willem

Willem Grooters
OpenVMS Developer & System Manager

Zahid Ghani · ‎09-29-2004

Wim
Another very interesting Topic!
I am sure most people if not all have Change Management of some sorts. There are some golden rules that I follow.
1. Carry out a risk analysis - is the change included in the list of actions that can be carried out with full cluster/site down/node down?
2. Spell out the risks (for the benefit of management and clients).
3. What are the regression paths?
4. Plan the work and have cut-off points.
5. Make sure management is aware of the risks and signs off the work. We use a change management form.

Like many of you, I have had bad experiences where a change that was deemed perfectly safe turned out not to be. For example, moving a monitor from the top of a server caused the node to crash.

I would be interested to know whether people categorise their risks like us, and what they have in each category.
Category A- Full cluster shutdown
B- Site Shutdown
C- Node shutdown
D- Cluster up but no users (applications down)
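As a sketch of how such a categorised list could be kept and queried (step 1 of the rules above), here is a small Python example. The four categories are the ones listed; the entries in the lookup table are purely hypothetical placeholders, not recommendations and not our actual list.

```python
from enum import Enum
from typing import Optional


class RiskCategory(Enum):
    A = "Full cluster shutdown"
    B = "Site shutdown"
    C = "Node shutdown"
    D = "Cluster up but no users (applications down)"


# Hypothetical entries, for illustration only; every site has to
# fill in its own list from its own risk analysis.
REQUIRED_STATE = {
    "example change needing whole cluster down": RiskCategory.A,
    "example change on a single node": RiskCategory.C,
}


def required_state(change: str) -> Optional[RiskCategory]:
    """Return the agreed category, or None if the change is not listed.

    None means: not on the list yet, so do a fresh risk analysis
    before planning the work.
    """
    return REQUIRED_STATE.get(change)
```

The useful part is the `None` case: anything not already on the list cannot be waved through, it has to go back to risk analysis first.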

Zahid

Ian Miller. · ‎09-29-2004

change management so at least everyone knows who is doing what and when even if the approvers don't understand what you are doing :-)

It's all about management of risk. What's the risk if you don't do X, and what's the risk if you do? Pre-change testing, careful procedures, pre-determined procedures to back out the change and to deal with things that go wrong (because sometimes they will). Exactly what is allowed and what is not is very dependent on the system setup, applications, required availability and local politics.

____________________
Purely Personal Opinion

Zahid Ghani · ‎09-29-2004

Ian
"change management so at least everyone knows who is doing what and when even if the approvers don't understand what you are doing :-)"

I see you've dealt with similar managers. But seriously, if change approvers put their mark on a piece of paper and don't understand it, then it's their lookout. If things go wrong you have someone to point to.
My attitude is that if I have carried out the risk analysis to the best of my ability, then that's all one can do. If there are lessons to be learnt then fair enough - amend the procedures for the future.

I agree that generalising the risk categories is site dependent, BUT there are some things that are common. Also, where something is expected to be safe, someone else may have had a different experience. Sharing those experiences might save someone from a 'bad day'.
Zahid

Jan van den Ende · ‎09-29-2004

Zahid,

Category A- Full cluster shutdown

it shows again how different things can be!

That for us is truly the one REAL oh-no!

The way we implemented change ability is to provide Rolling Updates for (nearly) everything.

About _THE_ most important tool for this is simply part of VMS: Concealed Devices.

Any applic has (at least) 3 different concealed devices (_ROOTs):
ONE _ROOT for the "program" environment. It includes executables, procedures, steering params, etc.: together, the stuff that may change from version to version, but is expected to remain unchanged for a specific version.
(at least) ONE _ROOT for "the database".
Taken to be the "data", i.e., the files, tables, whatever, that remain, although the activity of the applic modifies the contents.
(at least) ONE _ROOT for (temporary) workfiles, LOG files etc.

Let us say an application needs a version upgrade:

We pick one cluster node to do the upgrading.
As all applications have their own DNS service name, it is rather simple to define that applic as NOT pointing to the upgrading node anymore. Since users have a logon time limit of 10 hours, the next day no users are connected to the applic on that node (check, of course). They can keep using the applic on the other nodes.
Now we define a new _ROOT for the new version of the applic program environment.

Copy the existing version to the new ROOT, and upgrade that to the new version of the applic.

Of the database root(s) there usually exists a TEST version as well as PROD; it is up to applic management to point out if this suffices for evaluation, or maybe a new full copy of PROD is needed.
Then the new version of the applic is configured to use the test data, and access to that combination is granted according to specs of applic mgt.
When _THEY_ are satisfied with the upgrade, a moment for production changeover is determined (usually coincident with the beginning of daily backup, when SLAs do allow a window of potentially reduced service). Changeover implies no more than the redefinition of applic_prod_ROOT from applic_oldversion_ROOT to applic_newversion_ROOT, either on the upgrading node together with a redefinition of the DNS service node, or clusterwide.
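A toy model of that changeover, in Python rather than DCL, just to show the indirection. The device and directory strings are invented for illustration, and on a real system this is done by redefining logical names, not by editing a dictionary.

```python
def translate(logicals: dict[str, str], name: str, max_depth: int = 10) -> str:
    """Follow name -> equivalence until something is no longer a logical.

    max_depth is only a guard against definition loops in this toy model.
    """
    depth = 0
    while name in logicals:
        if depth >= max_depth:
            raise RuntimeError(f"too many translations for {name}")
        name = logicals[name]
        depth += 1
    return name


# Hypothetical definitions: two version roots and one "production" pointer.
logicals = {
    "APPLIC_OLDVERSION_ROOT": "DISK1:[APPLIC.V1.]",
    "APPLIC_NEWVERSION_ROOT": "DISK1:[APPLIC.V2.]",
    "APPLIC_PROD_ROOT": "APPLIC_OLDVERSION_ROOT",
}

before = translate(logicals, "APPLIC_PROD_ROOT")

# The changeover itself: one redefinition, nothing is copied or moved.
logicals["APPLIC_PROD_ROOT"] = "APPLIC_NEWVERSION_ROOT"

after = translate(logicals, "APPLIC_PROD_ROOT")
```

Backing out is the same single redefinition in the other direction, which is a large part of why this kind of changeover fits in a short window.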

Usually the applic_TEST_DATA_ROOT stays available, but only to a limited set of users.
If applic mgt or EDUCATION department so desire, it is very easy to have another data ROOT, applic_TRAIN_DATA_ROOT.
For one applic that requests special training before allowing use, there exist a full-blown set of data roots, that reflects the course that is given on a regular basis.
At request of EDUCATION we clear away the applic_TRAIN_ROOTs, and copy the course start set back in. Any "stunts" in a previous course are gone, and the trainer knows exactly where he is starting from.

We were also able to use the same setup to quickly create a DEMO root with some fancy data, when it was decided to demonstrate a certain applic to colleagues from another region who were considering the app.

As indicated in earlier postings, we DO need and have a rather dynamic environment, and this is the kind of flexibility that allows us to provide that with relatively very little System Management staff.

.. and considering Change Approval: we pretend, and have management convinced, that WE are the ones that have by far the best ability and resources to decide WHAT changes are acceptable, and WHEN.

We earned their trust by specifying and realising our frame of operation:

Line ONE: We will NOT go down
Line two: For the users, the system is just a TOOL, any issue should NOT be theirs, but OURs.
Line three: Any issue should not occur, but be prevented

In living by these lines, we were able to re-specify our SLAs to be much more demanding than they were before (and we certainly did NOT specify anything we are not sure of; anything out of our control is explicitly excepted!)

And of course, we added a last line: Re-read line ONE

.. but every site is different.

Cheers.

Have one on me.

Jan

Don't rust yours pelled jacker to fine doll missed aches.

Wim Van den Wyngaert · ‎09-30-2004

Jan: losing a shadow member means no longer being disaster tolerant. For me that is to be avoided. Furthermore, the shadow copy must be done ASAP. An important application job failed because Sybase activities were too slow. Conclusion: MUST be avoided.

Change Management: I have yet to encounter one that functions better than a repeater.

My policy based upon experience :

NOTHING allowed except changing redundant disks (during low activity hours).
All other interventions : weekend.

But my Change Management decides ...

Wim


Uwe Zessin · ‎09-30-2004

Even swapping disks can be dangerous! One of my colleagues put a new disk (directly from the shipping carrier) into an HSG80-based storage system and the damn thing completely blocked the SCSI bus!

.

Lawrence Czlapinski · ‎09-30-2004

Wim, since you have weekends available to make changes, it is better to do the changes then.
On our critical production systems, we prefer to make changes during downtimes when feasible. However our production runs 24X7 and production downtimes can be infrequent.

Downtimes are usually due to hardware or network problems.
RAID CONTROLLERS: Murphy's law was in play in this situation. Have the maintenance company check the revision level prior to installation if at all possible. At one point we had the redundant RAID controller fail on the production system. A change to the backup system hardware, which also has redundant RAID controllers, was scheduled and done. One of the redundant RAID controllers on that system failed before the original RAID controller was fixed. The customer hadn't signed the maintenance agreement with the new maintenance company yet. Several replacement RAID controllers weren't at the latest firmware revision, which was needed. The firm supplying the replacement RAID controllers was supposed to check them before sending them to the maintenance company but hadn't. We tried them, but old-revision controllers won't recognize the 9.1 GB RZ1DB-VW disks. We finally got a RAID controller with the latest revision on the original system. Also, one of the disks in the redundant RAID set was bad. We weighed the risks and scheduled a changeover back to the original system hardware. For us, redundancy on the other logical drives outweighed the one bad disk.
For some reason the spare disk didn't work. A replacement disk was ordered and the hot swap worked. Then it had to be VMS-formatted for shadow disk use. Then the logical disk was brought back into the shadow set and shadow merged successfully.
We still had problems with the revision level of the RAID controller on the backup system. We finally got a RAID controller with the latest revision on that system and got it configured.
Sys Admins are responsible for keeping the production machines up and getting them back up as quickly as feasible if something goes wrong.
Our applications developers schedule changes with the respective managements for the affected systems.
Network Problems:
Historically some of our biggest problems have been network switch problems.
1. Last Thursday a Cisco Systems Catalyst 3550 (100 mb ethernet) switch hiccuped. It made 3 nodes unavailable to the users. Fortunately only one of the 3 nodes was a critical production system and it was a standalone system (usually not a plus). The nodes kept running but could only be accessed through Console Manager. I called my customer contact and called the network manager and left a voice mail on his cell phone. He called back and had me pull the plug on the switch. The switch came back up and network access was restored. Downtime was about 15 minutes.
2. Historically we had some bad network problems which would bring down one of our sites and our production clusters at that site. The solution was upgrading to a VLAN.
We have our nodes connected to switches. One thing we learned from one outage was to have all members of a cluster at a site on one switch, so that if another network switch caused a problem the cluster would at least stay up, even if it couldn't be accessed by users until the network was restored.
3. Another network problem we had was that connectivity between the two sites was sometimes lost. This affected users and AVAIL_MAN monitoring of the two sites. The non-VMS part of the network had to be restructured so that PC logins / authentications from the 2 sites could be done locally.
Lawrence

Jan van den Ende · ‎09-30-2004

Yes,

I had to admit to some deja-vu's reading Lawrence. And while he HAS some, if little, schedulable downtime, we in principle have NONE.
Wim: losing the HSZs on one site, so that those disks left the shadow sets, did NOT reduce us to one-member sets: the other site had two members (sheer luck). It IS one of the reasons why we have requested Engineering to allow 4 (or preferably, for 3-site configs: 6) members per set!

And networks: it indeed is a terrible degradation of reliability that those are, at most organisations, no longer under the control of VMS System Management!! They routinely think M$-style uptimes are good achievements.
And it took us some time (and some unfriendliness to colleagues!) to educate SAN management in the VMS ways and standards (it paid off: nowadays THEY are proud when a partial disturbance does NOT affect the users).
In an earlier posting I referred to our SLA, and the explicit exceptions therein.
Those exceptions explicitly state that if systems are not reachable by (part of) the user community (and they are spread over currently 58 buildings, up to 30 km from the computer centres; and some applics rely on online info from countrywide central systems, 100s of km away), and the unreachability is due to some network malfunction, then WE report full availability, i.e., conformance to SLA.
If network or desktop (most, but by no means all, users access VMS from the desktop) prohibit access for some, or most, users, then THEIR SLA is violated.
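In numbers, that accounting could look like the following Python sketch. The cause labels and the function are invented for illustration; our real SLA reporting is not this code.

```python
def vms_availability(period_minutes: float,
                     outages: list[tuple[float, str]]) -> float:
    """Availability as charged to the VMS systems' SLA.

    outages: (duration in minutes, cause) pairs, where cause is one of
    the hypothetical labels "system", "network" or "desktop".
    Only "system" outages count against the VMS SLA; the others are
    excepted and belong to the network or desktop SLA.
    """
    if period_minutes <= 0:
        raise ValueError("period_minutes must be positive")
    system_down = sum(minutes for minutes, cause in outages
                      if cause == "system")
    return 1.0 - system_down / period_minutes
```

So a 15-minute switch failure in a 1000-minute period leaves the VMS figure at 1.0, while a 10-minute system outage in the same period brings it down to 0.99, no matter what the user at the desktop experienced.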

Then again, if the average user has no access to his applic, then "THE computer fails again".
It took some time, but we HAVE lower and middle management understanding the difference.
And at that time we have UPPER-UPPER-UPPER management (ie, the government) decide that it is to be done ALL differently, "cheaper", and integrated, and they have delegated all technical detail to a "steering group", to whom VMS sounds like an ugly four-letter word.

(from Larry Niven): TANJ

(There Ain't No Justice)

---- Reading this back, it more or less sounds like I am writing off some frustrations. Maybe I am. I don't take it back. Sorry if I've been annoying you with my frustrations. :-(

Still:

Cheers

Have one on me.

Jan

Don't rust yours pelled jacker to fine doll missed aches.
