
Jan van den Ende
Honored Contributor

Re: What is allowed ?

Zahid,


Category A- Full cluster shutdown

it shows again how different things can be!

That, for us, is truly the one REAL oh-no!

The way we implemented changeability is to provide Rolling Updates for (nearly) everything.

Just about _THE_ most important tool for this is simply part of VMS: Concealed Devices.

Any applic has (at least) 3 different concealed devices (_ROOTs): ONE for the "program" environment. This includes executables, procedures, steering params, etc.: together, the stuff that may change from version to version but is expected to remain unchanged for a specific version.
(at least) ONE _ROOT for "the database", taken to be the "data", i.e., the files, tables, whatever, that remain, although the activity of the applic modifies their contents.
(at least) ONE _ROOT for (temporary) workfiles, LOG files, etc.
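
To give an idea, a minimal DCL sketch of such a concealed rooted logical name (device, directory, and logical names are invented for the example):

$ ! Program-environment root for (say) version 4.1 of the applic.
$ ! The trailing period inside the brackets makes it a rooted directory.
$ DEFINE /SYSTEM /EXEC /TRANSLATION_ATTRIBUTES=(CONCEALED,TERMINAL) -
  APPLIC_PROG_ROOT DKA100:[APPS.APPLIC.V041.]
$ ! The applic itself only ever refers to files below the root:
$ RUN APPLIC_PROG_ROOT:[BIN]APPLIC_MAIN.EXE

The same pattern serves the data and work roots; nothing in the applic needs to know which physical disk, or which version directory, hides behind the names.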

Let us say an application needs a version upgrade:

We pick one cluster node to do the upgrading.
As all applications have their own DNS service name, it is rather simple to define that applic as NOT pointing to the upgrading node anymore. Since users have a logon time limit of 10 hours, the next day no users are connected to the applic on that node (check, of course). They can keep using the applic on the other nodes.
Now we define a new _ROOT for the new version of the applic program environment.

Copy the existing version to the new ROOT, and upgrade that to the new version of the applic.

Of the database root(s) there usually exists a TEST version as well as PROD; it is up to applic management to point out whether this suffices for evaluation, or whether a fresh full copy of PROD is needed.
Then the new version of the applic is configured to use the test data, and access to that combination is granted according to specs of applic mgt.
When _THEY_ are satisfied with the upgrade, a moment for production changeover is determined (usually coincident with the beginning of the daily backup, when SLA's do allow a window of potentially reduced service). Changeover implies no more than the redefinition of applic_prod_ROOT from applic_oldversion_ROOT to applic_newversion_ROOT, either on the upgrading node together with a redefinition of the DNS service node, or clusterwide.
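
To make that concrete, a sketch of the clusterwide flavour (names invented again; the principle, not our literal procedures):

$ ! Clusterwide changeover: repoint the production root at the new version.
$ RUN SYS$SYSTEM:SYSMAN
SYSMAN> SET ENVIRONMENT /CLUSTER
SYSMAN> DO DEFINE /SYSTEM /EXEC /TRANSLATION_ATTRIBUTES=(CONCEALED,TERMINAL) APPLIC_PROD_ROOT DKA100:[APPS.APPLIC.V042.]
SYSMAN> EXIT

New sessions pick up the new version at once, and since the old version's root stays on disk, falling back is the same one-line redefinition in the other direction.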

Usually the applic_TEST_DATA_ROOT stays available, but only to a limited set of users.
If applic mgt or EDUCATION department so desire, it is very easy to have another data ROOT, applic_TRAIN_DATA_ROOT.
For one applic that requires special training before use is allowed, there exists a full-blown set of data roots that reflects the course that is given on a regular basis.
At the request of EDUCATION we clear away the applic_TRAIN_ROOTs and copy the course start set back in. Any "stunts" from a previous course are gone, and the trainer knows exactly where he is starting from.
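
The reset itself is hardly more than a restore into the training root; a sketch (the saveset name and master disk are invented):

$ ! Clear out whatever the previous course left behind ...
$ DELETE APPLIC_TRAIN_DATA_ROOT:[000000...]*.*;*   ! repeat until empty
$ ! ... and put the well-known course start set back.
$ BACKUP /LOG MASTER$DISK:[COURSES]APPLIC_START.BCK/SAVE_SET -
  APPLIC_TRAIN_DATA_ROOT:[000000...]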

We were also able to use the same setup to quickly create a DEMO root with some fancy data, when it was decided to demonstrate a certain applic to colleagues from another region who were considering the app.

As indicated in earlier postings, we DO need and have a rather dynamic environment, and this is the kind of flexibility that allows us to provide that with a relatively small System Management staff.

.. and considering Change Approval: we maintain, and have management convinced, that WE are the ones who have by far the best ability and resources to decide WHAT changes are acceptable, and WHEN.

We earned their trust by specifying and realising our frame of operation:

Line ONE: We will NOT go down
Line two: For the users, the system is just a TOOL, any issue should NOT be theirs, but OURs.
Line three: Any issue should not merely be handled when it occurs, but prevented

In living by these lines, we were able to re-specify our SLA's to be much more demanding than they were before (and we certainly did NOT specify anything we are not sure of; anything out of our control is explicitly excepted!)

And of course, we added a last line: Re-read line ONE


.. but every site is different.

Cheers.

Have one on me.

Jan

Don't rust yours pelled jacker to fine doll missed aches.
Wim Van den Wyngaert
Honored Contributor

Re: What is allowed ?

Jan: losing a shadow member means losing disaster tolerance. For me that is to be avoided. Furthermore, the shadow copy must be done ASAP. An important application job failed because Sybase activities were too slow. Conclusion: it MUST be avoided.

Change Management: I have yet to encounter one that functions better than a repeater.

My policy based upon experience :

NOTHING allowed except changing redundant disks (during low activity hours).
All other interventions : weekend.

But my Change Management decides ...

Wim
Uwe Zessin
Honored Contributor

Re: What is allowed ?

Even swapping disks can be dangerous! One of my colleagues put a new disk (directly from the shipping carrier) into an HSG80-based storage array and the damn thing completely blocked the SCSI bus!
Lawrence Czlapinski
Trusted Contributor

Re: What is allowed ?

Wim, since you have weekends available for changes, it is good that you can do the changes then.
On our critical production systems, we prefer to make changes during downtimes when feasible. However, our production runs 24x7, and production downtimes can be infrequent.

Downtimes are usually due to hardware or network problems.
RAID CONTROLLERS: Murphy's law was in play in that situation. Have the maintenance company check the revision level prior to installation if at all possible. At one point we had a redundant RAID controller fail on the production system. A change to the backup system hardware, which also has redundant RAID controllers, was scheduled and done. One of the redundant RAID controllers on that system failed before the original RAID controller was fixed. The customer hadn't signed a maintenance agreement with the new maintenance company yet. Several replacement RAID controllers weren't up to the latest firmware revision, which was needed. The firm supplying the replacement RAID controllers was supposed to check them before sending them to the maintenance company, but hadn't. We tried them, but old-revision controllers won't recognize the 9.1 GB RZ1DB-VW disks. We finally got a RAID controller with the latest revision onto the original system. Also, one of the disks in the redundant RAID set was bad. We weighed the risks and scheduled a changeover back to the original system hardware. For us, redundancy on the other logical drives outweighed the one bad disk.
For some reason the spare disk didn't work. A replacement disk was ordered and the hot swap worked. Then it had to be VMS-formatted for shadow disk use. Then the logical disk was brought back into the shadow set and the shadow merged successfully.
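For reference, with host-based volume shadowing the return of a repaired member is a single MOUNT (virtual unit, member name, and label invented here); the copy/merge then runs in the background:

$ ! Add the replacement disk back into the existing shadow set;
$ ! shadowing copies it up to date by itself.
$ MOUNT /SYSTEM DSA101: /SHADOW=($1$DGA101:) PRODDATA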
We still had problems with the revision level of the RAID controller on the backup system. We finally got a RAID controller with the latest revision on that system and got it configured.
Sys Admins are responsible for keeping the production machines up and getting them back up as quickly as feasible if something goes wrong.
Our applications developers schedule changes with the respective managements for the affected systems.
Network Problems:
Historically some of our biggest problems have been network switch problems.
1. Last Thursday a Cisco Systems Catalyst 3550 (100 Mb Ethernet) switch hiccupped. It made 3 nodes unavailable to the users. Fortunately only one of the 3 nodes was a critical production system, and it was a standalone system (usually not a plus). The nodes kept running but could only be accessed through Console Manager. I called my customer contact, then called the network manager and left a voice mail on his cell phone. He called back and had me pull the plug on the switch. The switch came back up and network access was restored. Downtime was about 15 minutes.
2. Historically we had some bad network problems which would bring down one of our sites and our production clusters at that site. The solution was upgrading to a VLAN.
We have our nodes connected to switches. One thing we learned from one outage was to have all members of a cluster at a site on one switch, so that if another network switch caused a problem the cluster would at least stay up, even if it couldn't be accessed by users until the network was restored.
3. Another network problem we had was that connectivity between the two sites was sometimes lost. This affected users and AVAIL_MAN monitoring of the two sites. The non-VMS part of the network had to be restructured so that PC logins / authentications from the 2 sites could be done locally.
Lawrence
Jan van den Ende
Honored Contributor

Re: What is allowed ?

Yes,

I had to admit to some deja vu reading Lawrence. And while he HAS some, if little, schedulable downtime, we in principle have NONE.
Wim: losing HSZ's at one site, causing those disks to leave the shadowsets, did NOT reduce us to one-member sets: the other site had two members (sheer luck). It IS one of the reasons why we have requested Engineering to allow 4 (or preferably, for 3-site configs, 6) members per set!
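
For those who have not run into the limit: a host-based shadow set tops out at 3 members, so a two-site set is necessarily asymmetric, something like this (disk names invented):

$ ! Two members at site A, one at site B: lose B and you are still
$ ! shadowed; lose A and you are down to a single member.
$ MOUNT /SYSTEM DSA42: /SHADOW=($1$DGA101:,$1$DGA102:,$2$DGA201:) PRODDATA

Hence the request for 4 (or 6) members.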

And networks: it is indeed a terrible degradation of reliability that those are, at most organisations, no longer under the control of VMS System Management!! They routinely think M$-style uptimes are good achievements.
And it took us some time (and some unfriendliness-to-colleagues!) to educate SAN management in the VMS ways and standards (it paid off: nowadays THEY are proud when a partial disturbance does NOT affect the users).
In an earlier posting I referred to our SLA, and the explicit exceptions therein.
Those exceptions explicitly state that if systems are not reachable by (part of) the user community (they are currently spread over 58 buildings, up to 30 km from the computer centres, and some applics rely on online info from countrywide central systems, hundreds of km away), and the unreachability is due to some network malfunction, then WE report full availability, i.e., conformance to SLA.
If the network or the desktop (most, but by no means all, users access VMS from the desktop) prohibits access for some, or most, users, then THEIR SLA is violated.

Then again, if the average user has no access to his applic, then "THE computer fails again".
It took some time, but we HAVE lower and middle management understanding the difference.
And at that very time we have UPPER-UPPER-UPPER management (i.e., the government) deciding that it is all to be done differently, "cheaper", and integrated, and they have delegated all technical detail to a "steering group" to whom VMS sounds like an ugly four-letter word.

(from Larry Niven): TANJ

(There Ain't No Justice)


---- Reading this back, it more or less sounds like I am writing off some frustrations. Maybe I am. I don't take it back. Sorry if I've been annoying you with my frustrations. :-(


Still:

Cheers

Have one on me.

Jan
Don't rust yours pelled jacker to fine doll missed aches.
Ian Miller.
Honored Contributor

Re: What is allowed ?

Any change, no matter how trivial, involves risk - it's all about balancing risk and explaining this to the users, management, beancounters, etc.

Re the previous reply by Jan - you've mentioned this before. Either you hope the inertia of large organisations means the downgrade to a non-VMS solution does not happen, or somebody high up wakes up to the risk of moving away from VMS. In these current troubled times, avoidance of risk should be a good thing, I would have thought.
____________________
Purely Personal Opinion
Jan van den Ende
Honored Contributor

Re: What is allowed ?

Ian:

In these current troubled times avoidance of risk should be a good thing I would have thought.


Yes, YOU would have thought!! (as would I)

Is there some way that you could don some (maybe only apparent, but absolutely convincing) authority on the subject, and then give a lecture, publicized as VERY important, to the right people? At that level, price will be no issue at all, but the convincing authority.....

just wishful thinking...


Cheers!


Have one on me.

Jan


Don't rust yours pelled jacker to fine doll missed aches.
Ian Miller.
Honored Contributor

Re: What is allowed ?

Jan, I don't know who you need - someone important in a suit, I suppose - perhaps you could try Mark Gorham, as he does talk to top executives.
____________________
Purely Personal Opinion
Wim Van den Wyngaert
Honored Contributor

Re: What is allowed ?

The reason I asked this is that we had to plan a local SCSI controller replacement. This meant taking 1 node of the GS160 down.

The intervention started. Power off was not possible: the 2nd QBB had to be stopped too. Done that. Both clusters were now in DRP (with a write bitmap to allow a minicopy afterwards).
Now not both QBBs are visible. One part must be replaced to restart the node. The part was ordered and delivered within an hour.
That's where I am now. Normally another hour and the intervention is finished. It took 4 hours instead of 30 minutes, and we lost all DRP capability during those hours (not that important).

Think about what should have happened when I planned this during lunch hours ...

Wim
Jan van den Ende
Honored Contributor

Re: What is allowed ?

Yeahhhh

It's been noted before:

Murphy always wins in the end.

And in your case, having a non-service window: I fully understand that you are now VERY glad you decided to invest part of your weekend, just BECAUSE it took longer than expected!

I am facing something similar:

The electrical wiring in the building of one of our computer rooms is getting updated (and upgraded: with all those new Intel systems we have to keep adding, we are using more and more power, and that heat needs to be aircoed away again => still more power).
This will be done segment-by-segment, and the disconnected segments in the meantime will be fed by a row of big Diesel generators (ugly stinking stuff in themselves!!).
Monday evening there will be a switchover.
And although everyone expects smooth going, we WILL be on-site!
Rationale: if everything goes as expected, we get a thank-you for the extra effort (and get paid overtime), but... SHOULD something somehow go not completely smoothly, then, if we are NOT there, THEN we will have some explaining to do!!

Our job is more or less like boy-scouting:

"Be Prepaired"

Cheers.


Have one on me.


jpe
Don't rust yours pelled jacker to fine doll missed aches.