09-28-2004 08:53 PM
What is allowed ?
The question is "what interventions are allowed during working hours on a critical production system".
To make it more specific: it is an inter-building configuration of 2 GS160s, each with 2 QBBs, that form 2 clusters (each cluster uses 50% of each GS160). Dual HSG80s.
At the application level, one node can be brought down in one cluster, but in the other cluster this must be avoided. So an intervention on the node that is down can have an impact on the other node in that GS160.
Your opinions please ...
Wim
09-28-2004 09:02 PM
Re: What is allowed ?
I worked at a site (a bank) where new stuff (a simple modified .com submitted several times a day, for example) would be put into production only after at least 3 days of testing on a test system, and not on Friday (as it may impact the weekend batches), not on Thursday (as it may impact Friday), but only on Monday to Wednesday.
It is redundant to say that production is important and must not be impacted, no?
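The bank's rule above can be written down as a small check. This is only a sketch in Python: the function name, the constants, and the reading of "at least 3 days test" as three calendar days are assumptions for illustration, not the site's actual procedure.

```python
from datetime import date, timedelta

# Assumed encoding of the rule described above; all names are illustrative.
ALLOWED_WEEKDAYS = {0, 1, 2}   # Monday=0, Tuesday=1, Wednesday=2
MIN_TEST_DAYS = 3              # "at least 3 days" read as calendar days


def may_deploy(test_start: date, deploy_day: date) -> bool:
    """Return True if a change tested since test_start may go live on deploy_day."""
    tested_long_enough = (deploy_day - test_start) >= timedelta(days=MIN_TEST_DAYS)
    return tested_long_enough and deploy_day.weekday() in ALLOWED_WEEKDAYS
```

For example, a change under test since Friday 09-24-2004 would pass on Tuesday 09-28-2004, but the same change would be refused on any Thursday or Friday, however long it had been tested.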
09-28-2004 09:10 PM
Re: What is allowed ?
Changing startup-related command procedures can have unwanted side effects, too. The only way to test them is to reboot. If you make a change but delay the test, you might experience an unexpected reboot which _will_ test your changes at a time when you are not prepared.
09-28-2004 10:13 PM
Re: What is allowed ?
1) Do you allow a battery change in a dual controller? (I saw both controllers cycle on a different type.)
2) Do you allow hot swap of disks? (No problems yet.)
3) Do you allow interventions on tape drives or local SCSI controllers without stopping the whole GS160?
And what else went wrong that was supposed to be no-risk?
09-28-2004 10:14 PM
Re: What is allowed ?
Taking Uwe one step further:
If you want to be totally safe, then you should power down the system, and bury it in concrete.
But: somehow I do not think this will give a good return on investment.
Conclusion: what _HAS TO BE ALLOWED_ is the optimum (best try, or best guess perhaps) between risk and work.
Consider this: if you are not allowed anything, how will you notice, let alone repair, ANY kind of runaway resource use?
Remember: ALL software was ultimately human-made, and it runs on (if nothing else: because of quantum mechanics) somewhat imperfect hardware, so there will ALWAYS be a non-zero probability that at some time something WILL behave in an unplanned way.
And THIS is where you need someone (NOT a drone, like the ones you indicated in a previous thread) to BE ABLE to notice, to actually TAKE notice, to UNDERSTAND what is noticed, to have sufficient knowledge to JUDGE what is noticed, to be able to DETERMINE (the need for?) a corrective action, and to be able and ALLOWED to take the necessary action (in time).
Perhaps we should pick another concept of quantum mechanics:
the very act of OBSERVING a system DOES influence that system, but without observation, most parameters of the system are UNDETERMINED.
THERE SIMPLY IS NO PERFECT SOLUTION!
And only someone who thoroughly knows the applicable business rules can define a configuration that comes close to meeting them.
The price will always be way above budget, and then there will have to be an evaluation of cost vs risk.
It is up to you, and to your professional qualities, to specify WHAT can be done, at WHAT cost. It is up to you to make fully clear to responsible management WHAT the variables in this are, and HOW they influence one another (and I know full well that this is the most difficult part!).
And then, it is up to them to decide WHAT they want to spend, RESULTING in what risks THEY accept.
In your configuration, doing maintenance on one node of one cluster, possibly interfering with the operation of the second cluster, may not be the best solution.
Then again: if there is a sufficient amount of non-"working hours", then you may have to decide that maintenance will have to be done during those hours.
Which of course will influence the cost: back to square one.
But THE REAL ISSUE in our work will be to have management KNOW in advance WHAT they get, and what they WILL NOT get. And there, as in any insurance issue, it will always be too expensive, until it is too late, then there will be too little coverage.
The whole issue boils down to finding the right balance, but only very few people are true equilibrists....
"May you live in interesting times"
fwiw,
Jan
09-28-2004 10:35 PM
Re: What is allowed ?
1) Do you allow a battery change in a dual controller (I saw both controller cycle on a different type)
2) Do you allow hot swap of disks (no problems yet)
3) Do you allow interventions on tape drives or local scsi controllers without stopping the whole
We obviously are running different environments:
Your questions imply that you CAN also do these actions with systems down.
In our 365.25 * 24 configuration, we CAN take down some system, but only AFTER we make doubly sure that the redundant component(s) ARE able to take the full load. That includes reconfiguring beforehand for the system/controller/switch/... that is temporarily leaving.
For instance, changing batteries in a dual controller HAS to be done during operating hours, if only because the batteries do not last 7+ years.
In our case a major GOAL of dual (HSG), or triple (shadow sets) or even quadruple (nodes) redundancy IS to be able to do such hardware maintenance while remaining available (with sufficient capacity).
Jan
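The precondition Jan describes (take a component out only if the remaining redundant ones can carry the full load) might be sketched as follows. This is Python, and the component names, capacity units and the `peak_load` figure are hypothetical; how load and capacity are actually measured is not stated in the thread.

```python
def can_take_down(capacities: dict[str, float], target: str, peak_load: float) -> bool:
    """Return True if the components left after removing `target` cover peak_load.

    `capacities` maps a (hypothetical) component name to the load it can carry,
    in whatever unit the site uses; `peak_load` is in the same unit.
    """
    if target not in capacities:
        return False  # unknown component: refuse rather than guess
    remaining = sum(cap for name, cap in capacities.items() if name != target)
    return remaining >= peak_load
```

With two nodes of capacity 100 each, taking one down would be acceptable at a peak load of 80 but not at 120.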
09-28-2004 10:41 PM
Re: What is allowed ?
But I would take it even further - not that it would help you immediately, I think - back to the point where it is decided to use IT to (help) run your business. It's THEN that a risk estimate should be made, to find out WHAT IF... Not only for unplanned downtime (AKA crashes - which, as Jan scientifically explained, can simply NOT be prevented) but for planned downtime as well, which can partly be determined from statistical evidence of the (non-)robustness and wear of hardware (MTBF, to name one), multiple usage (tape!) and software (OS, LP, applications), but also from (quite regular) monitoring of all components. Not to prevent all crashes, but most of them.
That will lead to the whole picture of the configuration, maintenance windows, risks, prevention - all the components of disaster tolerance and prevention.
That being said, it is obvious that monitoring the systems is simply a requirement - and MUST be allowed. System updates (OS, LP, applications) must indeed be tested, and indeed on a separate system to prevent interaction with the production system, but that will only prove that no problem exists ON THAT TEST ENVIRONMENT. It won't be the first time that installing updates on a production system causes a problem there - even if they have been tested thoroughly on another system (right, Jan?)
Luckily, I would say, it's a VMS environment.
Hopefully, fully redundant - in which case interference can be prevented or at least minimized. As Jan already pointed out (while I was typing this...)
Willem
OpenVMS Developer & System Manager
09-28-2004 11:24 PM
Re: What is allowed ?
We're talking about doing maintenance on a working system. A system that is powered down does not do any work. In that case you can do all the maintenance you want ;-)
09-29-2004 12:00 AM
Re: What is allowed ?
In a non-redundant environment you have to be aware that exchanging components (whatever they may be) simply requires a shutdown.
The more redundancy you build into your environment (based on your risk analysis, or just because you like it (and have the money)), the likelier it becomes that you can carry on uninterrupted while exchanging hardware and software. That's what VMS clustering (and NSK) is all about.
In Jan's case, it IS a running, 24x365.25 system. The way it is set up makes it possible to have at least ONE machine running all, or most, of the required applications while the other node(s) are being maintained. That's why ALL hardware and software could be upgraded without shutting down the cluster for over 7 years. I guess that's what he meant to say: have ONE machine down and let the other continue - in Wim's case: the halves of the two clusters on one side. The question is - and that's a matter of risk analysis - whether that then single-node cluster is safe enough.
OpenVMS Developer & System Manager
09-29-2004 12:16 AM
Re: What is allowed ?
I have seen "no-risk" interventions end badly:
1) controller battery change: disks thrown out of the shadow set, controller bugcheck, controller power cycle (I know, shadow_mbr_tmo was 2 minutes and the controller gives the technician 3 minutes to do the intervention)
2) controller boot: disk thrown out of the shadow set
3) fiber change: the new one didn't work, and when the old one was put back, it didn't work either
4) power cycle, and disks get broken
etc
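The timing mismatch mentioned in item 1 can be checked before an intervention starts. The sketch below is Python and deliberately simplified: it assumes that members are dropped from the shadow set once they have been unreachable for longer than the member timeout, and it uses the two figures quoted above (2 minutes and 3 minutes). The function and parameter names are illustrative only.

```python
def members_survive(shadow_mbr_tmo_s: int, max_outage_s: int) -> bool:
    """Return True if the worst-case outage ends before the member timeout expires.

    Simplifying assumption: a shadow set member is removed once it has been
    unreachable for longer than the timeout; other factors are ignored.
    """
    return max_outage_s < shadow_mbr_tmo_s


# Figures from the post: timeout of 2 minutes, intervention window of 3 minutes.
if not members_survive(shadow_mbr_tmo_s=120, max_outage_s=180):
    print("Intervention window exceeds the member timeout: expect members to be dropped.")
```

Under that assumption, the intervention is only "no-risk" if the timeout is raised above the worst-case outage, or the outage is kept shorter than the timeout.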