
Disaster procedures

 
Wim Van den Wyngaert
Honored Contributor

Disaster procedures

I need some documents.

I have an interbuilding cluster of 2 servers and 1 quorum node. They are connected with FDDI (no FC).

Can anyone share procedures (for dummies) to analyze the cluster and to take the correct actions? I'm thinking of network failures, network card failures, routing problems, disk failures, unstable fiber, fiber failures, unstable VMS, creeping doom, building failure (9/11), double disk failures, triple disk failures (we use shadowing of RAID sets with auto spares), etc.

Wim
6 REPLIES
Mobeen_1
Esteemed Contributor

Re: Disaster procedures

Wim,
Are you just looking for DR templates in general, or for the actual DR plans that anyone has for an OpenVMS cluster?

regards
Mobeen
Wim Van den Wyngaert
Honored Contributor

Re: Disaster procedures

Mobeen,

Both, but preferably the documents that are actually in use, with the exact commands to type.

Wim
Mobeen_1
Esteemed Contributor

Re: Disaster procedures

Wim,
Unfortunately I am not able to share the DR plans that I have, but I would like to point you to some resources that should be helpful:

http://www.calamityprevention.com/

http://www.disasterrecovery.com/

I took a look at some of the example documents contained therein.

regards
Mobeen
labadie_1
Honored Contributor

Re: Disaster procedures

Starting with VMS 7.3, you have
$ mc scacp
to monitor the SCS traffic.
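As a rough illustration, an SCACP session for a first look at the cluster interconnect could go along these lines. This is only a sketch: the three SHOW commands are the ones I would start with, and the output (not shown here) depends entirely on your configuration.

```dcl
$ ! SHOW LAN_DEVICE : LAN adapters carrying cluster (SCS) traffic
$ ! SHOW CHANNEL    : paths between a local and a remote LAN adapter
$ ! SHOW VC         : virtual circuits to the other cluster members
$ MC SCACP
SCACP> SHOW LAN_DEVICE
SCACP> SHOW CHANNEL
SCACP> SHOW VC
SCACP> EXIT
```

Running this during normal operation and keeping the output gives you a baseline to compare against when a link or adapter is suspected.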

Cockpit Manager should suit your needs: http://www.hp.be/CockpitMgr

DECamds or Availability Manager can help too.

The tools from Keith Parris will help as well:

http://h71000.www7.hp.com/freeware/freeware60/kp_clustertools/

regards

Gerard

Martin P.J. Zinser
Honored Contributor

Re: Disaster procedures

Hello Wim,

Don't laugh, but the most important commands to use to check whether things are working correctly are very simple: SHOW DEVICE, DECnet loop, IP ping, etc.
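A minimal sketch of such a first round of checks follows. The node name NODEB and the virtual unit DSA1: are placeholders, not names from this thread. SHOW CLUSTER is my own addition to the list, and the ping line assumes HP TCP/IP Services (other IP stacks use a different command).

```dcl
$ SHOW CLUSTER                 ! which members does this node currently see?
$ SHOW DEVICE DSA              ! shadow set virtual units and their members
$ SHOW DEVICE/FULL DSA1:       ! details for one (hypothetical) virtual unit
$ MCR NCP LOOP NODE NODEB      ! DECnet Phase IV loop test to the other building
$ TCPIP PING NODEB             ! IP reachability, assuming TCP/IP Services
```

Running the same checks from a node in each building helps tell a problem on one node apart from a problem on the inter-building link.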

As for planning and preparation, first get a clear picture of your current layout at the box, cluster, and network level. Do not forget about infrastructure (power, air conditioning). Identify single points of failure and remove them if possible. Most important: once you have your plan in place, test it. This is an iterative process, as you are likely to find things during testing that you did not think about beforehand.

Greetings, Martin
Keith Parris
Trusted Contributor

Re: Disaster procedures

Are you simply taking advantage of the fact that you can easily and very inexpensively provide a level of disaster tolerance because your systems are running OpenVMS and Volume Shadowing software and you have fiber between buildings? Or are your applications really mission-critical, so that you need full disaster tolerance?

The easiest and safest way to set up a new disaster-tolerant cluster is to buy the DTCS (Disaster Tolerant Cluster Services) package from HP. That contains all the assistance one needs in terms of design, planning, implementation, testing, training, and documentation. It doesn't add a significant percentage to the cost of the configuration. It includes software to provide monitoring and operational functions and to provide a framework and integration for an entire set of monitoring tools. It provides information on how to operate your DT cluster both in normal conditions as well as various failure scenarios. It helps you avoid common mistakes. Without it, you have to learn from real-life experience, which can be risky.

For full disaster tolerance, you'll need a number of things in place for monitoring and control -- monitoring of all the network links (for this, LAVC$FAILURE_ANALYSIS is very helpful -- see the article "Local Area Network Cluster Interconnect Monitoring" in the OpenVMS Technical Journal, V2, at http://h71000.www7.hp.com/openvms/journal/v2/index.html), console management so you can boot and reboot systems remotely and monitor and record system console output, monitoring in place to detect node failures (since you have no inter-site FC link, you're dependent on at least one node being up and running at each site to provide MSCP-serving access to the disks at that site to keep the shadowsets alive), and monitoring of shadowset membership (this is very crucial).
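For the shadowset membership part, a small DCL procedure can serve as a starting point. This is only a sketch under stated assumptions: the virtual unit DSA1: and the expected member count of 2 are placeholders, and the loop counts members regardless of whether they are fully copied or still in a copy or merge state.

```dcl
$ ! CHECK_SHADOW.COM - count the members of one shadow set (illustrative only)
$ vu = "DSA1:"            ! hypothetical virtual unit
$ expected = 2            ! assumed: one member per building
$ count = 0
$ ! For a virtual unit, SHDW_NEXT_MBR_NAME returns a member; for a member,
$ ! it returns the next member, or a null string when there are no more.
$ member = F$GETDVI(vu,"SHDW_NEXT_MBR_NAME")
$loop:
$ IF member .EQS. "" THEN GOTO done
$ count = count + 1
$ WRITE SYS$OUTPUT "  member: ", member
$ member = F$GETDVI(member,"SHDW_NEXT_MBR_NAME")
$ GOTO loop
$done:
$ IF count .LT. expected THEN -
    WRITE SYS$OUTPUT "WARNING: ", vu, " has ", count, " of ", expected, " members"
$ EXIT
```

In practice you would run something like this periodically from a batch job on a node at each site and send the warning to an operator rather than to SYS$OUTPUT.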

MOUNT commands in startup procedures need to be carefully designed to avoid untimely or wrong-way shadow copies. Some new /POLICY qualifiers can be very helpful in this area.
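As an illustration, a startup MOUNT using two of the /POLICY keywords might look like the sketch below. The device names, virtual unit, and volume label are placeholders, and the comments describe the keywords as I understand them, so check the Volume Shadowing manual for your VMS version before relying on them.

```dcl
$ ! REQUIRE_MEMBERS: do not form the shadow set unless every listed
$ !                  member is accessible to this node.
$ ! VERIFY_LABEL   : a member that would become a copy target must carry
$ !                  the volume label SCRATCH_DISK.
$ MOUNT/SYSTEM DSA1: /SHADOW=($1$DKA100:,$2$DKA100:) DATA1 -
        /POLICY=(REQUIRE_MEMBERS,VERIFY_LABEL)
```

The idea is that a node booting while the other building is unreachable fails the mount instead of quietly creating a one-member shadow set from possibly stale data, which could later be copied over the good member.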

Is your quorum node at one of the two main sites, or at a 3rd location? If it is at a 3rd site, then a quorum recovery tool (Availability Manager or DECamds can do the job) is perhaps less crucial than it would be if you have only 2 sites, but should still be in place.

There's a lot of information on disaster-tolerant clusters, including class notes for a 1-day seminar, at http://www2.openvms.org/kparris/ and http://www.geocities.com/keithparris/

In addition to the ones cited earlier, here are some other general DR/BC resources:
Contingency Planning & Management magazine: http://www.contingencyplanning.com/
Continuity Insights Magazine: http://www.continuityinsights.com/
The Uptime Institute: http://upsite.com/