<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Disaster Test in Operating System - HP-UX</title>
    <link>https://community.hpe.com/t5/operating-system-hp-ux/desaster-test/m-p/2599463#M716014</link>
    <description>I could not retrieve your attachment, but from your description, yes, it is normal, although you didn't need to perform a manual TOC on the primaries to deactivate the VGs.  When you experience a failure of some kind (e.g. network), the "heartbeat" that is sent between the two nodes stops, which causes each node to race for the cluster lock disk (required in two-node clusters).  The node that gets the lock disk forms a one-node cluster; the other panics.  When you perform a cmruncl, the following happens with respect to your applications:&lt;BR /&gt;&lt;BR /&gt;1. volume group activation&lt;BR /&gt;2. check and mount file systems&lt;BR /&gt;3. assign the package IP&lt;BR /&gt;4. start user-defined run commands&lt;BR /&gt;5. start service processes&lt;BR /&gt;&lt;BR /&gt;which is why your Oracle and SAP were started automatically after cmruncl (they are part of #4).&lt;BR /&gt;&lt;BR /&gt;Hope this helps&lt;BR /&gt;Chris</description>
    <pubDate>Tue, 23 Oct 2001 09:19:51 GMT</pubDate>
    <dc:creator>Christopher McCray_1</dc:creator>
    <dc:date>2001-10-23T09:19:51Z</dc:date>
    <item>
      <title>Disaster Test</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/desaster-test/m-p/2599462#M716013</link>
      <description>Dear experts,&lt;BR /&gt;We have several 2-node clusters in 2 different datacenters running SAP with the MC/SG SAP extension.&lt;BR /&gt;We recently performed a disaster test to check whether things behave as they should. For this test we cut all lines (network, Fibre Channel, ...) to simulate the loss of the entire datacenter.&lt;BR /&gt;The primary nodes with the Oracle DBs are running in the datacenter we shut down.&lt;BR /&gt;Result: the alternate nodes TOC'ed, the primary nodes remained up and could not deactivate the volume groups until we performed a manual TOC. After a reboot of the alternate nodes and a cmruncl, the cluster came up (it asked us to make sure the primary nodes are really down) and Oracle/SAP was started.&lt;BR /&gt;NODE_TIMEOUT=6000000&lt;BR /&gt;HEARTBEAT_INTERVAL=2000000&lt;BR /&gt;NODE_FAIL_FAST_ENABLED=yes&lt;BR /&gt;Is this behaviour correct?&lt;BR /&gt;(complete config attached)&lt;BR /&gt;Thanx&lt;BR /&gt;</description>
      <pubDate>Tue, 23 Oct 2001 09:07:40 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/desaster-test/m-p/2599462#M716013</guid>
      <dc:creator>Paul Barmettler</dc:creator>
      <dc:date>2001-10-23T09:07:40Z</dc:date>
    </item>
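    <item>
      <title>[Editor's note: configuration fragment]</title>
      <description>The NODE_TIMEOUT and HEARTBEAT_INTERVAL values quoted above are timing parameters from the ServiceGuard cluster ASCII configuration file and are expressed in microseconds (NODE_FAIL_FAST_ENABLED, by contrast, is set per package in the package configuration file). A minimal sketch of the relevant cluster-file fragment follows; the cluster name, node name, interface, IP, and lock disk path are placeholders, not taken from the attached config:&lt;BR /&gt;&lt;BR /&gt;
# Excerpt (sketch) from a ServiceGuard cluster ASCII configuration file.
# All timing values are in microseconds.
CLUSTER_NAME            sap_cluster          # placeholder
FIRST_CLUSTER_LOCK_VG   /dev/vglock          # cluster lock VG; required in a two-node cluster

NODE_NAME               node1                # placeholder
  NETWORK_INTERFACE     lan0                 # placeholder
    HEARTBEAT_IP        10.0.0.1             # placeholder
  FIRST_CLUSTER_LOCK_PV /dev/dsk/c0t0d0      # placeholder lock disk path

HEARTBEAT_INTERVAL      2000000              # 2 s between heartbeats (value from the post)
NODE_TIMEOUT            6000000              # 6 s without heartbeats triggers cluster reformation (value from the post)
&lt;BR /&gt;With these values, roughly three heartbeats can be missed before a node declares its peer gone and races for the cluster lock disk.</description>
    </item>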
    <item>
      <title>Re: Disaster Test</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/desaster-test/m-p/2599463#M716014</link>
      <description>I could not retrieve your attachment, but from your description, yes, it is normal, although you didn't need to perform a manual TOC on the primaries to deactivate the VGs.  When you experience a failure of some kind (e.g. network), the "heartbeat" that is sent between the two nodes stops, which causes each node to race for the cluster lock disk (required in two-node clusters).  The node that gets the lock disk forms a one-node cluster; the other panics.  When you perform a cmruncl, the following happens with respect to your applications:&lt;BR /&gt;&lt;BR /&gt;1. volume group activation&lt;BR /&gt;2. check and mount file systems&lt;BR /&gt;3. assign the package IP&lt;BR /&gt;4. start user-defined run commands&lt;BR /&gt;5. start service processes&lt;BR /&gt;&lt;BR /&gt;which is why your Oracle and SAP were started automatically after cmruncl (they are part of #4).&lt;BR /&gt;&lt;BR /&gt;Hope this helps&lt;BR /&gt;Chris</description>
      <pubDate>Tue, 23 Oct 2001 09:19:51 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/desaster-test/m-p/2599463#M716014</guid>
      <dc:creator>Christopher McCray_1</dc:creator>
      <dc:date>2001-10-23T09:19:51Z</dc:date>
    </item>
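    <item>
      <title>[Editor's note: package startup sketch]</title>
      <description>The five startup steps Chris lists correspond roughly to what a ServiceGuard package control script executes when a package starts on a node. A simplified, hypothetical outline follows (the real script is generated by cmmakepkg and is much longer; the VG name, logical volume, mount point, IP address, subnet, and script path below are placeholders, not from this thread):&lt;BR /&gt;&lt;BR /&gt;
# Hypothetical, simplified package startup sequence (not the actual control script)
vgchange -a e /dev/vgsap                        # 1. activate the volume group in exclusive mode
fsck -F vxfs /dev/vgsap/rlvol1                  # 2. check ...
mount /dev/vgsap/lvol1 /oracle                  #    ... and mount the file systems
cmmodnet -a -i 10.0.0.100 10.0.0.0              # 3. add the package's relocatable IP to the subnet
/etc/cmcluster/sap/customer_defined_run_cmds    # 4. user-defined run commands (Oracle/SAP start here)
                                                # 5. service processes are then started and monitored
&lt;BR /&gt;This ordering is why the database and SAP instance came up automatically once cmruncl reformed the cluster: package startup ran end to end, including step 4.</description>
    </item>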
    <item>
      <title>Re: Disaster Test</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/desaster-test/m-p/2599464#M716015</link>
      <description>Thanks Chris&lt;BR /&gt;I didn't mention that the VG lock disks reside on the XP in the second datacenter, the one the primary nodes can't reach anymore.&lt;BR /&gt;I'll try to attach the config once more.</description>
      <pubDate>Tue, 23 Oct 2001 13:02:08 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/desaster-test/m-p/2599464#M716015</guid>
      <dc:creator>Paul Barmettler</dc:creator>
      <dc:date>2001-10-23T13:02:08Z</dc:date>
    </item>
    <item>
      <title>Re: Disaster Test</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/desaster-test/m-p/2599465#M716016</link>
      <description>I want to make sure I understand the physical layout.  Is it one of the following:&lt;BR /&gt;&lt;BR /&gt;1) the servers (alternate and primary) in one datacenter and the XP in the other?&lt;BR /&gt;2) the primary servers in one datacenter and the alternates in the other; in this case, which servers is the XP co-located with?&lt;BR /&gt;&lt;BR /&gt;Sorry, I must have missed the XP location part, but it would help a lot if you could answer the above.  Thanks.&lt;BR /&gt;&lt;BR /&gt;Chris</description>
      <pubDate>Tue, 23 Oct 2001 13:15:17 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/desaster-test/m-p/2599465#M716016</guid>
      <dc:creator>Christopher McCray_1</dc:creator>
      <dc:date>2001-10-23T13:15:17Z</dc:date>
    </item>
    <item>
      <title>Re: Disaster Test</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/desaster-test/m-p/2599466#M716017</link>
      <description>You should really provide the syslog of both nodes to give us a better picture of the event.&lt;BR /&gt;&lt;BR /&gt;One thing I believe you might have overlooked is that both nodes need access to BOTH cluster lock disks. Under specific circumstances (e.g. the return code of the system call to access the cluster lock indicates an I/O error or a power failure of the disk), SG requires only one of the two lock disks to form a cluster.&lt;BR /&gt;&lt;BR /&gt;Without seeing the syslogs, I dare to maintain that the primary node got the cluster lock (of the alternate datacenter?) and the alternate did not, and therefore performed a TOC. The syslogs will give us the details.&lt;BR /&gt;&lt;BR /&gt;Carsten</description>
      <pubDate>Tue, 23 Oct 2001 14:45:02 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/desaster-test/m-p/2599466#M716017</guid>
      <dc:creator>Carsten Krege</dc:creator>
      <dc:date>2001-10-23T14:45:02Z</dc:date>
    </item>
    <item>
      <title>Re: Disaster Test</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/desaster-test/m-p/2599467#M716018</link>
      <description>That's what I was thinking and hoped I was conveying, Carsten.  I was thinking that the primary servers somehow had connectivity to a cluster lock disk.  Thanks for clearing that up for me; I fear I was too vague.&lt;BR /&gt;&lt;BR /&gt;It is possible that this is the case, Neuhaus, but please send us the logs Carsten mentioned.  Thanks.&lt;BR /&gt;&lt;BR /&gt;Chris&lt;BR /&gt;</description>
      <pubDate>Tue, 23 Oct 2001 14:51:04 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/desaster-test/m-p/2599467#M716018</guid>
      <dc:creator>Christopher McCray_1</dc:creator>
      <dc:date>2001-10-23T14:51:04Z</dc:date>
    </item>
    <item>
      <title>Re: Disaster Test</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/desaster-test/m-p/2599468#M716019</link>
      <description>Thanks for your responses.&lt;BR /&gt;I guess you are right, Carsten. One of my colleagues told me that he saw a message on the console of an alternate node saying it was not able to obtain the cluster lock disk. Probably because there was too much time between cutting the LAN cables and cutting the FC cables.&lt;BR /&gt;Sorry, I can't provide syslogs, because I didn't save them before the next reboot!&lt;BR /&gt;For your understanding, I attach the physical layout.</description>
      <pubDate>Wed, 24 Oct 2001 10:53:18 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/desaster-test/m-p/2599468#M716019</guid>
      <dc:creator>Paul Barmettler</dc:creator>
      <dc:date>2001-10-24T10:53:18Z</dc:date>
    </item>
  </channel>
</rss>

