
Disaster Test

 
Paul Barmettler
Frequent Advisor

Disaster Test

Dear experts,
We have several two-node clusters in two different datacenters running SAP with the MC/ServiceGuard SAP extension.
We recently performed a disaster test to check whether things behave as they should. For this test we cut all lines (network, Fibre Channel, ...) to simulate the loss of an entire datacenter.
The primary nodes with the Oracle DBs run in the datacenter we shut down.
Result: the alternate nodes TOC'ed, while the primary nodes remained up and could not deactivate the volume groups until we performed a manual TOC. After rebooting the alternate nodes and running cmruncl, the cluster came up (after asking us to make sure the primary nodes were really down) and Oracle/SAP was started.
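In commands, what we did on the alternate nodes was roughly this (a sketch using the standard ServiceGuard CLI; cmruncl is what prompted us to confirm the primaries were down):

  cmruncl -v      # manually form the cluster; asks for confirmation that
                  # the unreachable (primary) nodes are really down
  cmviewcl -v     # verify cluster, node and package status afterwards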
NODE_TIMEOUT=6000000
HEARTBEAT_INTERVAL=2000000
NODE_FAIL_FAST_ENABLED=yes
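For clarity, the first two values are in microseconds, i.e. a 2 s heartbeat interval and a 6 s node timeout. An annotated sketch of where each parameter lives (note that NODE_FAIL_FAST_ENABLED is set per package, in the package ASCII file, not in the cluster file):

  # cluster ASCII file -- values in microseconds
  HEARTBEAT_INTERVAL      2000000     # send a heartbeat every 2 s
  NODE_TIMEOUT            6000000     # declare a node lost after 6 s of silence

  # package ASCII file
  NODE_FAIL_FAST_ENABLED  yes         # TOC the node if the package fails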
Is this behaviour correct?
(complete config attached)
Thanks
Christopher McCray_1
Honored Contributor
Solution

Re: Disaster Test

I could not retrieve your attachment, but from your description, yes, it is normal, although you didn't need to perform a manual TOC on the primaries to deactivate the VGs. When you experience a failure of some kind (e.g. network), the heartbeat sent between the two nodes stops, which causes each node to race for the cluster lock disk (required in two-node clusters). The node that gets the lock disk forms a one-node cluster; the other panics. When you perform a cmruncl, the following happens with respect to your applications:

1. volume group activation
2. check and mount file systems
3. assign pkg ip
4. start user defined run commands
5. start service processes

which is why your Oracle and SAP were started automatically after cmruncl (they are part of #4).
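As a sketch of where step 4 lives: in a legacy-style package control script, steps 1-3 are handled by the script framework, and step 4 is the customer_defined_run_cmds function. The start commands below are purely illustrative (user and SID names assumed; in an SGeSAP setup the extension's own integration scripts do this):

  function customer_defined_run_cmds
  {
      # hypothetical start calls -- SGeSAP's own scripts do this in reality
      su - oraprd -c "lsnrctl start"    # start the Oracle listener (SID PRD assumed)
      su - prdadm -c "startsap r3"      # start the SAP instance
  }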

Hope this helps
Chris
It wasn't me!!!!
Paul Barmettler
Frequent Advisor

Re: Disaster Test

Thanks, Chris.
I didn't mention that the VG lock disks reside on the XP in the second datacenter, the one the primary nodes can no longer reach.
I'll try to attach the config once more.
Christopher McCray_1
Honored Contributor

Re: Disaster Test

I want to make sure I understand the physical layout. Is it one of the following:

1) the servers (alt and pri) in one datacenter and the XP in the other?
2) the primary servers in one datacenter and the alts in the other; in this case, which servers is the XP co-located with?

Sorry, I must have missed the XP location part, but it would help a lot if you could answer the above. Thanks.

Chris
It wasn't me!!!!
Carsten Krege
Honored Contributor

Re: Disaster Test

You should really provide the syslog of both nodes to give us a better picture of the event.

One thing I believe you might be overlooking is that both nodes need access to BOTH cluster lock disks. Under specific circumstances (i.e. when the return code of the system call to access the cluster lock indicates an I/O error or a power failure of the disk), SG requires only one of the two lock disks to form a cluster.
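For context, a dual cluster lock is declared in the cluster ASCII file along these lines (VG and device file names here are illustrative):

  FIRST_CLUSTER_LOCK_VG    /dev/vglock1       # lock disk in datacenter 1
  SECOND_CLUSTER_LOCK_VG   /dev/vglock2       # lock disk in datacenter 2

  NODE_NAME node1
    NETWORK_INTERFACE      lan0
    FIRST_CLUSTER_LOCK_PV  /dev/dsk/c4t0d1    # this node's path to lock 1
    SECOND_CLUSTER_LOCK_PV /dev/dsk/c6t0d1    # this node's path to lock 2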

Without seeing the syslogs, I would venture that the primary node got the cluster lock (the one in the alternate datacenter?) and the alternate did not, and therefore performed a TOC. The syslogs will give us the details.

Carsten
-------------------------------------------------------------------------------------------------
In the beginning the Universe was created. This has made a lot of people very angry and been widely regarded as a bad move. -- HhGttG
Christopher McCray_1
Honored Contributor

Re: Disaster Test

That's what I was thinking and hoped I was conveying, Carsten. I was thinking that the primary servers somehow had connectivity to a cluster lock disk. Thanks for clearing that up for me; I fear I was too vague.

It is possible that this is the case, Paul, but please send us the logs Carsten mentioned. Thanks.

Chris

It wasn't me!!!!
Paul Barmettler
Frequent Advisor

Re: Disaster Test

Thanks for your responses.
I guess you are right, Carsten. One of my colleagues told me that he saw a message on the console of an alternate node saying it was not able to obtain the cluster lock disk. Probably because there was too much time between cutting the LAN cables and cutting the FC cables.
Sorry, I can't provide the syslogs, because I didn't save them before the next reboot!
For your understanding, I attach the physical layout.
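That would also fit the numbers: with NODE_TIMEOUT = 6000000 us = 6 s, cluster reformation and the race for the lock start roughly 6 s after the last heartbeat. If the LAN cables were cut more than about 6 s before the FC cables, the race ran while the primary nodes could still reach the lock disk, so they could win it and the alternates TOC'ed, which is consistent with the console message your colleague saw.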