Operating System - HP-UX

Can't detect cluster failure

SOLVED
Stephen Andreassend
Regular Advisor

Can't detect cluster failure

We have SG OPS 11.13 and Oracle 9i RAC installed.

When we disconnect the shared disk array from both nodes to test a failure scenario, the cluster doesn't detect the error, and hence neither does Oracle's listener, since it depends on the cluster layer.

Is there a parameter that can influence the detection of hardware loss? Our cluster lock disk is now disconnected and neither node has halted the cluster automatically.

Thx
Steve
10 REPLIES
BFA6
Respected Contributor

Re: Can't detect cluster failure

I would expect both nodes to start complaining that the cluster lock disk is missing. Are there any messages in /var/adm/syslog/syslog.log from cmcld?
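A quick way to look is to filter the syslog for the cluster daemon, e.g. `grep cmcld /var/adm/syslog/syslog.log`. Sketched here against a stand-in excerpt so the filtering step is concrete:

```shell
# On HP-UX the real file would be /var/adm/syslog/syslog.log; this excerpt
# is a stand-in (the second line mirrors a message seen later in this thread).
log='Apr 29 15:05:48 rac2 vmunix: SCSI: Read error -- dev: b 31 0x0d2000
Apr 29 14:07:44 rac1 cmcld: Cluster lock /dev/dsk/c9t0d0 is back on-line'

# Keep only messages from the cluster daemon, cmcld
printf '%s\n' "$log" | grep cmcld
```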

Hilary
Stephen Andreassend
Regular Advisor

Re: Can't detect cluster failure

No, though as expected there are lots of SCSI errors on both nodes, e.g.:

Apr 29 15:05:48 rac2 vmunix: SCSI: Read error -- dev: b 31 0x0d2000, errno: 126, resid: 2048,
Apr 29 15:05:48 rac2 vmunix: SCSI: Read error -- dev: b 31 0x0d4000, errno: 126, resid: 2048,
Apr 29 15:05:48 rac2 vmunix: SCSI: Read error -- dev: b 31 0x0c6000, errno: 126, resid: 2048,
Apr 29 15:05:48 rac2 vmunix: SCSI: Read error -- dev: b 31 0x0d8000, errno: 126, resid: 2048,
Apr 29 15:05:48 rac2 vmunix: SCSI: Read error -- dev: b 31 0x0c9000, errno: 126, resid: 2048,
Apr 29 15:05:48 rac2 vmunix: SCSI: Read error -- dev: b 31 0x0ca000, errno: 126, resid: 2048,
Apr 29 15:05:48 rac2 vmunix: SCSI: Read error -- dev: b 31 0x0cc000, errno: 126, resid: 2048,
Apr 29 15:05:48 rac2 vmunix: SCSI: Read error -- dev: b 31 0x0df000, errno: 126, resid: 2048,
Apr 29 15:05:53 rac2 vmunix: SCSI: Read error -- dev: b 31 0x0d0000, errno: 126, resid: 2048,
Apr 29 15:05:53 rac2 vmunix: SCSI: Read error -- dev: b 31 0x0d1000, errno: 126, resid: 2048,

Apr 29 13:34:16 rac1 vmunix: SCSI: Read error -- dev: b 31 0x091000, errno: 126, resid: 2048,
Apr 29 13:34:16 rac1 vmunix: SCSI: Read error -- dev: b 31 0x092000, errno: 126, resid: 2048,
Apr 29 13:34:16 rac1 vmunix: SCSI: Read error -- dev: b 31 0x094000, errno: 126, resid: 2048,
Apr 29 13:34:16 rac1 vmunix: SCSI: Read error -- dev: b 31 0x086000, errno: 126, resid: 2048,
Apr 29 13:34:16 rac1 vmunix: SCSI: Read error -- dev: b 31 0x08a000, errno: 126, resid: 2048,
Apr 29 13:34:16 rac1 vmunix: SCSI: Read error -- dev: b 31 0x08c000, errno: 126, resid: 2048,
Apr 29 13:34:16 rac1 vmunix: SCSI: Read error -- dev: b 31 0x09f000, errno: 126, resid: 2048,
Apr 29 13:34:16 rac1 vmunix: SCSI: Read error -- dev: b 31 0x091000, errno: 126, resid: 8192,
Apr 29 13:34:16 rac1 vmunix: blkno: 1056088, sectno: 2112176, offset: 1081434112, bcount: 8192.
Apr 29 13:34:16 rac1 vmunix: blkno: 8, sectno: 16, offset: 8192, bcount: 2048.
Apr 29 13:34:21 rac1 above message repeats 7 times
Stephen Andreassend
Regular Advisor

Re: Can't detect cluster failure

Plug the disk array back in and away we go:

Apr 29 14:07:44 rac1 cmcld: Cluster lock /dev/dsk/c9t0d0 is back on-line

Still, the cluster view command (cmviewcl) reported everything as healthy while the disks were disconnected for an hour. So no hardware failure detection was running.
melvyn burnard
Honored Contributor

Re: Can't detect cluster failure

Well, the test you have performed is not what I would call a valid test: losing disc connectivity on BOTH nodes is a multiple failure, something SG is not designed to react to correctly.
Also, it is LVM that is monitoring the file systems etc., not SG.
To get the effect of forcing a failure, you need to set up the packages to monitor a resource, namely the discs themselves, using EMS.
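For reference, a hedged sketch of what such an EMS resource dependency might look like in a Serviceguard package ASCII configuration file (the resource name, volume group, and values here are illustrative assumptions, not taken from this thread):

```
# Hypothetical EMS resource stanza in a package ASCII file:
# poll the PV summary status of volume group "vgora" every 60 seconds,
# and consider the package resource down when it is no longer UP.
RESOURCE_NAME               /vg/vgora/pv_summary
RESOURCE_POLLING_INTERVAL   60
RESOURCE_START              AUTOMATIC
RESOURCE_UP_VALUE           = UP
```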
My house is the bank's, my money the wife's, But my opinions belong to me, not HP!
Stephen Andreassend
Regular Advisor

Re: Can't detect cluster failure

A correction to the situation:

The cables to the disks were pulled out on only one node; the other cluster node was still able to access the FC10 via the Brocade switch.

The cluster process did report the loss of access to the cluster lock disk in the syslog; however, the cluster status was still reported as up, and Oracle did not shut down on the bad node despite its loss of disk access.

This meant that Oracle sessions connected to the bad node just hung until the Oracle instance was manually aborted. This could be a timeout-related issue.
BFA6
Respected Contributor

Re: Can't detect cluster failure

Melvyn,

If you have two paths to the shared disks from both nodes - primary and alternate link - and you pull both cables from one node to the shared disks, will that also count as a multiple failure?

Hilary
melvyn burnard
Honored Contributor
Solution

Re: Can't detect cluster failure

Pulling BOTH links is NOT a SPOF (Single Point of Failure), but an MPOF, or Multiple Point of Failure.

Again, even if only one node lost all comms with the discs, this is not necessarily a ServiceGuard-protected event, unless you have set up a package to monitor the availability of the discs using EMS and the hardware or HA monitors.
One thing to remember: the LVM/SCSI code will see the discs as unavailable, but will attempt to retry EACH LUN until the PV timeout value is reached, and then try the next PV.
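The PV timeout in question can be inspected and changed with the HP-UX LVM commands; the device path below is illustrative, and this is a sketch rather than a tested procedure:

```
# Show the IO timeout currently set on a physical volume
pvdisplay /dev/dsk/c9t0d0 | grep -i timeout

# Set an explicit 60-second IO timeout on that PV
# (a timeout of 0 means use the driver's default)
pvchange -t 60 /dev/dsk/c9t0d0
```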

ServiceGuard does not protect against what you have done, as it relies on the LVM/disc technology to provide the high availability, either using mirroring via two separate paths, or using RAID via PVLinks.
My house is the bank's, my money the wife's, But my opinions belong to me, not HP!
Stephen Andreassend
Regular Advisor

Re: Can't detect cluster failure

What we want to see is the Oracle instance abort on the node with no disk access, so all future connections to Oracle are routed to the surviving node instead.

Our problem is that the Oracle instance does not abort, so new connections just hang rather than get routed over to the other node.

Are you saying that the only way to achieve this functionality is to set up EMS to detect disk access loss and run a script to force an abort of Oracle?

thx
Steve
Tim Clemens
New Member

Re: Can't detect cluster failure

A monitored resource (EMS) is one way. However, I believe you're asking a higher-level question: you want to detect when Oracle is not responding correctly, including hangs. To do this, you need to modify the instance monitoring script to connect to and select from Oracle, AND make sure that script itself doesn't hang.
This can be done many ways, but one way is to kick off a background job that will time out. Meanwhile, your monitoring script checks Oracle's status. If Oracle checks out OK, kill the background script. If the background script times out, it checks whether the Oracle status script is still running; if it is, Oracle could be hung.
Sounds messy, doesn't it.
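A minimal sketch of that watchdog pattern in shell. The `check_oracle` function here is a placeholder probe (a real monitor would run a short sqlplus session that connects and selects from the instance), and the timeout value is illustrative:

```shell
#!/bin/sh
# Sketch: run a health probe under a watchdog timeout, so the monitor
# itself cannot hang even if Oracle hangs.
TIMEOUT=5

check_oracle() {
    # Stand-in for a real Oracle health probe; here it simply succeeds quickly.
    sleep 1
}

check_oracle &                 # run the probe in the background
probe_pid=$!

# Watchdog: if the probe is still alive after $TIMEOUT seconds, kill it.
( sleep "$TIMEOUT" && kill "$probe_pid" ) 2>/dev/null &
watchdog_pid=$!

if wait "$probe_pid"; then
    status=OK                  # probe returned in time: Oracle looks healthy
    kill "$watchdog_pid" 2>/dev/null || true
else
    status=HUNG                # probe was killed by the watchdog: Oracle may be hung
fi
echo "oracle $status"
```

In the HUNG case the monitoring script would then abort the instance so clients fail over to the surviving node.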