<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Can't detect cluster failure in Operating System - HP-UX</title>
    <link>https://community.hpe.com/t5/operating-system-hp-ux/cant-detect-cluster-failure/m-p/2712889#M713325</link>
    <description>A correction to the situation.&lt;BR /&gt;&lt;BR /&gt;The cables to the disks were pulled out on only one node; the other cluster node was still able to access the FC10 via the Brocade switch.&lt;BR /&gt;&lt;BR /&gt;The cluster process did report the loss of access to the cluster lock disk in the syslog; however, the cluster status was still reported as up, and Oracle did not shut down on the bad node despite its loss of disk access.&lt;BR /&gt;&lt;BR /&gt;This meant that Oracle sessions connected to the bad node simply hung until the Oracle instance was manually aborted. This could be a timeout-related issue.</description>
    <pubDate>Mon, 29 Apr 2002 12:55:39 GMT</pubDate>
    <dc:creator>Stephen Andreassend</dc:creator>
    <dc:date>2002-04-29T12:55:39Z</dc:date>
    <item>
      <title>Can't detect cluster failure</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/cant-detect-cluster-failure/m-p/2712884#M713320</link>
      <description>We have SG OPS 11.13 and Oracle 9i RAC installed.&lt;BR /&gt;&lt;BR /&gt;When we disconnect the shared disk array from both nodes to test a failure scenario, the cluster doesn't detect the error, and hence neither does Oracle's listener, since it is dependent on the cluster layer.&lt;BR /&gt;&lt;BR /&gt;Is there a parameter that can influence the detection of hardware loss? Our cluster lock disk is now disconnected, and neither node has halted the cluster automatically.&lt;BR /&gt;&lt;BR /&gt;Thx&lt;BR /&gt;Steve</description>
      <pubDate>Mon, 29 Apr 2002 09:45:46 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/cant-detect-cluster-failure/m-p/2712884#M713320</guid>
      <dc:creator>Stephen Andreassend</dc:creator>
      <dc:date>2002-04-29T09:45:46Z</dc:date>
    </item>
    <item>
      <title>Re: Can't detect cluster failure</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/cant-detect-cluster-failure/m-p/2712885#M713321</link>
      <description>I would expect both nodes to start complaining that the cluster lock disk is missing. Are there any messages in /var/adm/syslog/syslog.log from cmcld?&lt;BR /&gt;&lt;BR /&gt;Hilary&lt;BR /&gt;</description>
      <pubDate>Mon, 29 Apr 2002 09:58:58 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/cant-detect-cluster-failure/m-p/2712885#M713321</guid>
      <dc:creator>BFA6</dc:creator>
      <dc:date>2002-04-29T09:58:58Z</dc:date>
    </item>
    <item>
      <title>Re: Can't detect cluster failure</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/cant-detect-cluster-failure/m-p/2712886#M713322</link>
      <description>No, though as expected lots of SCSI errors on both nodes, eg:&lt;BR /&gt;&lt;BR /&gt;Apr 29 15:05:48 rac2 vmunix: SCSI: Read error -- dev: b 31 0x0d2000, errno: 126, resid: 2048,&lt;BR /&gt;Apr 29 15:05:48 rac2 vmunix: SCSI: Read error -- dev: b 31 0x0d4000, errno: 126, resid: 2048,&lt;BR /&gt;Apr 29 15:05:48 rac2 vmunix: SCSI: Read error -- dev: b 31 0x0c6000, errno: 126, resid: 2048,&lt;BR /&gt;Apr 29 15:05:48 rac2 vmunix: SCSI: Read error -- dev: b 31 0x0d8000, errno: 126, resid: 2048,&lt;BR /&gt;Apr 29 15:05:48 rac2 vmunix: SCSI: Read error -- dev: b 31 0x0c9000, errno: 126, resid: 2048,&lt;BR /&gt;Apr 29 15:05:48 rac2 vmunix: SCSI: Read error -- dev: b 31 0x0ca000, errno: 126, resid: 2048,&lt;BR /&gt;Apr 29 15:05:48 rac2 vmunix: SCSI: Read error -- dev: b 31 0x0cc000, errno: 126, resid: 2048,&lt;BR /&gt;Apr 29 15:05:48 rac2 vmunix: SCSI: Read error -- dev: b 31 0x0df000, errno: 126, resid: 2048,&lt;BR /&gt;Apr 29 15:05:53 rac2 vmunix: SCSI: Read error -- dev: b 31 0x0d0000, errno: 126, resid: 2048,&lt;BR /&gt;Apr 29 15:05:53 rac2 vmunix: SCSI: Read error -- dev: b 31 0x0d1000, errno: 126, resid: 2048,&lt;BR /&gt;&lt;BR /&gt;Apr 29 13:34:16 rac1 vmunix: SCSI: Read error -- dev: b 31 0x091000, errno: 126, resid: 2048,&lt;BR /&gt;Apr 29 13:34:16 rac1 vmunix: SCSI: Read error -- dev: b 31 0x092000, errno: 126, resid: 2048,&lt;BR /&gt;Apr 29 13:34:16 rac1 vmunix: SCSI: Read error -- dev: b 31 0x094000, errno: 126, resid: 2048,&lt;BR /&gt;Apr 29 13:34:16 rac1 vmunix: SCSI: Read error -- dev: b 31 0x086000, errno: 126, resid: 2048,&lt;BR /&gt;Apr 29 13:34:16 rac1 vmunix: SCSI: Read error -- dev: b 31 0x08a000, errno: 126, resid: 2048,&lt;BR /&gt;Apr 29 13:34:16 rac1 vmunix: SCSI: Read error -- dev: b 31 0x08c000, errno: 126, resid: 2048,&lt;BR /&gt;Apr 29 13:34:16 rac1 vmunix: SCSI: Read error -- dev: b 31 0x09f000, errno: 126, resid: 2048,&lt;BR /&gt;Apr 29 13:34:16 rac1 vmunix: SCSI: Read error -- dev: b 31 0x091000, errno: 126, resid: 
8192,&lt;BR /&gt;Apr 29 13:34:16 rac1 vmunix:    blkno: 1056088, sectno: 2112176, offset: 1081434112, bcount: 8192.&lt;BR /&gt;Apr 29 13:34:16 rac1 vmunix:    blkno: 8, sectno: 16, offset: 8192, bcount: 2048.&lt;BR /&gt;Apr 29 13:34:21 rac1  above message repeats 7 times</description>
      <pubDate>Mon, 29 Apr 2002 10:01:15 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/cant-detect-cluster-failure/m-p/2712886#M713322</guid>
      <dc:creator>Stephen Andreassend</dc:creator>
      <dc:date>2002-04-29T10:01:15Z</dc:date>
    </item>
    <item>
      <title>Re: Can't detect cluster failure</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/cant-detect-cluster-failure/m-p/2712887#M713323</link>
      <description>Plug in the disk array and away we go:&lt;BR /&gt;&lt;BR /&gt;Apr 29 14:07:44 rac1 cmcld: Cluster lock /dev/dsk/c9t0d0 is back on-line&lt;BR /&gt;&lt;BR /&gt;Still, the cluster view command said everything was healthy while the disks were disconnected for an hour, so no hardware failure detection was running.</description>
      <pubDate>Mon, 29 Apr 2002 10:35:25 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/cant-detect-cluster-failure/m-p/2712887#M713323</guid>
      <dc:creator>Stephen Andreassend</dc:creator>
      <dc:date>2002-04-29T10:35:25Z</dc:date>
    </item>
    <item>
      <title>Re: Can't detect cluster failure</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/cant-detect-cluster-failure/m-p/2712888#M713324</link>
      <description>Well, the test you have performed is not what I would call a valid test, as losing disc connectivity on BOTH nodes is a multiple failure, something SG is not designed to react to correctly.&lt;BR /&gt;Also, it is LVM that is monitoring the file systems etc., not SG.&lt;BR /&gt;To get the effect of forcing a failure, you need to set up the packages to monitor a resource, namely the discs themselves, using EMS.</description>
      <pubDate>Mon, 29 Apr 2002 12:21:35 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/cant-detect-cluster-failure/m-p/2712888#M713324</guid>
      <dc:creator>melvyn burnard</dc:creator>
      <dc:date>2002-04-29T12:21:35Z</dc:date>
    </item>
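    <!-- Editor's note: the EMS resource monitoring described above is configured in the ServiceGuard package configuration file. A minimal sketch, assuming a hypothetical volume group vgdata and the EMS HA disc monitor; the exact resource path and values depend on the monitor version installed:

    ```
    # Hypothetical excerpt from a ServiceGuard package configuration file.
    # The pv_summary resource goes out of its UP value when physical
    # volumes in the monitored volume group become unavailable, which
    # lets ServiceGuard fail the package over.
    RESOURCE_NAME                /vg/vgdata/pv_summary
    RESOURCE_POLLING_INTERVAL    60
    RESOURCE_START               AUTOMATIC
    RESOURCE_UP_VALUE            = UP
    ```

    After editing, the package configuration would be re-applied with cmapplyconf as usual. -->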
    <item>
      <title>Re: Can't detect cluster failure</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/cant-detect-cluster-failure/m-p/2712889#M713325</link>
      <description>A correction to the situation.&lt;BR /&gt;&lt;BR /&gt;The cables to the disks were pulled out on only one node; the other cluster node was still able to access the FC10 via the Brocade switch.&lt;BR /&gt;&lt;BR /&gt;The cluster process did report the loss of access to the cluster lock disk in the syslog; however, the cluster status was still reported as up, and Oracle did not shut down on the bad node despite its loss of disk access.&lt;BR /&gt;&lt;BR /&gt;This meant that Oracle sessions connected to the bad node simply hung until the Oracle instance was manually aborted. This could be a timeout-related issue.</description>
      <pubDate>Mon, 29 Apr 2002 12:55:39 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/cant-detect-cluster-failure/m-p/2712889#M713325</guid>
      <dc:creator>Stephen Andreassend</dc:creator>
      <dc:date>2002-04-29T12:55:39Z</dc:date>
    </item>
    <item>
      <title>Re: Can't detect cluster failure</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/cant-detect-cluster-failure/m-p/2712890#M713326</link>
      <description>Melvyn,&lt;BR /&gt;&lt;BR /&gt;If you have 2 paths to the shared disks from both nodes - primary &amp;amp; alternate link - and you pull both cables from one node to the shared disks, will that also count as a multiple failure?&lt;BR /&gt;&lt;BR /&gt;Hilary&lt;BR /&gt;</description>
      <pubDate>Mon, 29 Apr 2002 15:26:53 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/cant-detect-cluster-failure/m-p/2712890#M713326</guid>
      <dc:creator>BFA6</dc:creator>
      <dc:date>2002-04-29T15:26:53Z</dc:date>
    </item>
    <item>
      <title>Re: Can't detect cluster failure</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/cant-detect-cluster-failure/m-p/2712891#M713327</link>
      <description>Pulling BOTH links is NOT a SPOF, but an MPOF, or Multiple Point of Failure.&lt;BR /&gt;&lt;BR /&gt;Again, even if only one node lost all comms with the discs, this is not necessarily a ServiceGuard-protected event, unless you have set up a package to monitor the availability of the discs using EMS and HW or HA monitors.&lt;BR /&gt;One thing to remember: the LVM/SCSI code will see the discs as unavailable, but will attempt to retry EACH LUN until the PVtimeout value is reached, and then try the next PV.&lt;BR /&gt;&lt;BR /&gt;ServiceGuard does not protect against what you have done, as it relies on the LVM/disc technology to provide the high availability, either using mirroring via 2 separate paths, or using RAID via PVlinks.</description>
      <pubDate>Mon, 29 Apr 2002 19:14:28 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/cant-detect-cluster-failure/m-p/2712891#M713327</guid>
      <dc:creator>melvyn burnard</dc:creator>
      <dc:date>2002-04-29T19:14:28Z</dc:date>
    </item>
    <item>
      <title>Re: Can't detect cluster failure</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/cant-detect-cluster-failure/m-p/2712892#M713328</link>
      <description>What we want to see is the Oracle instance abort on the node with no disk access, so all future connections to Oracle are connected to the surviving node instead.&lt;BR /&gt;&lt;BR /&gt;Our problem is that the Oracle instance does not abort, so new connections just hang rather than get routed over to the other node.&lt;BR /&gt;&lt;BR /&gt;Are you saying that the only way to achieve this functionality is to set up EMS to detect disk access loss and run a script to force an abort of Oracle?&lt;BR /&gt;&lt;BR /&gt;thx&lt;BR /&gt;Steve</description>
      <pubDate>Mon, 29 Apr 2002 19:42:11 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/cant-detect-cluster-failure/m-p/2712892#M713328</guid>
      <dc:creator>Stephen Andreassend</dc:creator>
      <dc:date>2002-04-29T19:42:11Z</dc:date>
    </item>
    <item>
      <title>Re: Can't detect cluster failure</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/cant-detect-cluster-failure/m-p/2712893#M713329</link>
      <description>A monitored resource (EMS) is one way. However, I believe you're asking a higher-level question: you want to detect when Oracle is not responding correctly, including hangs. To do this you need to modify the instance monitoring script to connect to and select from Oracle AND make sure that script doesn't hang.&lt;BR /&gt;This can be done many ways, but one way is to kick off a background job that will time out. Meanwhile, your monitoring script checks Oracle's status. If Oracle checks out OK, then kill the background job. If the background job times out, it checks whether the Oracle status script is still running; if it is, then Oracle could be hung.&lt;BR /&gt;Sounds messy, doesn't it?</description>
      <pubDate>Mon, 29 Apr 2002 20:10:53 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/cant-detect-cluster-failure/m-p/2712893#M713329</guid>
      <dc:creator>Tim Clemens</dc:creator>
      <dc:date>2002-04-29T20:10:53Z</dc:date>
    </item>
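    <!-- Editor's note: the background-timeout pattern described above can be sketched as a small shell wrapper. This is a sketch only; the function name is hypothetical, and the placeholder commands below stand in for a real sqlplus status query:

    ```shell
    #!/bin/sh
    # Watchdog pattern: run a health check, but never let the
    # monitoring script itself hang. If the check exceeds the time
    # limit, the background watchdog kills it and we see a
    # non-zero status (128 + signal number).
    run_with_timeout() {
        limit=$1; shift
        "$@" &                      # start the health check in the background
        cmd=$!
        ( sleep "$limit"; kill "$cmd" 2>/dev/null ) &   # watchdog job
        dog=$!
        wait "$cmd"                 # check's exit status, or 143 if killed
        rc=$?
        kill "$dog" 2>/dev/null     # check finished in time; cancel watchdog
        return $rc
    }

    # Placeholder checks: "true" stands in for a query that answers,
    # "sleep 30" for one that hangs.
    run_with_timeout 5 true     && echo "oracle ok"
    run_with_timeout 1 sleep 30 || echo "oracle may be hung; abort instance"
    ```

    The monitoring service in the package control script would then act on a non-zero return, for example by aborting the instance so connections fail over. -->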
    <item>
      <title>Re: Can't detect cluster failure</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/cant-detect-cluster-failure/m-p/2712894#M713330</link>
      <description>We have an engineer from HP coming in who is a ServiceGuard specialist to configure EMS to detect disk failure and shut down Oracle.&lt;BR /&gt;&lt;BR /&gt;Thx&lt;BR /&gt;Steve</description>
      <pubDate>Sat, 11 May 2002 08:03:00 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/cant-detect-cluster-failure/m-p/2712894#M713330</guid>
      <dc:creator>Stephen Andreassend</dc:creator>
      <dc:date>2002-05-11T08:03:00Z</dc:date>
    </item>
  </channel>
</rss>

