<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Can't detect cluster failure in Operating System - HP-UX</title>
    <link>https://community.hpe.com/t5/operating-system-hp-ux/cant-detect-cluster-failure/m-p/2712889#M713325</link>
    <description>A correction to the situation.&lt;BR /&gt;&lt;BR /&gt;The cables to the disks were pulled out on only one node; the other cluster node was still able to access the FC10 via the Brocade switch.&lt;BR /&gt;&lt;BR /&gt;The cluster process did report the loss of access to the cluster lock disk in the syslog; however, the cluster status was still reported as up, and Oracle did not shut down on the bad node despite its loss of disk access.&lt;BR /&gt;&lt;BR /&gt;This meant that Oracle sessions connected to the bad node simply hung until the Oracle instance was manually aborted. This could be a timeout-related issue.</description>
    <pubDate>Mon, 29 Apr 2002 12:55:39 GMT</pubDate>
    <dc:creator>Stephen Andreassend</dc:creator>
    <dc:date>2002-04-29T12:55:39Z</dc:date>
    <item>
      <title>Can't detect cluster failure</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/cant-detect-cluster-failure/m-p/2712884#M713320</link>
      <description>We have SG OPS 11.13 and Oracle 9i RAC installed.&lt;BR /&gt;&lt;BR /&gt;When we disconnect the shared disk array from both nodes to test a failure scenario, the cluster doesn't detect the error, and hence neither does Oracle's listener, since it is dependent on the cluster layer.&lt;BR /&gt;&lt;BR /&gt;Is there a parameter that can influence the detection of hardware loss? Our cluster lock disk is now disconnected, and neither node has halted the cluster automatically.&lt;BR /&gt;&lt;BR /&gt;Thx&lt;BR /&gt;Steve</description>
      <pubDate>Mon, 29 Apr 2002 09:45:46 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/cant-detect-cluster-failure/m-p/2712884#M713320</guid>
      <dc:creator>Stephen Andreassend</dc:creator>
      <dc:date>2002-04-29T09:45:46Z</dc:date>
    </item>
    <item>
      <title>Re: Can't detect cluster failure</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/cant-detect-cluster-failure/m-p/2712885#M713321</link>
      <description>I would expect both nodes to start complaining that the cluster lock disk is missing. Are there any messages in /var/adm/syslog/syslog.log from cmcld?&lt;BR /&gt;&lt;BR /&gt;Hilary&lt;BR /&gt;</description>
      <pubDate>Mon, 29 Apr 2002 09:58:58 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/cant-detect-cluster-failure/m-p/2712885#M713321</guid>
      <dc:creator>BFA6</dc:creator>
      <dc:date>2002-04-29T09:58:58Z</dc:date>
    </item>
    <item>
      <title>Re: Can't detect cluster failure</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/cant-detect-cluster-failure/m-p/2712886#M713322</link>
      <description>No, though as expected lots of SCSI errors on both nodes, eg:&lt;BR /&gt;&lt;BR /&gt;Apr 29 15:05:48 rac2 vmunix: SCSI: Read error -- dev: b 31 0x0d2000, errno: 126, resid: 2048,&lt;BR /&gt;Apr 29 15:05:48 rac2 vmunix: SCSI: Read error -- dev: b 31 0x0d4000, errno: 126, resid: 2048,&lt;BR /&gt;Apr 29 15:05:48 rac2 vmunix: SCSI: Read error -- dev: b 31 0x0c6000, errno: 126, resid: 2048,&lt;BR /&gt;Apr 29 15:05:48 rac2 vmunix: SCSI: Read error -- dev: b 31 0x0d8000, errno: 126, resid: 2048,&lt;BR /&gt;Apr 29 15:05:48 rac2 vmunix: SCSI: Read error -- dev: b 31 0x0c9000, errno: 126, resid: 2048,&lt;BR /&gt;Apr 29 15:05:48 rac2 vmunix: SCSI: Read error -- dev: b 31 0x0ca000, errno: 126, resid: 2048,&lt;BR /&gt;Apr 29 15:05:48 rac2 vmunix: SCSI: Read error -- dev: b 31 0x0cc000, errno: 126, resid: 2048,&lt;BR /&gt;Apr 29 15:05:48 rac2 vmunix: SCSI: Read error -- dev: b 31 0x0df000, errno: 126, resid: 2048,&lt;BR /&gt;Apr 29 15:05:53 rac2 vmunix: SCSI: Read error -- dev: b 31 0x0d0000, errno: 126, resid: 2048,&lt;BR /&gt;Apr 29 15:05:53 rac2 vmunix: SCSI: Read error -- dev: b 31 0x0d1000, errno: 126, resid: 2048,&lt;BR /&gt;&lt;BR /&gt;Apr 29 13:34:16 rac1 vmunix: SCSI: Read error -- dev: b 31 0x091000, errno: 126, resid: 2048,&lt;BR /&gt;Apr 29 13:34:16 rac1 vmunix: SCSI: Read error -- dev: b 31 0x092000, errno: 126, resid: 2048,&lt;BR /&gt;Apr 29 13:34:16 rac1 vmunix: SCSI: Read error -- dev: b 31 0x094000, errno: 126, resid: 2048,&lt;BR /&gt;Apr 29 13:34:16 rac1 vmunix: SCSI: Read error -- dev: b 31 0x086000, errno: 126, resid: 2048,&lt;BR /&gt;Apr 29 13:34:16 rac1 vmunix: SCSI: Read error -- dev: b 31 0x08a000, errno: 126, resid: 2048,&lt;BR /&gt;Apr 29 13:34:16 rac1 vmunix: SCSI: Read error -- dev: b 31 0x08c000, errno: 126, resid: 2048,&lt;BR /&gt;Apr 29 13:34:16 rac1 vmunix: SCSI: Read error -- dev: b 31 0x09f000, errno: 126, resid: 2048,&lt;BR /&gt;Apr 29 13:34:16 rac1 vmunix: SCSI: Read error -- dev: b 31 0x091000, errno: 126, resid: 
8192,&lt;BR /&gt;Apr 29 13:34:16 rac1 vmunix:    blkno: 1056088, sectno: 2112176, offset: 1081434112, bcount: 8192.&lt;BR /&gt;Apr 29 13:34:16 rac1 vmunix:    blkno: 8, sectno: 16, offset: 8192, bcount: 2048.&lt;BR /&gt;Apr 29 13:34:21 rac1  above message repeats 7 times</description>
      <pubDate>Mon, 29 Apr 2002 10:01:15 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/cant-detect-cluster-failure/m-p/2712886#M713322</guid>
      <dc:creator>Stephen Andreassend</dc:creator>
      <dc:date>2002-04-29T10:01:15Z</dc:date>
    </item>
    <item>
      <title>Re: Can't detect cluster failure</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/cant-detect-cluster-failure/m-p/2712887#M713323</link>
      <description>Plug in the disk array and away we go:&lt;BR /&gt;&lt;BR /&gt;Apr 29 14:07:44 rac1 cmcld: Cluster lock /dev/dsk/c9t0d0 is back on-line&lt;BR /&gt;&lt;BR /&gt;Still, the cluster view command said everything was healthy while the disks were disconnected for an hour, so no hardware failure detection was running.</description>
      <pubDate>Mon, 29 Apr 2002 10:35:25 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/cant-detect-cluster-failure/m-p/2712887#M713323</guid>
      <dc:creator>Stephen Andreassend</dc:creator>
      <dc:date>2002-04-29T10:35:25Z</dc:date>
    </item>
    <item>
      <title>Re: Can't detect cluster failure</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/cant-detect-cluster-failure/m-p/2712888#M713324</link>
      <description>Well, the test you have performed is not what I would call a valid test, as losing disc connectivity on BOTH nodes is a multiple failure, something SG is not designed to react to correctly.&lt;BR /&gt;Also, it is LVM that is monitoring the file systems etc., not SG.&lt;BR /&gt;To get the effect of forcing a failure, you need to set up the packages to monitor a resource, namely the discs themselves, using EMS.</description>
      <pubDate>Mon, 29 Apr 2002 12:21:35 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/cant-detect-cluster-failure/m-p/2712888#M713324</guid>
      <dc:creator>melvyn burnard</dc:creator>
      <dc:date>2002-04-29T12:21:35Z</dc:date>
    </item>
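    <!-- Editor's note: the EMS resource monitoring described above is configured in the ServiceGuard package configuration file. A minimal sketch, assuming a hypothetical volume group vgdata and the EMS HA disc monitor; the exact resource path and values depend on the monitor version installed:

    ```
    # Hypothetical excerpt from a ServiceGuard package configuration file.
    # The pv_summary resource goes out of its UP value when physical
    # volumes in the monitored volume group become unavailable, which
    # lets ServiceGuard fail the package over.
    RESOURCE_NAME                /vg/vgdata/pv_summary
    RESOURCE_POLLING_INTERVAL    60
    RESOURCE_START               AUTOMATIC
    RESOURCE_UP_VALUE            = UP
    ```

    After editing, the package configuration would be re-applied with cmapplyconf as usual. -->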
    <item>
      <title>Re: Can't detect cluster failure</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/cant-detect-cluster-failure/m-p/2712889#M713325</link>
      <description>A correction to the situation.&lt;BR /&gt;&lt;BR /&gt;The cables to the disks were pulled out on only one node; the other cluster node was still able to access the FC10 via the Brocade switch.&lt;BR /&gt;&lt;BR /&gt;The cluster process did report the loss of access to the cluster lock disk in the syslog; however, the cluster status was still reported as up, and Oracle did not shut down on the bad node despite its loss of disk access.&lt;BR /&gt;&lt;BR /&gt;This meant that Oracle sessions connected to the bad node simply hung until the Oracle instance was manually aborted. This could be a timeout-related issue.</description>
      <pubDate>Mon, 29 Apr 2002 12:55:39 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/cant-detect-cluster-failure/m-p/2712889#M713325</guid>
      <dc:creator>Stephen Andreassend</dc:creator>
      <dc:date>2002-04-29T12:55:39Z</dc:date>
    </item>
    <item>
      <title>Re: Can't detect cluster failure</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/cant-detect-cluster-failure/m-p/2712890#M713326</link>
      <description>Melvyn,&lt;BR /&gt;&lt;BR /&gt;If you have 2 paths to the shared disks from both nodes - primary &amp;amp; alternate link - and you pull both cables from one node to the shared disks, will that also count as a multiple failure?&lt;BR /&gt;&lt;BR /&gt;Hilary&lt;BR /&gt;</description>
      <pubDate>Mon, 29 Apr 2002 15:26:53 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/cant-detect-cluster-failure/m-p/2712890#M713326</guid>
      <dc:creator>BFA6</dc:creator>
      <dc:date>2002-04-29T15:26:53Z</dc:date>
    </item>
    <item>
      <title>Re: Can't detect cluster failure</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/cant-detect-cluster-failure/m-p/2712891#M713327</link>
      <description>Pulling BOTH links is NOT a SPOF, but an MPOF, or Multiple Point of Failure.&lt;BR /&gt;&lt;BR /&gt;Again, even if only one node lost all comms with the discs, this is not necessarily a ServiceGuard-protected event, unless you have set up a package to monitor the availability of the discs using EMS and HW or HA monitors.&lt;BR /&gt;One thing to remember: the LVM/SCSI code will see the discs as unavailable, but will attempt to retry EACH LUN until the PVtimeout value is reached, and then try the next PV.&lt;BR /&gt;&lt;BR /&gt;ServiceGuard does not protect against what you have done, as it relies on the LVM/disc technology to provide the high availability, either using mirroring via 2 separate paths, or using RAID via PVlinks.</description>
      <pubDate>Mon, 29 Apr 2002 19:14:28 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/cant-detect-cluster-failure/m-p/2712891#M713327</guid>
      <dc:creator>melvyn burnard</dc:creator>
      <dc:date>2002-04-29T19:14:28Z</dc:date>
    </item>
    <item>
      <title>Re: Can't detect cluster failure</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/cant-detect-cluster-failure/m-p/2712892#M713328</link>
      <description>What we want to see is the Oracle instance abort on the node with no disk access, so all future connections to Oracle are connected to the surviving node instead.&lt;BR /&gt;&lt;BR /&gt;Our problem is that the Oracle instance does not abort, so new connections just hang rather than get routed over to the other node.&lt;BR /&gt;&lt;BR /&gt;Are you saying that the only way to achieve this functionality is to set up EMS to detect disk access loss and run a script to force an abort of Oracle?&lt;BR /&gt;&lt;BR /&gt;thx&lt;BR /&gt;Steve</description>
      <pubDate>Mon, 29 Apr 2002 19:42:11 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/cant-detect-cluster-failure/m-p/2712892#M713328</guid>
      <dc:creator>Stephen Andreassend</dc:creator>
      <dc:date>2002-04-29T19:42:11Z</dc:date>
    </item>
    <item>
      <title>Re: Can't detect cluster failure</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/cant-detect-cluster-failure/m-p/2712893#M713329</link>
      <description>A monitored resource (EMS) is one way. However, I believe you're asking a higher-level question: you want to detect when Oracle is not responding correctly, including hangs. To do this you need to modify the instance monitoring script to connect to and select from Oracle AND make sure that script doesn't hang.&lt;BR /&gt;This can be done many ways, but one way is to kick off a background job that will time out. Meanwhile, your monitoring script checks Oracle's status. If Oracle checks out OK, then kill the background job. If the background job times out, it checks whether the Oracle status script is still running; if it is, then Oracle could be hung.&lt;BR /&gt;Sounds messy, doesn't it?</description>
      <pubDate>Mon, 29 Apr 2002 20:10:53 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/cant-detect-cluster-failure/m-p/2712893#M713329</guid>
      <dc:creator>Tim Clemens</dc:creator>
      <dc:date>2002-04-29T20:10:53Z</dc:date>
    </item>
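    <!-- Editor's note: the background-timeout pattern described above can be sketched as a small shell wrapper. This is a sketch only; the function name is hypothetical, and the placeholder commands below stand in for a real sqlplus status query:

    ```shell
    #!/bin/sh
    # Watchdog pattern: run a health check, but never let the
    # monitoring script itself hang. If the check exceeds the time
    # limit, the background watchdog kills it and we see a
    # non-zero status (128 + signal number).
    run_with_timeout() {
        limit=$1; shift
        "$@" &                      # start the health check in the background
        cmd=$!
        ( sleep "$limit"; kill "$cmd" 2>/dev/null ) &   # watchdog job
        dog=$!
        wait "$cmd"                 # check's exit status, or 143 if killed
        rc=$?
        kill "$dog" 2>/dev/null     # check finished in time; cancel watchdog
        return $rc
    }

    # Placeholder checks: "true" stands in for a query that answers,
    # "sleep 30" for one that hangs.
    run_with_timeout 5 true     && echo "oracle ok"
    run_with_timeout 1 sleep 30 || echo "oracle may be hung; abort instance"
    ```

    The monitoring service in the package control script would then act on a non-zero return, for example by aborting the instance so connections fail over. -->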
    <item>
      <title>Re: Can't detect cluster failure</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/cant-detect-cluster-failure/m-p/2712894#M713330</link>
      <description>We have an engineer from HP coming in who is a ServiceGuard specialist to configure EMS to detect disk failure and shut down Oracle.&lt;BR /&gt;&lt;BR /&gt;Thx&lt;BR /&gt;Steve</description>
      <pubDate>Sat, 11 May 2002 08:03:00 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/cant-detect-cluster-failure/m-p/2712894#M713330</guid>
      <dc:creator>Stephen Andreassend</dc:creator>
      <dc:date>2002-05-11T08:03:00Z</dc:date>
    </item>
  </channel>
</rss>

