Can't detect cluster failure
04-29-2002 02:45 AM
When we disconnect the shared disk array from both nodes to test a failure scenario, the cluster doesn't detect the error, and hence neither does Oracle's listener, since it depends on the cluster layer.
Is there a parameter that can influence the detection of hardware loss? Our cluster lock disk is now disconnected and neither node has halted the cluster automatically.
Thx
Steve
04-29-2002 02:58 AM
Re: Can't detect cluster failure
Hilary
04-29-2002 03:01 AM
Re: Can't detect cluster failure
Apr 29 15:05:48 rac2 vmunix: SCSI: Read error -- dev: b 31 0x0d2000, errno: 126, resid: 2048,
Apr 29 15:05:48 rac2 vmunix: SCSI: Read error -- dev: b 31 0x0d4000, errno: 126, resid: 2048,
Apr 29 15:05:48 rac2 vmunix: SCSI: Read error -- dev: b 31 0x0c6000, errno: 126, resid: 2048,
Apr 29 15:05:48 rac2 vmunix: SCSI: Read error -- dev: b 31 0x0d8000, errno: 126, resid: 2048,
Apr 29 15:05:48 rac2 vmunix: SCSI: Read error -- dev: b 31 0x0c9000, errno: 126, resid: 2048,
Apr 29 15:05:48 rac2 vmunix: SCSI: Read error -- dev: b 31 0x0ca000, errno: 126, resid: 2048,
Apr 29 15:05:48 rac2 vmunix: SCSI: Read error -- dev: b 31 0x0cc000, errno: 126, resid: 2048,
Apr 29 15:05:48 rac2 vmunix: SCSI: Read error -- dev: b 31 0x0df000, errno: 126, resid: 2048,
Apr 29 15:05:53 rac2 vmunix: SCSI: Read error -- dev: b 31 0x0d0000, errno: 126, resid: 2048,
Apr 29 15:05:53 rac2 vmunix: SCSI: Read error -- dev: b 31 0x0d1000, errno: 126, resid: 2048,
Apr 29 13:34:16 rac1 vmunix: SCSI: Read error -- dev: b 31 0x091000, errno: 126, resid: 2048,
Apr 29 13:34:16 rac1 vmunix: SCSI: Read error -- dev: b 31 0x092000, errno: 126, resid: 2048,
Apr 29 13:34:16 rac1 vmunix: SCSI: Read error -- dev: b 31 0x094000, errno: 126, resid: 2048,
Apr 29 13:34:16 rac1 vmunix: SCSI: Read error -- dev: b 31 0x086000, errno: 126, resid: 2048,
Apr 29 13:34:16 rac1 vmunix: SCSI: Read error -- dev: b 31 0x08a000, errno: 126, resid: 2048,
Apr 29 13:34:16 rac1 vmunix: SCSI: Read error -- dev: b 31 0x08c000, errno: 126, resid: 2048,
Apr 29 13:34:16 rac1 vmunix: SCSI: Read error -- dev: b 31 0x09f000, errno: 126, resid: 2048,
Apr 29 13:34:16 rac1 vmunix: SCSI: Read error -- dev: b 31 0x091000, errno: 126, resid: 8192,
Apr 29 13:34:16 rac1 vmunix: blkno: 1056088, sectno: 2112176, offset: 1081434112, bcount: 8192.
Apr 29 13:34:16 rac1 vmunix: blkno: 8, sectno: 16, offset: 8192, bcount: 2048.
Apr 29 13:34:21 rac1 above message repeats 7 times
04-29-2002 03:35 AM
Re: Can't detect cluster failure
Apr 29 14:07:44 rac1 cmcld: Cluster lock /dev/dsk/c9t0d0 is back on-line
Still, the cluster view command said everything was healthy when the disks were disconnected for an hour. So no hardware failure detection was running.
04-29-2002 05:21 AM
Re: Can't detect cluster failure
Also, it is LVM that monitors the file systems etc., not SG.
To get the effect of forcing a failure, you need to set up the packages to monitor a resource, namely the discs themselves, using EMS.
04-29-2002 05:55 AM
Re: Can't detect cluster failure
The cables to the disks were pulled out on only one node; the other cluster node was still able to access the FC10 via the Brocade switch.
The cluster process did report the loss of access to the cluster lock disk in the syslog. However, the cluster status was still reported as up, and Oracle did not shut down on the bad node despite its loss of disk access.
This meant that Oracle sessions connected to the bad node just hung until the Oracle instance was manually aborted. This could be a timeout-related issue.
04-29-2002 08:26 AM
Re: Can't detect cluster failure
If you have 2 paths to the shared disks from both nodes (primary and alternate link) and you pull both cables from one node to the shared disks, will that also count as a multiple failure?
Hilary
04-29-2002 12:14 PM
Solution
Again, even if only one node lost all comms with the discs, this is not necessarily a ServiceGuard-protected event, unless you have set up a package to monitor the availability of the discs using EMS and HW or HA monitors.
One thing to remember: the LVM/SCSI code will see the discs as unavailable, but will attempt to retry EACH LUN until the PVtimeout value is reached, and then try the next PV.
ServiceGuard does not protect against what you have done, as it relies on the LVM/Disc technology to provide the high availability, either using Mirroring via 2 separate paths, or using RAID via PVlinks.
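As a rough illustration of why the retry behaviour described in this reply can look like a long hang, here is a minimal Python sketch. It assumes a simplified model in which each unreachable PV is retried in turn until its own timeout expires, with no overlap between PVs. The function name and the example values (ten LUNs, 30 seconds each) are hypothetical; the real per-PV timeout is whatever is configured in LVM (on HP-UX, typically via `pvchange -t`).

```python
def worst_case_stall_seconds(pv_timeouts_s):
    """Upper bound on the stall if every PV is retried one after another.

    Simplified model (an assumption): retries are strictly sequential and
    each PV consumes its full timeout before the next PV is tried.
    """
    return sum(pv_timeouts_s)


# Hypothetical configuration: ten LUNs, each with a 30-second PV timeout.
timeouts = [30] * 10
print(worst_case_stall_seconds(timeouts))  # 300
```

Under this model, a single pass over the failed PVs already takes minutes, which would be consistent with sessions appearing to hang rather than fail quickly.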
04-29-2002 12:42 PM
Re: Can't detect cluster failure
Our problem is that the Oracle instance does not abort, so new connections just hang rather than getting routed over to the other node.
Are you saying that the only way to achieve this functionality is to set up EMS to detect disk access loss and run a script to force an abort of Oracle?
thx
Steve
04-29-2002 01:10 PM
Re: Can't detect cluster failure
This can be done many ways, but one way is to kick off a background job that will time out. Meanwhile, your monitoring script checks Oracle's status. If Oracle checks out OK, the monitoring script kills the background job. If the background job times out first, it checks whether the Oracle status script is still running. If it is, then Oracle could be hung.
Sounds messy, doesn't it?
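The timeout idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a ServiceGuard or Oracle-supplied mechanism: it collapses the "background job plus status script" pair into one function that runs a status check and gives up after a deadline. The function name `probe` and the stand-in commands are hypothetical; a real monitor would run its own Oracle status check in their place.

```python
import subprocess
import sys


def probe(cmd, timeout_s):
    """Run a status check command; return 'ok', 'failed', or 'hung'."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        rc = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Best-effort kill. We deliberately do not wait for the child here,
        # because a process blocked on dead disks may not exit promptly.
        proc.kill()
        return "hung"
    return "ok" if rc == 0 else "failed"


if __name__ == "__main__":
    # Stand-in commands (assumptions): replace these Python one-liners
    # with the site's own Oracle status check.
    quick = [sys.executable, "-c", "pass"]
    stuck = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(probe(quick, timeout_s=10))
    print(probe(stuck, timeout_s=1))
```

A package monitor could treat a "hung" result as the trigger to abort the instance. Whether that is safe depends on the site's setup, and the timeout should probably be longer than the stall that the LVM retries alone can cause.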