SecurePath failover time

 
NPD USER
Regular Advisor

We have a 6-node Oracle RAC 10g cluster on rx7620 servers, each with four 2 Gb HBAs. The servers connect to two MDS 9509 switches in a redundant fabric configuration, with two HBAs per server connected to each switch. The storage is an XP12000. Each HBA is zoned to two targets, for a total of 8 targets for the cluster and 8 paths per device. The cluster software is Oracle Clusterware, which uses a voting disk mechanism for cluster membership check-in. The rest of the Oracle data is managed by ASM on straight LUNs, with no volume manager. SecurePath v3.0F handles failover and load balancing, with the RR load-balancing policy. We have 3 voting disks in total, all accessible from both switches.
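For reference, here is how the path counts work out. This is only a sketch of the arithmetic, assuming the HBAs are split evenly across the two switches as described above. The function names are made up for illustration and are not part of SecurePath.

```python
# Path arithmetic for the configuration described in this post.
HBAS_PER_SERVER = 4   # four HBAs in each rx7620
TARGETS_PER_HBA = 2   # each HBA is zoned to two array targets
SWITCHES = 2          # two fabric switches, HBAs split evenly (assumption)


def paths_per_device(hbas, targets_per_hba):
    """Each HBA sees the device once per zoned target."""
    return hbas * targets_per_hba


def surviving_paths(hbas, targets_per_hba, switches, failed_switches):
    """Paths left per device after losing whole switches.

    Assumes the HBAs are divided evenly across the switches.
    """
    hbas_per_switch = hbas // switches
    live_hbas = hbas - hbas_per_switch * failed_switches
    return paths_per_device(live_hbas, targets_per_hba)


total = paths_per_device(HBAS_PER_SERVER, TARGETS_PER_HBA)
left = surviving_paths(HBAS_PER_SERVER, TARGETS_PER_HBA, SWITCHES, 1)
print(f"paths per device: {total}, after one switch fails: {left}")
```

So every device, voting disks included, should still have 4 live paths per server when one switch is down.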

During HA testing, one test was to power down one of the fabric switches and confirm that the cluster kept its integrity. This takes out two of the four HBAs per server. SecurePath should handle the failover. However, the cluster crashed, reporting that the voting disks were unavailable, even the ones that were being accessed over a path on the active switch. Syslog did show the paths failing over to alternate paths.

I believe the SecurePath failover for the devices took too long and caused the CSS service to time out; the CSS log indicated no response from the voting disk in 23 seconds. The paths that had not failed should have been fine. Is it possible that the number of devices that had to fail over slowed the I/O on the paths that were still active?
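One way to check whether I/O really stalls during the switch power-down would be to time reads against a voting disk while the test runs. The sketch below is an assumption about how that could be done, not a SecurePath or Oracle tool. The function names, the 8192-byte block size and the idea of comparing against the 23 seconds from the CSS log are my own choices. It should be pointed at the raw (character) device so reads are not served from the buffer cache, and the block size may need to match the device's requirements.

```python
import os
import time


def probe_read_latency(path, samples=5, block_size=8192, interval=0.0):
    """Read one block from the start of `path` repeatedly.

    Returns a list of per-read latencies in seconds.
    """
    latencies = []
    fd = os.open(path, os.O_RDONLY)
    try:
        for _ in range(samples):
            os.lseek(fd, 0, os.SEEK_SET)       # always read the same block
            start = time.monotonic()
            os.read(fd, block_size)
            latencies.append(time.monotonic() - start)
            time.sleep(interval)               # pause between samples
    finally:
        os.close(fd)
    return latencies


def slow_reads(latencies, threshold):
    """Return only the latencies that exceeded `threshold` seconds."""
    return [t for t in latencies if t > threshold]


# Example (the device path is a placeholder, substitute a real voting disk):
# lat = probe_read_latency("<raw voting disk device>", samples=60, interval=1.0)
# print(slow_reads(lat, threshold=23.0))
```

If reads that land on the surviving paths also show multi-second latencies during the failover, that would support the idea that the failover itself is holding up all I/O to the device.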

Is there a way to change the device timeouts via SecurePath? This is the HPswsp driver, and there is no timeout setting in the 'autopath set' command. Also, are there any other settings that may help, such as increasing the SCSI queue_depth?