Operating System - HP-UX

Alternate Path failover - doesn't always work

 
Stack
Occasional Advisor

Alternate Path failover - doesn't always work

This issue seems to have been out there for a few years now, and I just came across it again.

Using alternate path failover on a SAN doesn't provide protection for all potential failures. For example, if I pull the fibre channel cable on the storage side of a switch, PV Links does not failover to the alternate path. If I pull the cable on the server side of the switch, it does failover.

Does anyone know of a patch that may address this, or a whitepaper documenting this? Third-party products like PowerPath do not have this issue, and it seems like a correctable problem.
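For readers unfamiliar with PV Links: the alternate path is configured simply by extending the volume group with the second device file that reaches the same LUN. A minimal sketch, assuming hypothetical device files and VG name (these commands only apply on an HP-UX host):

```shell
# Primary link is already in the VG; add the alternate device file
# that reaches the same LUN over the second fabric/path.
vgextend /dev/vg01 /dev/dsk/c7t0d1

# Verify: the second path should be listed as an "Alternate Link".
vgdisplay -v /dev/vg01 | grep "PV Name"
```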

Thanks

Scott Riley
Stack Computer, Inc.
5 REPLIES
Michael Steele_2
Honored Contributor

Re: Alternate Path failover - doesn't always work

'fcmsutil' doesn't see this because its domain stops at the HBA; it needs the cooperation of the switch to see what's beyond it. HBAs only autosense the SAN topology and have nothing to do with the fabric configuration, which is overseen by the switch.
Support Fatherhood - Stop Family Law
Jeff Schussele
Honored Contributor

Re: Alternate Path failover - doesn't always work

Hi Scott,

Appears to me this is a fundamental flaw in your SAN design.
Not only do you want the HBAs going to separate switch ports, but those switch ports should have *separate* paths from the switch to the array.
It sounds like you still have a SPOF (the switch -> array path in this case) that needs to be eliminated. Doesn't have to be dedicated (i.e. you can share those paths with other hosts) but MUST be a separate path from the other HBA in this system.

Rgds,
Jeff
PERSEVERANCE -- Remember, whatever does not kill you only makes you stronger!
Stack
Occasional Advisor

Re: Alternate Path failover - doesn't always work

Michael,

The TD driver does perform a login to the storage device, so it at least knows which devices it has access to on the SAN. But I can see a difference in what the driver sees between losing the HBA->switch link and losing the switch->storage link. With a failure on the HBA side of the switch, the driver senses the loss immediately. With a failure on the storage side of the switch, the HBA must not be sensing any failure in its own link (it still has link -- to the switch), and therefore PV Links must rely on an I/O timeout rather than a hard error.
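If the storage-side failure really is only detectable via I/O timeout, one possible mitigation (an assumption on my part, not a confirmed fix for this bug) is to set an explicit, shorter per-PV timeout so LVM gives up on the dead path sooner. Device file below is hypothetical:

```shell
# Set an explicit 30-second I/O timeout on the physical volume so a
# silent path failure is declared dead sooner than the driver default.
pvchange -t 30 /dev/dsk/c5t0d1

# Confirm the new value ("IO Timeout (Seconds)" field).
pvdisplay /dev/dsk/c5t0d1 | grep -i timeout
```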

Jeff,

There's no SPOF in the design: dual independent fabrics, dual HBAs, dual storage controllers.

I'm running the test with an open telnet session to each of the Brocade switches. Using portperfshow, I can watch the flow of data to the storage devices in real time. Pull the cable on the host side, and the alternate path takes over fairly quickly, in about 30 seconds. Pull the cable on the storage side (of one of the *two* paths to the LUNs), and I/O just stops. It seems that PV Links is waiting to time out because there is no hard error to act upon.
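For anyone reproducing this test, the switch side can be watched from the Brocade Fabric OS CLI like this (the port number is an example; run on each switch in the failing path):

```shell
# Watch per-port throughput, refreshing every 5 seconds; traffic on the
# pulled path should drop to zero and reappear on the alternate port.
portperfshow 5

# Check the link state of the pulled port (here, port 4).
portshow 4
```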



Michael Steele_2
Honored Contributor

Re: Alternate Path failover - doesn't always work

In general I agree with you: hop count, path determination, circuit testing -- these are all simple enough to add in. (* Remember Euler circuits from discrete math. *)

I thought about this a lot yesterday, and the only reasons for the failure I could come up with had to do with marketing and sales (* buy another utility *) or legacy (* we've always done it this way *). For example, a point-to-point topology doesn't need a switch and doesn't fit into this problem, and point-to-point existed before fabric.

I'd be interested to know if 'AutoPath' also did this.

Good luck with this worthy endeavor!
Support Fatherhood - Stop Family Law
Unix Administrator_5
Frequent Advisor

Re: Alternate Path failover - doesn't always work

We recently had a problem similar to this. Whether we pull the cable from the system side, the switch side, or the storage side, the failover to an alternate path fails about every third or fourth time. Even more serious is that it hangs the system.

After two months of testing, crash dumps, and vendor visits, HP found a bug in lvmkmd that has existed at least since 11.0.

They plan to fix it in 11.23, but there are no plans to patch earlier releases.