
LUN failover causing stale NFS mount points

 
Guy Beukes
Advisor

LUN failover causing stale NFS mount points

Hello,

I'm not sure exactly what category to place this under because it covers a host of issues.
We have quite a few installations that consist of an EVA5000 SAN hanging off two Itanium HP-UX servers. The software we use runs within Serviceguard and has soft NFS mounts to volume groups made from the disk groups presented by the SAN.

Unfortunately we occasionally experience power, switch or other failures that cause the LUN paths to fail over. On the Unix boxes we use a combination of Secure Path and PVLinks, and when the LUNs fail over the VGs start using the alternate links automatically, as they should. But this leaves stale NFS mounts, and the resulting NFS write failures end up consuming all of the CPU. We then have to move the package to the alternate node and restart the affected server.
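
For reference, the mounts are created along these lines (the server name is real, the export and mount point paths are just placeholders):

# soft NFS mount of a package filesystem (paths are placeholders)
mount -F nfs -o soft may_ims4s:/export/data /import/data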

Once the package has been moved, the mount points should have been removed from the second server, but they're still visible in the output of the mount command.
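
For what it's worth, the obvious manual cleanup would be something along these lines (mount point path is again a placeholder), though I'm not sure it's the right approach against a stale soft mount:

# confirm the stale entry is still listed on the old node
mount | grep may_ims4s
# report any processes still holding the mount point (add -k to kill them)
fuser -cu /import/data
# then try to unmount it
umount /import/data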

Some snapshot messages from dmesg:
CPQswsp: Path c12t1d5 Failed (LUN 600508B40010344300007000014F0000 Controller P5849E1AAR5062 Array 50001FE15004A740 HBA fcd2)
CPQswsp: Path c12t1d6 Failed (LUN 600508B4001034430000700001550000 Controller P5849E1AAR5062 Array 50001FE15004A740 HBA fcd2)

CPQswsp: Availability for LUN 600508B4001034430000700001970000 changed to Reduced
CPQswsp: Availability for LUN 600508B40010344300007000019A0000 changed to Reduced

0/4/2/0: Device at device id 0xc0400 has disappeared from Name Server GPN_FT
(FCP type) response, or its 'Port World-Wide Name' has changed.
device id = loop id, for private loop devices
device id = nport ID, for fabric/public-loop devices
System won't be able to see LUNs behind this port.

CPQswsp: Availability for LUN 600508B4001034430000700001020000 changed to Alive
CPQswsp: Availability for LUN 600508B4001034430000700001090000 changed to Alive

NFS write failed for server may_ims4s: RPC: Timed out
NFS write failed for server may_ims4s: RPC: Timed out

As you can see, this is where it starts timing out.

So my questions are:

Is there a reason NFS experiences these problems, and is there a way to avoid them?

When it occurs, the only solution I have found online is to restart the server. Does anyone know of a better one?

Lastly, what are the safest settings to use for NFS mounts in this situation?
5 REPLIES
Steven E. Protter
Exalted Contributor

Re: LUN failover causing stale NFS mount points

Shalom,

This is most likely caused by your logical volume layout and general configuration.

1) Your systems, including disk, should be able to endure short power fluctuations without any loss of service.

2) Take a look at the lvdisplay -v output and the LVM configuration to find the problem. If you have done something like stripe a logical volume across LUNs, don't do that (see the quick check sketched after this list). Let the EVA handle disk configuration; don't waste server CPU and I/O on that.

3) This problem should be avoidable with a proper LVM layout and NFS configuration.
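
For example, a quick check along these lines (volume group and LV names are illustrative) will show whether a logical volume is striped:

# a "Stripes" value greater than 1 means the LV is striped across PVs
lvdisplay -v /dev/vg01/lvol1 | grep -i stripe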

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
Stephen Doud
Honored Contributor

Re: LUN failover causing stale NFS mount points

Does your Serviceguard configuration include the HA-NFS toolkit?

If so, what version?

Also, what version of HP-UX and Serviceguard are you using?
Guy Beukes
Advisor

Re: LUN failover causing stale NFS mount points

Hello Chaps,

HPUXBaseOS B.11.23 HP-UX Base OS
T1905BA A.11.16.00 Serviceguard
As for NFS, I'm pretty sure it's NFSv3, the latest version supported by HP-UX, but if somebody can let me know how to determine this precisely, I'll post the results.
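
(From what I've read, something like the following should report the version actually negotiated for each mount; I'll run it and post the output:)

# show the options in effect for each NFS mount, including vers=
nfsstat -m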

The LVM is correctly laid out on the system, and I've checked the lvdisplay output to confirm that nothing is striped across the LUNs.

I presume that the system is correctly configured because this only occurs when there's a LUN failover on the Unix boxes.
Michael Steele_2
Honored Contributor

Re: LUN failover causing stale NFS mount points

HP provides a MC/ServiceGuard NFS Toolkit for both 11i v1 and 11i v2. Right now the latest release for 11i v2 can be found at this link:

Serviceguard NFS Toolkit A.11.11.07 and A.11.23.06 Release Notes

http://docs.hp.com/en/ha.html#Highly%20Available%20NFS

While reading the A.11.23.06 release notes, the symptoms you are describing appear under the 'Known Problems and Workarounds' heading. For example:

Cause
HA/NFS uses relocatable IP addresses that migrate from a primary NFS server to an
adoptive node during a fail-over event. If you start AutoFS on an HA/NFS server after
Serviceguard is running, these relocatable IP addresses are included in AutoFS's list of local IP
interfaces. AutoFS map entries referencing these relocatable addresses cause AutoFS to
create LOFS mounts to the exported local filesystems. These LOFS mounts can create
problems with Serviceguard, because HA/NFS does not unmount LOFS filesystems before a
package migration. If AutoFS-managed LOFS mounts are holding resources in the
exported filesystems when Serviceguard tries to initiate a package failover,
Serviceguard may be unable to successfully unmount the local filesystems and
migrate them to the adoptive node.
Workaround
The workaround for this problem is to start automountd with the -L option. This can be
done by setting the AUTOMOUNTD_OPTIONS variable in the /etc/rc.config.d/nfsconf file.
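
In practice that means an entry along these lines in /etc/rc.config.d/nfsconf (a sketch assuming the standard HP-UX startup configuration; the NFS client subsystem must be restarted for it to take effect):

# /etc/rc.config.d/nfsconf
# start automountd with -L so the relocatable package IPs are not treated
# as local interfaces, avoiding the LOFS mounts described above
AUTOMOUNTD_OPTIONS="-L"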
Support Fatherhood - Stop Family Law
Guy Beukes
Advisor

Re: LUN failover causing stale NFS mount points

Thanks for the information. I'm not sure this is the exact fix for the problem, because Serviceguard is able to move the package as well as the volume groups to the new server. Service is completely restored, except that the NFS mounts are still displayed on the other server even though they're not accessible.