<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: NFS VIP problem in Operating System - HP-UX</title>
    <link>https://community.hpe.com/t5/operating-system-hp-ux/nfs-vip-problem/m-p/4406820#M536076</link>
    <description>Shalom,&lt;BR /&gt;&lt;BR /&gt;The key reading for us insomniacs is:&lt;BR /&gt;&lt;BR /&gt;"and after, say, 10 minutes it succeeds."&lt;BR /&gt;&lt;BR /&gt;There is a delay in the server coming online.&lt;BR /&gt;&lt;BR /&gt;Or we are going to the wrong server because of an ARP cache entry, and after the cache is flushed the system is forced to get fresh information for the cache.&lt;BR /&gt;&lt;BR /&gt;Specific answers:&lt;BR /&gt;1) Yes, your theory is based on solid data. If those sockets were closed, failover might be faster.&lt;BR /&gt;2) Close the connections. NFS v4 has a different and better locking mechanism. To use NFS v4 you need 11.31. Also, you might do better with a simple NAS device, which is simpler to administer and better equipped for this job.&lt;BR /&gt;&lt;BR /&gt;3) fuser -cu /filesystem_name&lt;BR /&gt;This will show you which processes have files open on the filesystem. That will help in process identification.&lt;BR /&gt;&lt;BR /&gt;4) Try netstat -an | grep 2049. Also, don't forget NFS opens a socket on a random port in version 3 and below, which is why it's so much fun to open up NFS on a firewall.&lt;BR /&gt;&lt;BR /&gt;SEP</description>
    <pubDate>Thu, 23 Apr 2009 16:17:40 GMT</pubDate>
    <dc:creator>Steven E. Protter</dc:creator>
    <dc:date>2009-04-23T16:17:40Z</dc:date>
    <item>
      <title>NFS VIP problem</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/nfs-vip-problem/m-p/4406818#M536074</link>
      <description>Hi,&lt;BR /&gt;Some reading for the insomniacs:&lt;BR /&gt;In order to share some filesystems among several servers, we are using a script derived from those in "MC/ServiceGuard NFS" (&lt;A href="http://docs.hp.com/en/ha.html#Highly%20Available%20NFS" target="_blank"&gt;http://docs.hp.com/en/ha.html#Highly%20Available%20NFS&lt;/A&gt;). That is:&lt;BR /&gt;- we use a VIP that is assigned to one of the nodes&lt;BR /&gt;- that node activates the VG, mounts its filesystems locally and exports them to the rest&lt;BR /&gt;- the rest of the nodes (and the node itself) mount those filesystems from the VIP via NFS, as in the "Server-to-Server Cross-Mounts" option in ServiceGuard NFS.&lt;BR /&gt;&lt;BR /&gt;The SERVER or CLIENT roles of the nodes can be switched using this script (attached).&lt;BR /&gt;The problem is: from time to time, after a role change, the clients are not able to mount the remote shares via NFS.&lt;BR /&gt;For instance, if the node names are e5 and e6 and:&lt;BR /&gt;- e6 is acting both as server &amp;amp; client&lt;BR /&gt;- e5 is acting as client&lt;BR /&gt;then, after issuing:&lt;BR /&gt;e5: &lt;BR /&gt;./nfs_Catastro.cntl stop_client&lt;BR /&gt;&lt;BR /&gt;e6: &lt;BR /&gt;./nfs_Catastro.cntl stop_client&lt;BR /&gt;./nfs_Catastro.cntl stop_server&lt;BR /&gt;&lt;BR /&gt;e5:&lt;BR /&gt;./nfs_Catastro.cntl start_server&lt;BR /&gt;./nfs_Catastro.cntl start_client&lt;BR /&gt;&lt;BR /&gt;e6:&lt;BR /&gt;./nfs_Catastro.cntl start_client&lt;BR /&gt;&lt;BR /&gt;the latter start_client fails to mount the first remote share right away (a manual "mount colada_nfs:/u6_local /u6" also fails).&lt;BR /&gt;It keeps trying and issuing messages like:&lt;BR /&gt;NFS server colada_nfs not responding still trying&lt;BR /&gt;NFS server colada_nfs not responding still trying&lt;BR /&gt;NFS server colada_nfs not responding still trying&lt;BR /&gt;...&lt;BR /&gt;&lt;BR /&gt;and after, say, 10 minutes it succeeds.&lt;BR /&gt;I have enabled NFS logging, but nothing revealing shows up in the logs. I have sniffed the traffic, and a successful sequence is:&lt;BR /&gt;&lt;BR /&gt;UDP:&lt;BR /&gt;e6 -&amp;gt; e5:111 GETPORT MOUNTD (100005)&lt;BR /&gt;   &amp;lt;- 57585&lt;BR /&gt;e6 -&amp;gt; e5:57585 NULL call&lt;BR /&gt;   &amp;lt;-          NULL reply&lt;BR /&gt;e6 -&amp;gt; e5:57585 MNT /u6_local&lt;BR /&gt;   &amp;lt;-          OK, filehandle = ...&lt;BR /&gt;e6 -&amp;gt; e5:111 GETPORT NFS (100003)&lt;BR /&gt;   &amp;lt;- 2049&lt;BR /&gt;&lt;BR /&gt;TCP:&lt;BR /&gt;e6 -&amp;gt; e5:2049 NULL call&lt;BR /&gt;   &amp;lt;-         NULL reply&lt;BR /&gt;e6 -&amp;gt; e5:2049 GETATTR filehandle = ... 
(*)&lt;BR /&gt;   &amp;lt;-         directory mode:0755 uid:0 gid:0&lt;BR /&gt;e6 -&amp;gt; e5:2049 FSINFO filehandle = ...&lt;BR /&gt;   &amp;lt;-         max file size, supports symbolic links...&lt;BR /&gt;...&lt;BR /&gt;&lt;BR /&gt;In a failing one, the first part (UDP) works fine: mountd.log shows the requests being immediately granted:&lt;BR /&gt;     rpc.mountd: mount: mount request from ensnada6 granted.&lt;BR /&gt;However, as for the TCP part, the packet marked with (*) includes:&lt;BR /&gt;- as source IP from the client node, the VIP(!), which is no longer assigned to any interface on that node (in fact, the ifconfig lanX:N 0.0.0.0 removed the secondary interface)&lt;BR /&gt;- as destination IP, the VIP, which is correct and correctly assigned to the new SERVER node&lt;BR /&gt;- both src and dst MAC addresses are correct.&lt;BR /&gt;&lt;BR /&gt;13:12:05.882737 IP (tos 0x0, ttl  64, id 2168, offset 0, flags [DF], length: 152&lt;BR /&gt;) colada_nfs.cata.103927827 &amp;gt; colada_nfs.cata.nfs: 112 getattr fh 4100,131073/2&lt;BR /&gt;        0x0000:  0018 7100 f026 0013 21ea 2745 0800 4500  ..q..&amp;amp;..!.'E..E.&lt;BR /&gt;        0x0010:  0098 0878 4000 4006 501d 0a39 e6ac 0a39  ...x@.@.P..9...9&lt;BR /&gt;        0x0020:  e6ac 02c6 0801 df45 24a7 df46 6b6e 5018  .......E$..FknP.&lt;BR /&gt;        0x0030:  8000 a3e3 0000 8000 006c 0631 d013 0000  .........l.1....&lt;BR /&gt;        0x0040:  0000 0000 0002 0001 86a3 0000 0003 0000  ................&lt;BR /&gt;        0x0050:  0001 0000 0001 0000 0020 49f0 4d05 0000  ..........I.M...&lt;BR /&gt;        0x0060:  0008 656e 736e 6164 6136 0000 0000 0000  ..ensnada6......&lt;BR /&gt;        0x0070:  0003 0000 0001 0000 0003 0000 0000 0000  ................&lt;BR /&gt;        0x0080:  0000 0000 0020 4012 0001 ffff ffff 000a  ......@.........&lt;BR /&gt;        0x0090:  0000 0000 0002 0000 0000 000a 0000 0000  ................&lt;BR /&gt;        0x00a0:  0002 0000 0000                           &lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;The effect is that these packets are being sent again and again to the SERVER node, which rightly ignores them (no answer).&lt;BR /&gt;After several minutes, the client opens a new TCP:2049 connection (this time using the correct physical IP as src) and succeeds.&lt;BR /&gt;When failing, I have checked netstat -i, arp -a... on both nodes and everything is correct. The node acquiring the VIP does send the gratuitous ARP...&lt;BR /&gt;&lt;BR /&gt;I guess the problem might be:&lt;BR /&gt;- when node1, being the SERVER, starts the CLIENT, a TCP connection like this is created:&lt;BR /&gt;tcp        0      0  10.57.230.172.2049     10.57.230.172.691       ESTABLISHED&lt;BR /&gt;tcp        0      0  10.57.230.172.691      10.57.230.172.2049      ESTABLISHED&lt;BR /&gt;i.e. with the VIP at both ends, although both the src &amp;amp; dst TCP sockets are on the same machine (the src &amp;amp; dst MACs being the same, that of node1)&lt;BR /&gt;- after node1 stops the client and then the server, that TCP connection is not released&lt;BR /&gt;- then node2 starts the SERVER part, thus acquiring the VIP&lt;BR /&gt;- it sends a gratuitous ARP, which is received by node1. 
The TCP connection is not released, but the ARP cache is updated (VIP -&amp;gt; node2's MAC)&lt;BR /&gt;- node1 tries to mount a remote share from the VIP (now node2)&lt;BR /&gt;- the UDP part works fine&lt;BR /&gt;- when it comes to the TCP part, as node1 already has a TCP socket whose destination is the VIP, it reuses this connection, therefore sending the GETATTR messages to node2 (node2's MAC)&lt;BR /&gt;- node2 receives them, but there is no socket at the TCP level corresponding to that connection, so it just ignores them&lt;BR /&gt;- after some minutes, node1 closes this connection and opens a new one, this time a real one using the physical address as src IP&lt;BR /&gt;- the rest of the TCP handshake takes place.&lt;BR /&gt;&lt;BR /&gt;Questions:&lt;BR /&gt;1) Do you think the hypothesis makes sense?&lt;BR /&gt;2) What could be done to release the VIP-VIP connection? (I could try ndd but I think a more graceful approach may exist.)&lt;BR /&gt;3) Related to 2: Which NFS process owns this connection? (It doesn't show up in lsof output but, should we be able to identify it, maybe we could find a natural way to tell it to release the connection.)&lt;BR /&gt;4) Why am I not able to see the 2049 TCP sockets in lsof output, even those coming from remote machines?&lt;BR /&gt;&lt;BR /&gt;Are you still there?&lt;BR /&gt;Thanks for your patience.</description>
      <pubDate>Thu, 23 Apr 2009 15:37:35 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/nfs-vip-problem/m-p/4406818#M536074</guid>
      <dc:creator>Jose M. del Rio</dc:creator>
      <dc:date>2009-04-23T15:37:35Z</dc:date>
    </item>
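    <!--
      A minimal sketch (not part of the thread) of how one could test for the lingering
      VIP-to-VIP NFS connection described in the post above before running start_client.
      Only netstat is used, as in the thread; the VIP 10.57.230.172 and port 2049 come
      from the post, while the script itself and its name are hypothetical.

        #!/usr/bin/sh
        # check_vip_conn.sh: warn if a TCP connection with the VIP at both ends survives
        VIP=10.57.230.172

        # Match ESTABLISHED TCP lines where the VIP appears as both local and foreign address
        stale=$(netstat -an | grep "^tcp" | grep ESTABLISHED | grep "$VIP\..*$VIP\.")

        if [ -n "$stale" ]
        then
            echo "WARNING: stale VIP-to-VIP TCP connection(s) still present:"
            echo "$stale"
            echo "start_client may hang until the client side gives up and reconnects."
        fi
    -->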
    <item>
      <title>Re: NFS VIP problem</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/nfs-vip-problem/m-p/4406819#M536075</link>
      <description>Let's try again uploading the script.</description>
      <pubDate>Thu, 23 Apr 2009 15:41:00 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/nfs-vip-problem/m-p/4406819#M536075</guid>
      <dc:creator>Jose M. del Rio</dc:creator>
      <dc:date>2009-04-23T15:41:00Z</dc:date>
    </item>
    <item>
      <title>Re: NFS VIP problem</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/nfs-vip-problem/m-p/4406820#M536076</link>
      <description>Shalom,&lt;BR /&gt;&lt;BR /&gt;The key reading for us insomniacs is:&lt;BR /&gt;&lt;BR /&gt;"and after, say, 10 minutes it succeeds."&lt;BR /&gt;&lt;BR /&gt;There is a delay in the server coming online.&lt;BR /&gt;&lt;BR /&gt;Or we are going to the wrong server because of an ARP cache entry, and after the cache is flushed the system is forced to get fresh information for the cache.&lt;BR /&gt;&lt;BR /&gt;Specific answers:&lt;BR /&gt;1) Yes, your theory is based on solid data. If those sockets were closed, failover might be faster.&lt;BR /&gt;2) Close the connections. NFS v4 has a different and better locking mechanism. To use NFS v4 you need 11.31. Also, you might do better with a simple NAS device, which is simpler to administer and better equipped for this job.&lt;BR /&gt;&lt;BR /&gt;3) fuser -cu /filesystem_name&lt;BR /&gt;This will show you which processes have files open on the filesystem. That will help in process identification.&lt;BR /&gt;&lt;BR /&gt;4) Try netstat -an | grep 2049. Also, don't forget NFS opens a socket on a random port in version 3 and below, which is why it's so much fun to open up NFS on a firewall.&lt;BR /&gt;&lt;BR /&gt;SEP</description>
      <pubDate>Thu, 23 Apr 2009 16:17:40 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/nfs-vip-problem/m-p/4406820#M536076</guid>
      <dc:creator>Steven E. Protter</dc:creator>
      <dc:date>2009-04-23T16:17:40Z</dc:date>
    </item>
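    <!--
      Usage sketch for the two commands suggested above, applied to the mount point and
      port that appear in this thread. The exact output differs per system, so none is
      shown here.

        # Which processes (and users) have files open on the NFS-mounted filesystem?
        fuser -cu /u6

        # Which sockets involve the NFS port, including kernel-owned ones that lsof
        # may not list?
        netstat -an | grep 2049
    -->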
    <item>
      <title>Re: NFS VIP problem</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/nfs-vip-problem/m-p/4406821#M536077</link>
      <description>Hi Steven,&lt;BR /&gt;thanks for your prompt response.&lt;BR /&gt;&lt;BR /&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; There is a delay in the server coming online.&lt;BR /&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; Or we are going to the wrong server because of an ARP cache...&lt;BR /&gt;No.&lt;BR /&gt;The server comes online immediately, the ARP cache is immediately updated, and the GETATTR packets are indeed received at the new server, as the sniffer traces on both nodes show.&lt;BR /&gt;&lt;BR /&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; 3) fuser -cu /filesystem_name&lt;BR /&gt;In my tests, no one is using the FS, and yet the VIP-VIP connection survives.&lt;BR /&gt;&lt;BR /&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; 4) Try netstat -an | grep 2049&lt;BR /&gt;Yes, I'm using it. That's why I know there is something missing in the lsof output.&lt;BR /&gt;&lt;BR /&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; you might do better with a simple NAS device&lt;BR /&gt;Did you read my mind? There is one coming soon. In the meantime we have developed this workaround, which works most of the time.&lt;BR /&gt;&lt;BR /&gt;Regards.</description>
      <pubDate>Thu, 23 Apr 2009 16:32:32 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/nfs-vip-problem/m-p/4406821#M536077</guid>
      <dc:creator>Jose M. del Rio</dc:creator>
      <dc:date>2009-04-23T16:32:32Z</dc:date>
    </item>
    <item>
      <title>Re: NFS VIP problem</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/nfs-vip-problem/m-p/4406822#M536079</link>
      <description>Hi Jose,&lt;BR /&gt;&lt;BR /&gt;Do you get the same behavior if you force all the NFS mounts to use UDP instead of TCP?&lt;BR /&gt;&lt;BR /&gt;Regards,&lt;BR /&gt;&lt;BR /&gt;Dave</description>
      <pubDate>Fri, 24 Apr 2009 02:09:08 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/nfs-vip-problem/m-p/4406822#M536079</guid>
      <dc:creator>Dave Olker</dc:creator>
      <dc:date>2009-04-24T02:09:08Z</dc:date>
    </item>
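    <!--
      A sketch of how forcing NFS over UDP, as suggested above, might look on HP-UX.
      The proto/vers mount options and the fstab layout are assumptions to be checked
      against the local mount_nfs(1M) man page; the server, share and mount point come
      from the thread.

        # One-off mount forcing NFS over UDP instead of TCP
        mount -F nfs -o proto=udp,vers=3 colada_nfs:/u6_local /u6

        # Corresponding /etc/fstab entry
        colada_nfs:/u6_local  /u6  nfs  proto=udp,vers=3  0  0
    -->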
    <item>
      <title>Re: NFS VIP problem</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/nfs-vip-problem/m-p/4406823#M536081</link>
      <description>Bingo!&lt;BR /&gt;No TCP connection created =&amp;gt; no VIP-VIP TCP connection reused =&amp;gt; no problem.&lt;BR /&gt;Thanks a lot.</description>
      <pubDate>Fri, 24 Apr 2009 08:19:13 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/nfs-vip-problem/m-p/4406823#M536081</guid>
      <dc:creator>Jose M. del Rio</dc:creator>
      <dc:date>2009-04-24T08:19:13Z</dc:date>
    </item>
  </channel>
</rss>

