<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: NFS VIP problem in Operating System - HP-UX</title>
    <link>https://community.hpe.com/t5/operating-system-hp-ux/nfs-vip-problem/m-p/4406820#M536076</link>
    <description>Shalom,&lt;BR /&gt;&lt;BR /&gt;The key reading for us insomniacs is:&lt;BR /&gt;&lt;BR /&gt;"and after, say, 10 minutes it succeeds."&lt;BR /&gt;&lt;BR /&gt;There is a delay in the server coming online.&lt;BR /&gt;&lt;BR /&gt;Or we are going to the wrong server because of an ARP cache entry, and after the cache is flushed the system is forced to get fresh information for the cache.&lt;BR /&gt;&lt;BR /&gt;Specific answers:&lt;BR /&gt;1) Yes, your theory is based on solid data. If those sockets were closed, failover might be faster.&lt;BR /&gt;2) Close the connections. NFS v4 has a different and better locking mechanism. To use NFS v4 you need 11.31. Also, you might do better with a simple NAS device, which is simpler to administer and better equipped for this job.&lt;BR /&gt;&lt;BR /&gt;3) fuser -cu /filesystem_name&lt;BR /&gt;This will show you which processes have files open on the filesystem. That will help in process identification.&lt;BR /&gt;&lt;BR /&gt;4) Try netstat -an | grep 2049. Also, don't forget NFS opens a socket on a random port in version 3 and below, which is why it's so much fun to open up NFS on a firewall.&lt;BR /&gt;&lt;BR /&gt;SEP</description>
    <pubDate>Thu, 23 Apr 2009 16:17:40 GMT</pubDate>
    <dc:creator>Steven E. Protter</dc:creator>
    <dc:date>2009-04-23T16:17:40Z</dc:date>
    <item>
      <title>NFS VIP problem</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/nfs-vip-problem/m-p/4406818#M536074</link>
      <description>Hi,&lt;BR /&gt;Some reading for the insomniacs:&lt;BR /&gt;In order to share some filesystems among several servers, we are using a script derived from those in "MC/ServiceGuard NFS" (&lt;A href="http://docs.hp.com/en/ha.html#Highly%20Available%20NFS" target="_blank"&gt;http://docs.hp.com/en/ha.html#Highly%20Available%20NFS&lt;/A&gt;). That is:&lt;BR /&gt;- we use a VIP that is assigned to one of the nodes&lt;BR /&gt;- that node activates the VG, mounts its filesystems locally and exports them to the rest&lt;BR /&gt;- the rest of the nodes (and the node itself) mount those filesystems from the VIP via NFS, as in the "Server-to-Server Cross-Mounts" option in ServiceGuard NFS.&lt;BR /&gt;&lt;BR /&gt;The SERVER or CLIENT roles of the nodes can be switched using this script (attached).&lt;BR /&gt;The problem is: from time to time, after a role change, the clients are not able to mount the remote shares via NFS.&lt;BR /&gt;For instance, if the node names are e5 and e6 and:&lt;BR /&gt;- e6 is acting both as server &amp;amp; client&lt;BR /&gt;- e5 is acting as client&lt;BR /&gt;then, after issuing:&lt;BR /&gt;e5: &lt;BR /&gt;./nfs_Catastro.cntl stop_client&lt;BR /&gt;&lt;BR /&gt;e6: &lt;BR /&gt;./nfs_Catastro.cntl stop_client&lt;BR /&gt;./nfs_Catastro.cntl stop_server&lt;BR /&gt;&lt;BR /&gt;e5:&lt;BR /&gt;./nfs_Catastro.cntl start_server&lt;BR /&gt;./nfs_Catastro.cntl start_client&lt;BR /&gt;&lt;BR /&gt;e6:&lt;BR /&gt;./nfs_Catastro.cntl start_client&lt;BR /&gt;&lt;BR /&gt;the latter start_client fails to mount the first remote share right away (a manual "mount colada_nfs:/u6_local /u6" also fails).&lt;BR /&gt;It keeps trying and issuing messages like:&lt;BR /&gt;NFS server colada_nfs not responding still trying&lt;BR /&gt;NFS server colada_nfs not responding still trying&lt;BR /&gt;NFS server colada_nfs not responding still trying&lt;BR /&gt;...&lt;BR /&gt;&lt;BR /&gt;and after, say, 10 minutes it succeeds.&lt;BR /&gt;I have enabled NFS logging, but nothing revealing shows up in the logs. I have sniffed the traffic, and a successful sequence is:&lt;BR /&gt;&lt;BR /&gt;UDP:&lt;BR /&gt;e6 -&amp;gt; e5:111 GETPORT MOUNTD (100005)&lt;BR /&gt;   &amp;lt;- 57585&lt;BR /&gt;e6 -&amp;gt; e5:57585 NULL call&lt;BR /&gt;   &amp;lt;-          NULL reply&lt;BR /&gt;e6 -&amp;gt; e5:57585 MNT /u6_local&lt;BR /&gt;   &amp;lt;-          OK, filehandle = ...&lt;BR /&gt;e6 -&amp;gt; e5:111 GETPORT NFS (100003)&lt;BR /&gt;   &amp;lt;- 2049&lt;BR /&gt;&lt;BR /&gt;TCP:&lt;BR /&gt;e6 -&amp;gt; e5:2049 NULL call&lt;BR /&gt;   &amp;lt;-         NULL reply&lt;BR /&gt;e6 -&amp;gt; e5:2049 GETATTR filehandle = ... 
(*)&lt;BR /&gt;   &amp;lt;-         directory mode:0755 uid:0 gid:0&lt;BR /&gt;e6 -&amp;gt; e5:2049 FSINFO filehandle = ...&lt;BR /&gt;   &amp;lt;-         max file size, supports symbolic links...&lt;BR /&gt;...&lt;BR /&gt;&lt;BR /&gt;In a failing one, the first part (UDP) works fine: mountd.log shows the requests being immediately granted:&lt;BR /&gt;     rpc.mountd: mount: mount request from ensnada6 granted.&lt;BR /&gt;However, as for the TCP part, the packet marked with (*) includes:&lt;BR /&gt;- as source IP from the client node, the VIP(!), which is no longer assigned to any interface on that node (in fact, the ifconfig lanX:N 0.0.0.0 removed the secondary interface)&lt;BR /&gt;- as destination IP, the VIP, which is correct and correctly assigned to the new SERVER node&lt;BR /&gt;- both src and dst MAC addresses are correct.&lt;BR /&gt;&lt;BR /&gt;13:12:05.882737 IP (tos 0x0, ttl  64, id 2168, offset 0, flags [DF], length: 152&lt;BR /&gt;) colada_nfs.cata.103927827 &amp;gt; colada_nfs.cata.nfs: 112 getattr fh 4100,131073/2&lt;BR /&gt;        0x0000:  0018 7100 f026 0013 21ea 2745 0800 4500  ..q..&amp;amp;..!.'E..E.&lt;BR /&gt;        0x0010:  0098 0878 4000 4006 501d 0a39 e6ac 0a39  ...x@.@.P..9...9&lt;BR /&gt;        0x0020:  e6ac 02c6 0801 df45 24a7 df46 6b6e 5018  .......E$..FknP.&lt;BR /&gt;        0x0030:  8000 a3e3 0000 8000 006c 0631 d013 0000  .........l.1....&lt;BR /&gt;        0x0040:  0000 0000 0002 0001 86a3 0000 0003 0000  ................&lt;BR /&gt;        0x0050:  0001 0000 0001 0000 0020 49f0 4d05 0000  ..........I.M...&lt;BR /&gt;        0x0060:  0008 656e 736e 6164 6136 0000 0000 0000  ..ensnada6......&lt;BR /&gt;        0x0070:  0003 0000 0001 0000 0003 0000 0000 0000  ................&lt;BR /&gt;        0x0080:  0000 0000 0020 4012 0001 ffff ffff 000a  ......@.........&lt;BR /&gt;        0x0090:  0000 0000 0002 0000 0000 000a 0000 0000  ................&lt;BR /&gt;        0x00a0:  0002 0000 0000                           &lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;The effect is that these packets are being sent again and again to the SERVER node, which rightly ignores them (no answer).&lt;BR /&gt;After several minutes, the client opens a new TCP:2049 connection (this time using the correct physical IP as src) and succeeds.&lt;BR /&gt;When failing, I have checked netstat -i, arp -a... on both nodes and everything is correct. The node acquiring the VIP does send the gratuitous ARP...&lt;BR /&gt;&lt;BR /&gt;I guess the problem might be:&lt;BR /&gt;- when node1, being the SERVER, starts the CLIENT, a TCP connection like this is created:&lt;BR /&gt;tcp        0      0  10.57.230.172.2049     10.57.230.172.691       ESTABLISHED&lt;BR /&gt;tcp        0      0  10.57.230.172.691      10.57.230.172.2049      ESTABLISHED&lt;BR /&gt;i.e. with the VIP at both ends, although both the src &amp;amp; dst TCP sockets are on the same machine (the src &amp;amp; dst MACs being the same, that of node1)&lt;BR /&gt;- after node1 stops the client and then the server, that TCP connection is not released&lt;BR /&gt;- then node2 starts the SERVER part, thus acquiring the VIP&lt;BR /&gt;- it sends a gratuitous ARP, which is received by node1. 
The TCP connection is not released, but the ARP cache is updated (VIP -&amp;gt; node2's MAC)&lt;BR /&gt;- node1 tries to mount a remote share from the VIP (now node2)&lt;BR /&gt;- the UDP part works fine&lt;BR /&gt;- when it comes to the TCP part, as node1 already has a TCP socket whose destination is the VIP, it reuses this connection, therefore sending the GETATTR messages to node2 (node2's MAC)&lt;BR /&gt;- node2 receives them, but there is no socket at the TCP level corresponding to that connection, so it just ignores them&lt;BR /&gt;- after some minutes, node1 closes this connection and opens a new one, this time a real one using the physical address as src IP&lt;BR /&gt;- the rest of the TCP handshake takes place.&lt;BR /&gt;&lt;BR /&gt;Questions:&lt;BR /&gt;1) Do you think the hypothesis makes sense?&lt;BR /&gt;2) What could be done to release the VIP-VIP connection? (I could try ndd but I think a more graceful approach may exist.)&lt;BR /&gt;3) Related to 2: Which NFS process owns this connection? (It doesn't show up in lsof output but, should we be able to identify it, maybe we could find a natural way to tell it to release the connection.)&lt;BR /&gt;4) Why am I not able to see the 2049 TCP sockets in lsof output, even those coming from remote machines?&lt;BR /&gt;&lt;BR /&gt;Are you still there?&lt;BR /&gt;Thanks for your patience.</description>
      <pubDate>Thu, 23 Apr 2009 15:37:35 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/nfs-vip-problem/m-p/4406818#M536074</guid>
      <dc:creator>Jose M. del Rio</dc:creator>
      <dc:date>2009-04-23T15:37:35Z</dc:date>
    </item>
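    <!--
      A minimal sketch (not part of the thread) of how one could test for the lingering
      VIP-to-VIP NFS connection described in the post above before running start_client.
      Only netstat is used, as in the thread; the VIP 10.57.230.172 and port 2049 come
      from the post, while the script itself and its name are hypothetical.

        #!/usr/bin/sh
        # check_vip_conn.sh: warn if a TCP connection with the VIP at both ends survives
        VIP=10.57.230.172

        # Match ESTABLISHED TCP lines where the VIP appears as both local and foreign address
        stale=$(netstat -an | grep "^tcp" | grep ESTABLISHED | grep "$VIP\..*$VIP\.")

        if [ -n "$stale" ]
        then
            echo "WARNING: stale VIP-to-VIP TCP connection(s) still present:"
            echo "$stale"
            echo "start_client may hang until the client side gives up and reconnects."
        fi
    -->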
    <item>
      <title>Re: NFS VIP problem</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/nfs-vip-problem/m-p/4406819#M536075</link>
      <description>Let's try again uploading the script.</description>
      <pubDate>Thu, 23 Apr 2009 15:41:00 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/nfs-vip-problem/m-p/4406819#M536075</guid>
      <dc:creator>Jose M. del Rio</dc:creator>
      <dc:date>2009-04-23T15:41:00Z</dc:date>
    </item>
    <item>
      <title>Re: NFS VIP problem</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/nfs-vip-problem/m-p/4406820#M536076</link>
      <description>Shalom,&lt;BR /&gt;&lt;BR /&gt;The key reading for us insomniacs is:&lt;BR /&gt;&lt;BR /&gt;"and after, say, 10 minutes it succeeds."&lt;BR /&gt;&lt;BR /&gt;There is a delay in the server coming online.&lt;BR /&gt;&lt;BR /&gt;Or we are going to the wrong server because of an ARP cache entry, and after the cache is flushed the system is forced to get fresh information for the cache.&lt;BR /&gt;&lt;BR /&gt;Specific answers:&lt;BR /&gt;1) Yes, your theory is based on solid data. If those sockets were closed, failover might be faster.&lt;BR /&gt;2) Close the connections. NFS v4 has a different and better locking mechanism. To use NFS v4 you need 11.31. Also, you might do better with a simple NAS device, which is simpler to administer and better equipped for this job.&lt;BR /&gt;&lt;BR /&gt;3) fuser -cu /filesystem_name&lt;BR /&gt;This will show you which processes have files open on the filesystem. That will help in process identification.&lt;BR /&gt;&lt;BR /&gt;4) Try netstat -an | grep 2049. Also, don't forget NFS opens a socket on a random port in version 3 and below, which is why it's so much fun to open up NFS on a firewall.&lt;BR /&gt;&lt;BR /&gt;SEP</description>
      <pubDate>Thu, 23 Apr 2009 16:17:40 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/nfs-vip-problem/m-p/4406820#M536076</guid>
      <dc:creator>Steven E. Protter</dc:creator>
      <dc:date>2009-04-23T16:17:40Z</dc:date>
    </item>
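    <!--
      Usage sketch for the two commands suggested above, applied to the mount point and
      port that appear in this thread. The exact output differs per system, so none is
      shown here.

        # Which processes (and users) have files open on the NFS-mounted filesystem?
        fuser -cu /u6

        # Which sockets involve the NFS port, including kernel-owned ones that lsof
        # may not list?
        netstat -an | grep 2049
    -->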
    <item>
      <title>Re: NFS VIP problem</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/nfs-vip-problem/m-p/4406821#M536077</link>
      <description>Hi Steven,&lt;BR /&gt;thanks for your prompt response.&lt;BR /&gt;&lt;BR /&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; There is a delay in the server coming online.&lt;BR /&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; Or we are going to the wrong server because of an ARP cache...&lt;BR /&gt;No.&lt;BR /&gt;The server comes online immediately, the ARP cache is immediately updated, and the GETATTR packets are indeed received at the new server, as the sniffer traces on both nodes show.&lt;BR /&gt;&lt;BR /&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; 3) fuser -cu /filesystem_name&lt;BR /&gt;In my tests, no one is using the FS, and yet the VIP-VIP connection survives.&lt;BR /&gt;&lt;BR /&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; 4) Try netstat -an | grep 2049&lt;BR /&gt;Yes, I'm using it. That's why I know there is something missing in the lsof output.&lt;BR /&gt;&lt;BR /&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; you might do better with a simple NAS device&lt;BR /&gt;Did you read my mind? There is one coming soon. In the meantime we have developed this workaround, which works most of the time.&lt;BR /&gt;&lt;BR /&gt;Regards.</description>
      <pubDate>Thu, 23 Apr 2009 16:32:32 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/nfs-vip-problem/m-p/4406821#M536077</guid>
      <dc:creator>Jose M. del Rio</dc:creator>
      <dc:date>2009-04-23T16:32:32Z</dc:date>
    </item>
    <item>
      <title>Re: NFS VIP problem</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/nfs-vip-problem/m-p/4406822#M536079</link>
      <description>Hi Jose,&lt;BR /&gt;&lt;BR /&gt;Do you get the same behavior if you force all the NFS mounts to use UDP instead of TCP?&lt;BR /&gt;&lt;BR /&gt;Regards,&lt;BR /&gt;&lt;BR /&gt;Dave</description>
      <pubDate>Fri, 24 Apr 2009 02:09:08 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/nfs-vip-problem/m-p/4406822#M536079</guid>
      <dc:creator>Dave Olker</dc:creator>
      <dc:date>2009-04-24T02:09:08Z</dc:date>
    </item>
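    <!--
      A sketch of how forcing NFS over UDP, as suggested above, might look on HP-UX.
      The proto/vers mount options and the fstab layout are assumptions to be checked
      against the local mount_nfs(1M) man page; the server, share and mount point come
      from the thread.

        # One-off mount forcing NFS over UDP instead of TCP
        mount -F nfs -o proto=udp,vers=3 colada_nfs:/u6_local /u6

        # Corresponding /etc/fstab entry
        colada_nfs:/u6_local  /u6  nfs  proto=udp,vers=3  0  0
    -->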
    <item>
      <title>Re: NFS VIP problem</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/nfs-vip-problem/m-p/4406823#M536081</link>
      <description>Bingo!&lt;BR /&gt;No TCP connection created =&amp;gt; no VIP-VIP TCP connection reused =&amp;gt; no problem.&lt;BR /&gt;Thanks a lot.</description>
      <pubDate>Fri, 24 Apr 2009 08:19:13 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/nfs-vip-problem/m-p/4406823#M536081</guid>
      <dc:creator>Jose M. del Rio</dc:creator>
      <dc:date>2009-04-24T08:19:13Z</dc:date>
    </item>
  </channel>
</rss>

