totally bizarre NFS anomaly

support_5 · ‎09-09-2004

Hi Folks,

I have come upon a very very strange NFS problem. I need help here. Let me describe the problem to you:

We have a filesystem on server1 (named guam) exported to server2, and server2 has soft-mounted it read-only.

When I go into the mount on server2, I can browse the filesystem, cat files, etc. as normal.

However, while I was doing this I came upon one file I could not cat; it returned an NFS error as follows:
NFS read failed for server guam: RPC: Timed out
cat: read error: No such file or directory

Initially, I thought that someone had umounted the filesystem or something, but no, I could cat every other file except this one. I've never heard of NFS failing on a single file, so it was strange.

I next thought it might be a permissions problem, so I examined the permissions on the file, but they were the same as every other file in the directory. I ran fuser on the file on server1 and no one was using it.

I ran the file command on the file, and it was like every other file in the directory, "commands text" (it is a ksh script, just like every other file in the directory).

So, here I am with a file I cannot cat, but I can cat every other file in the directory. Now we will skip some time, because I tried many things that didn't work, but there was one thing I found that seemed to fix the problem:
If I deleted one (1) character from anywhere in the file and saved it, I found that I was then able to cat the file on the NFS mount. How strange is that? I then found that if I added 5 more characters to the file (4 more than it originally had), then I could also cat the file. But if I only added 2 or 3 characters, it would fail.

So in essence, there was a narrow band of file sizes (2825 to 2828 bytes) for which NFS would fail to read the file, but any size outside that band, either higher or lower, would work fine.

i.e. the original file size was 2825 bytes:
a file size of 2824 bytes or less would work
a file size of 2825 to 2828 bytes would fail
a file size of 2829 bytes or more would work
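
The size boundaries above could be mapped more systematically with a pair of small helpers. This is only a sketch: `make_probes`, `read_probes`, and the `probe_<size>` file names are hypothetical and not part of the original post. It assumes a POSIX shell with `dd` and `tr` available on both hosts.

```shell
#!/usr/bin/sh
# Hypothetical helpers for mapping which file sizes fail over an NFS mount.

# make_probes DIR LO HI
# Run on the NFS server: create DIR/probe_<size> for every size LO..HI,
# each file holding exactly <size> 'a' characters.
make_probes() {
    dir=$1
    size=$2
    hi=$3
    while [ "$size" -le "$hi" ]; do
        dd if=/dev/zero bs=1 count="$size" 2>/dev/null |
            tr '\000' 'a' > "$dir/probe_$size"
        size=$((size + 1))
    done
}

# read_probes DIR LO HI
# Run on the NFS client against the mounted copy of DIR: try to read each
# probe file and print "<size> ok" or "<size> FAIL".
read_probes() {
    dir=$1
    size=$2
    hi=$3
    while [ "$size" -le "$hi" ]; do
        if cat "$dir/probe_$size" > /dev/null 2>&1; then
            echo "$size ok"
        else
            echo "$size FAIL"
        fi
        size=$((size + 1))
    done
}
```

For example, `make_probes <exported directory> 2820 2833` on the server followed by `read_probes <mounted directory> 2820 2833` on the client would list one result per size, which makes the failing band easy to see.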

Can someone please tell me why this would be the case? I am completely stumped...

- Andrew Gray

(I am running hpux 11.00, path level is at March 2004 bundle.)

support_5 · ‎09-09-2004

one more thing:

I found that I was always successful in doing a cat on the file when I was doing it on an automounted directory. ie cat /net/guam/myscripts/scriptwithproblem.ksh
would work (this was automounted)
but: cat /myscripts/scriptwithproblem.ksh
would fail (this was normal nfs mounted), even though they were the same file.

I meant to say the patch level was march 2004.

- Andrew Gray

Dave Olker · ‎09-09-2004

Hi Andrew,

That is a strange problem, but I've seen stranger. :)

I assume when you say you're running 11.0 that you mean the NFS client is 11.0. What type of system is the NFS server? Are both systems running HP-UX? If so, please issue the following command on each system for me:

# swlist -l product | grep ONC

and give me the output.

As for your specific problem, you said that when you try to cat the file from the manually mounted NFS filesystem it fails unless you first modify it, but when you cat the file from the automounted directory it works. Which automounter are you running on the 11.0 client - the legacy automounter or the ONC 1.2 AutoFS? If you're not sure, please copy/paste your /etc/rc.config.d/nfsconf file into this thread so that I can see how the client is configured. If the server is also an HP-UX system, give me that system's nfsconf file contents as well.

Finally, I'd like to see the output of the command:

# nfsstat -m

on the NFS client, and the output of the command:

# cat /etc/xtab

on the NFS server - assuming it is an HP-UX system.

All of this information will give me some ideas of where to go next.

Regards,

Dave

I work at HPE
HPE Support Center offers support for your HPE services and products when and how you need it. Get started with HPE Support Center today.
[Any personal opinions expressed are mine, and not official statements on behalf of Hewlett Packard Enterprise]

Sยภเl Kย๓คг · ‎09-09-2004

Can you go to server1, move the particular file to some other directory, and see if it works?
regards
SK

Your imagination is the preview of your life's coming attractions

support_5 · ‎09-09-2004

Hi,

I will attach the information you wanted. As for your question about automount vs. AutoFS: it appears we are using the old automounter, since AUTOFS=0 in the nfsconf file. Perhaps we should change it to use the new AutoFS? Is it heaps better?

Both the client and server are running HP-UX 11.00.

Let me know if you still need the /etc/rc.config.d/nfsconf files from the client/server

Also, the directory I've been playing with is /apps/dv. That is the mount which contains the file I'm having trouble with.

I've also tried mounting the /apps/dv directory on other HP-UX hosts I have around, and tested to see if they exhibit the same behaviour. The result so far is that the other hosts do have the same behaviour, and some of these hosts are hpux 11.11 (11i) hosts too.

However, if I copy the file to another server, create an export on that server, and then mount it around the place, then it will work, and I am then able to cat the file. So perhaps the problem lies with the NFS server - guam. Don't know what though.

What do you think?

- Andy

Sridhar Bhaskarla · ‎09-09-2004

Hi Andy,

Can you do a 'cat -v file' on the NFS server and see if there are any interesting characters? Or what if you do

#cat file > file1

and then try on file1?

-Sri

You may be disappointed if you fail, but you are doomed if you don't try

Dave Olker · ‎09-09-2004

Hi Andy,

I didn't see the automounted filesystem in your nfsstat -m output, so I assume automounter simply unmounted it (as it is supposed to).

Since you're using the old automounter, it would have mounted the filesystem using NFS Version 2, whereas a manual mount will default to NFS Version 3. Both will use an 8K rsize/wsize by default on 11.0, and it appears you either have not enabled NFS/TCP on these systems or you're choosing to use UDP.

Should you use AutoFS? The ONC 1.2 AutoFS is not a very good version of AutoFS, but it is the only version available on 11.0. If you could update the client to 11i then you could download the ONC 2.3 version of AutoFS from http://software.hp.com, and that AutoFS is far superior to both the ONC 1.2 version on 11.0 and the legacy automounter that you're using.

I'd still like to see the client and server's nfsconf files just to be safe.

The fact that other clients show the same behavior would seem to eliminate the possibility of a client cache corruption issue, which is what this problem originally sounded like.

I would be curious if this client, and the other clients, can successfully cat the file if you manually mount the filesystem with NFS Version 2. If you add the "vers=2" option to your mount syntax it should force NFS version 2.

You could even try creating a new empty directory on the initial client and mount the same filesystem from the server again using NFS version 2 into the new empty directory so that you'll have the same filesystem mounted twice on the same client - one with NFS V2 and one with NFS V3. That would be really interesting to see if the problem shows up in the V3 mount but not the V2 mount.
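
As a rough sketch of that side-by-side test, the helper below only prints the commands for review rather than running them. The function name `print_dual_mounts`, the export path, and the base directory are hypothetical; `-F nfs` and the `ro`, `soft`, and `vers=` options are the HP-UX mount syntax as best I know it, so check them against the local mount_nfs man page before use.

```shell
#!/usr/bin/sh
# print_dual_mounts SERVER EXPORT_PATH BASE_DIR
# Hypothetical helper: prints (does not run) the commands that would mount
# the same export twice on one client, once per NFS version.
print_dual_mounts() {
    server=$1
    export_path=$2
    base=$3
    for vers in 2 3; do
        echo "mkdir -p $base/v$vers"
        echo "mount -F nfs -o ro,soft,vers=$vers $server:$export_path $base/v$vers"
    done
}

# Example (export path and base directory are assumptions):
print_dual_mounts guam /apps/dv /tmp/nfstest
```

After running the printed commands as root, the same file can be read under the `v2` and `v3` directories and the results compared.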

Let me know what this test reveals.

Thanks,

Dave

support_5 · ‎09-09-2004

Yes, I get the same problem when I copy the file to another directory, and I also get the same problem when I copy the file to a different exported share on the nfs-server.

- Andy

Dave Olker · ‎09-09-2004

Hi again,

I just took a look at the latest 11.0 ONC patch and found this fix in it:

librpc.a
SR: 8606347226
DTS: JAGaf08050
Commands operating on an NFS file system mounted as a soft mount over UDP transport protocol fail with the error message: "RPC: Unable to receive".

This error is very similar to yours and you are using soft mounts and UDP, so this could be a match. If you are able to, I'd like you to install PHNE_30377 on the 11.0 NFS client (along with any dependent patches) and see if this affects the behavior of the client.

Again, I'm not one to typically "throw patches" at a problem, but the symptoms described in the patch text are pretty close to what you're seeing.

Regards,

Dave

support_5 · ‎09-09-2004

hi,

As I've mentioned, I get the same problem when I copy the file to different directories, and when I copy the file to different NFS shares. I also get the same problem when I create a file from scratch which comes to between 2824 and 2829 bytes. (I filled the file with 2825 a's.)

As for patching:
I'll schedule some patching to be done. However, note that (as I said before), when I copy the file to other hosts and create an export on them, it seems to work fine. So we'll see.

Any other ideas?

- Andy

support_5 · ‎09-09-2004

Hi all,

Thanks for your help so far. Here is an update:

I don't know what did it, but I can no longer seem to duplicate the behaviour I was describing above. What I'm saying is that it seems to be working!

I was experimenting with adding the vers=2 option on the client and trying various mounts, and I know I also did an /sbin/init.d/nfs.client stop and start on the client. But anyway, I can now cat the file. I didn't stop or restart anything on the server, but I can't get it to error like it was before. I don't know what I did to make it work properly again.

This is very strange! It behaves as if nothing was ever wrong! I'm stumped on this one.

- Andy

Dave Olker · ‎09-09-2004

How about on the other clients that were exhibiting the problem? Are they still seeing the same behavior or is it fixed everywhere?

Dave

support_5 · ‎09-09-2004

Re: you not seeing the automount. I had changed AUTOMOUNT back to 0 in the nfsconf file. You see, I had changed it to AUTOMOUNT=1 last night and ran /sbin/init.d/nfs.client start. There were those who thought that this change was what brought about this behaviour, so I was forced to regress the change. I regressed it, but it didn't help the situation. That is why you didn't see any automount stuff there. I have since turned it back on though, i.e. automount is working again on the client.

- Andy

support_5 · ‎09-09-2004

It's working fine on the other clients also. Forgot to mention that. I restarted nfs.client on cook, but I can't get the other clients to error the way they were a moment ago either.

very strange.

- Andy

support_5 · ‎09-09-2004

I just double checked the processes on guam (the nfs-server), to see whether I did restart something there, but no, the processes are all dated from July...

[guam]:/root # ps -ef | grep -e nfs -e biod -e rpc
root    1507     1 0   Jul 26 ?     0:21 /usr/sbin/biod 4
root    1483     1 0   Jul 26 ?     0:04 /usr/sbin/rpcbind
root    1488     0 0   Jul 26 ?     0:00 nfskd
root    1525     1 0   Jul 26 ?     0:04 /usr/sbin/rpc.lockd
root    1508     1 0   Jul 26 ?     0:21 /usr/sbin/biod 4
root    1509     1 0   Jul 26 ?     0:21 /usr/sbin/biod 4
root    1510     1 0   Jul 26 ?     0:21 /usr/sbin/biod 4
root    1519     1 0   Jul 26 ?     0:03 /usr/sbin/rpc.statd
root    1939     1 0   Jul 26 ?     4:36 /opt/dce/sbin/rpcd
root   12890 12885 0   Jul 31 ?     2:36 /usr/sbin/nfsd 4
root   12887 12885 0   Jul 31 ?     2:41 /usr/sbin/nfsd 4
root   12874     1 0   Jul 31 ?     0:02 /usr/sbin/rpc.mountd
root    3821 29041 1 14:56:29 pts/9 0:00 grep -e nfs -e biod -e rpc
root   12888 12885 0   Jul 31 ?     2:37 /usr/sbin/nfsd 4
root   12885     1 0   Jul 31 ?     2:34 /usr/sbin/nfsd 4
daemon  1859  1567 0   Jul 26 ?     0:03 rpc.cmsd
root    1585  1567 0   Jul 26 ?     0:03 /usr/dt/bin/rpc.ttdbserver
root   12889 12885 0   Jul 31 ?     2:38 /usr/sbin/nfsd 4
root   12900     1 0   Jul 31 ?     0:02 /usr/sbin/rpc.pcnfsd
root   12886 12885 0   Jul 31 ?     2:41 /usr/sbin/nfsd 4

support_5 · ‎09-09-2004

I asked my co-workers about it, and my workmate told me that he had restarted SNMP and restarted ColdFusion (web development stuff) on guam. I doubt that had anything to do with it, but it was around the same time that this started working by itself again. Could be a coincidence, who knows.

- Andy

support_5 · ‎09-09-2004

guam=nfs-server

support_5 · ‎09-09-2004

This is so weird, I'm beginning to doubt my sanity about this. But no, I have confirmed with my co-worker (who has watched as we have progressed in this problem) that I am indeed sane, and that 10 minutes ago, there was indeed a problem with files of that size (i.e. around 2825 bytes).

Any ideas as to what is going on?

- Andy

Dave Olker · ‎09-09-2004

Andy,

The times I've seen behavior like this in the past, where a specific sized file fails, it has been a couple of things:

1. Client cache corruption

This wouldn't explain why it fails on more than one client

2. UDP checksum failures

Check the "netstat -p udp" output on all of the systems involved to see if any UDP checksum failures are logged.

3. Network problem

Some piece of intermediate network hardware is corrupting specific packets - usually based on a certain size of the packet or byte alignment. This would explain why adding a few bytes or deleting a few bytes from the file in question would get it to start working again.

If you don't see any UDP checksum failures on any of the systems, my best guess (without looking at any other data) would be #3 - that some piece of network equipment was dropping packets based on a byte alignment issue. This would explain why multiple clients saw the same behavior - assuming they use the same networking hardware (i.e. switches, hubs, routers) to communicate with the NFS server.
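
To compare checksum counts across hosts, a small filter over the netstat output may help. This is a sketch only: `bad_checksums` is a hypothetical helper, and it assumes the `netstat -p udp` output contains a line of the form `<count> bad checksums`, which may differ between releases.

```shell
#!/usr/bin/sh
# bad_checksums
# Hypothetical helper: reads `netstat -p udp` output on stdin and prints the
# number of bad UDP checksums (0 if no matching line is found).
# Assumes a line shaped like "<count> bad checksums".
bad_checksums() {
    awk '/bad checksum/ { n += $1 } END { print n + 0 }'
}
```

On each system, `netstat -p udp | bad_checksums` would then print a single number that is easy to record before and after a test.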

Again, pure conjecture at this point since the problem isn't reproducing any more so we can't collect network traces to verify whether packets are arriving intact between the clients and servers.

Hint - that would have been my next suggestion if the problem was still happening.

Regards,

Dave

support_5 · ‎09-09-2004

Well, cook had 299 bad checksums. I don't know if this is a lot, but other boxes on the site have either 0 or very few (like 5 or 6) bad checksums. Guam had no bad checksums.

An interesting point to ponder though. You know when I went out and tried mounting on various other hosts around the site to see if it happened on those also? Well, all the boxes I did this on have 5 or 6 bad checksums, all the ones I didn't test this on have 0 bad checksums, and of course cook has heaps.

So this would indicate that it was a networking issue, which is quite possible since the network guys had done some networking work last night, which may have stuffed things up.

This is very interesting, do you have any more information on this available to you?

Thanks for the help! I really appreciate it.

- Andrew Gray

Dave Olker · ‎09-09-2004

Hi Andy,

The fact that all of the clients you tested with have logged checksum failures is a strong indication that something in the network is/was corrupting UDP packets. This would definitely explain the problem, and it fits with other cases I've seen in the past where the UDP checksum failures only occur for certain packets and not others.

You said:
__________________________________________

So this would indicate that it was a networking issue, which is quite possible since the network guys had done some networking work last night, which may have stuffed things up.
__________________________________________

What kind of networking "work" did the network guys do last night? Were they still making changes earlier this evening (or morning, depending upon where you are)? Did someone from the network team reset a router/hub/bridge/switch during your test and that's why the problem went away?

Bottom line, I really doubt you'll be able to pinpoint the exact cause of the problem unless it occurs again. If it does, I recommend taking a series of network traces to see which packets make it from the client to server and back and which ones don't.

You'll likely need to take traces at various points in the network to figure out which hop in the network is causing the packet corruption (assuming there are multiple hops between the client and server). After enough tracing, you should be able to identify the device causing the failures and then get someone to correct it. However, it appears to have corrected itself (or someone corrected it without your knowledge) already.

Best regards,

Dave

Dave Olker · ‎09-09-2004

Hi Andy,

One other suggestion to consider for the future...

HP-UX 11.0 supports NFS/TCP. It is very possible that a TCP mount would not have shown this same problem since some networking hardware tends to treat UDP and TCP traffic differently in these cases. Also, since UDP and TCP headers are different sizes, it's likely that a TCP packet wouldn't have hit the same "magic" byte size that caused the UDP corruption to occur.
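
To illustrate the header-size point, here is a small arithmetic sketch. The function names are hypothetical, and the numbers rest on assumptions: IPv4 with a 20-byte header and no options, an 8-byte UDP header, a 20-byte TCP header with no options, and the 4-byte RPC record marker used over TCP. The actual RPC reply size for the failing file is not known from this thread, so it is left as an input.

```shell
#!/usr/bin/sh
# last_udp_fragment RPC_BYTES MTU
# IP length of the final fragment when one RPC message is sent as a single
# UDP datagram and fragmented by IPv4 (fragment data is a multiple of 8).
last_udp_fragment() {
    datagram=$(( $1 + 8 ))                 # RPC message + UDP header
    per_frag=$(( ( ($2 - 20) / 8 ) * 8 ))  # data bytes per full fragment
    rest=$(( datagram % per_frag ))
    if [ "$rest" -eq 0 ]; then rest=$per_frag; fi
    echo $(( rest + 20 ))                  # add the IPv4 header
}

# last_tcp_segment RPC_BYTES MTU
# IP length of the final segment when the same RPC message is sent over TCP.
last_tcp_segment() {
    stream=$(( $1 + 4 ))                   # RPC message + record marker
    mss=$(( $2 - 40 ))                     # MTU minus IPv4 and TCP headers
    rest=$(( stream % mss ))
    if [ "$rest" -eq 0 ]; then rest=$mss; fi
    echo $(( rest + 40 ))                  # add IPv4 and TCP headers
}

# Example with a made-up 3000-byte RPC message on a 1500-byte MTU:
echo "UDP last fragment: $(last_udp_fragment 3000 1500) bytes"
echo "TCP last segment:  $(last_tcp_segment 3000 1500) bytes"
```

The same message ends in packets of different lengths on the two transports, so a device that mishandles one particular packet size would not necessarily be triggered by both.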

If you're interested in trying NFS/TCP on your 11.0 systems, you would simply issue the following command on both your NFS client and server systems:

# setoncenv NFS_TCP 1

Once you issue this command you either need to reboot the systems or stop/restart all NFS services on the system. If you're not using any NFS services at the time you can do an:

# /sbin/init.d/nfs.server stop
# /sbin/init.d/nfs.client stop

# /sbin/init.d/nfs.client start
# /sbin/init.d/nfs.server start

This will halt and restart all of the necessary NFS daemons with support for NFS/TCP. Also, once you enable TCP on an 11.0 system it becomes the default protocol used for future NFS mounts, unless of course you're either overriding the protocol by using the "proto=udp" mount option, or by using the legacy automounter, which only supports NFS Version 2 mounts using UDP.

Best of luck,

Dave

support_5 · ‎09-09-2004

Hi,

I talked to the network fellas. Apparently they had upgraded the Cisco IOS (operating system) on the Layer 3 switch that the NFS server and client are both connected to. They had to reboot the L3 switch last night. They also said that they have done nothing with it all day, so I don't know why it would suddenly start working.

We wondered if maybe it was a duplex issue, since we have had them in the past, but no, everything seems to be running full-duplex where it should.

Does NFS have anything to do with SNMP? The only other thing I can think of that happened about the time it started working was someone restarted SNMP on the nfs-server. I don't know how such a thing could affect network transmission check-sums though.

The NFS server and NFS client are both connected to the switch, but I don't see how a switch would be manipulating packets. How does that happen? I could almost understand it if it was a router, but a switch? Do you know how a switch could manipulate packets? The NFS server and client are actually plugged in right next to each other on the switch, same subnet, same VLAN, etc.

Any other information?

Thanks.

- Andrew Gray

H.Merijn Brand (procura · ‎09-09-2004

Great show Dave!

What are the advantages of NFS/TCP over NFS/UDP ?
We use NFS a lot, and I don't know the difference, and I never bothered until now. Should I?
FWIW all our servers (HP-UX 10.20, HP-UX 11.00, HP-UX 11.11, AIX-4.3.3, and AIX-5.2.0) have cross mounted NFS all their file systems in both directions :) Sometimes Linux clients also mount any of those.
No auto-mount involved. All manual mounts.

Enjoy, Have FUN! H.Merijn

Sridhar Bhaskarla · ‎09-09-2004

Hi Andrew,

Then I suggest you start doing some tracing the next time this happens.

Look at loading 'ethereal' on the client (maybe your workstation, so you won't affect your production servers) and mount the filesystem.

You can get ethereal from HP's porting center. You will need to look at its dependencies. It may be a little hard to get up and running, but once it is running, it's a beauty. You can trace the packets only between these two servers (using tcpdump's packet filters) and see what's happening.

http://hpux.connect.org.uk/hppd/hpux/Networking/Admin/tcpdump-3.8.3/

From the server side, use 'tusc' with the 'cat' command and see if you get any clues out of it.

http://hpux.connect.org.uk/hppd/hpux/Sysadmin/tusc-7.5/

Even if you don't use them, they are good tools to have.

You can use tcpdump or the built-in nettl (nettladm) to capture the packets, but I personally like ethereal as it has a nice GUI.
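
As a sketch of such a filter (the helper name `nfs_trace_filter` is hypothetical), the function below builds a tcpdump expression that matches all traffic between two hosts. It deliberately does not match on a port: a large NFS reply over UDP is split into IP fragments, and only the first fragment carries the UDP header, so a port filter would miss the rest.

```shell
#!/usr/bin/sh
# nfs_trace_filter CLIENT SERVER
# Hypothetical helper: prints a tcpdump filter expression matching every
# packet exchanged between the two hosts, including IP fragments.
nfs_trace_filter() {
    echo "host $1 and host $2"
}

# Example with the host names from this thread:
nfs_trace_filter cook guam
```

It could then be used as `tcpdump -i lan0 -s 0 -w /tmp/nfs_trace.pcap "$(nfs_trace_filter cook guam)"`, where the interface name `lan0` and the output path are assumptions for illustration.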

-Sri
