Weird NFS issue

Chetan_5 · ‎07-28-2008

We have an NFS server and 2 clients in our production environment, all running NFS v3. The problem is that when a job writes a file on one machine, the file is sometimes not immediately correct on the other client. The size, mod time etc. all look the same, but the cksum doesn't match.

Re-running the job an hour later shows the same issue. I used lsof to check whether some process had the file open; none did. After I remount the filesystem, the file is updated automatically. I tried the noac option to disable attribute caching, but that did not help.

We had the same issue when our NFS server was v2. Thinking it might be a version incompatibility issue, we moved the server to a v3 environment, but no luck.

Is this a bug? Anyone seen this before?

Dennis Handly · ‎07-28-2008

I had something like that several times. It depended on the automounter and block sizes.
"nfsstat -m" showed the bad ones using vers=2.

Looking at the raw file showed a "block" copied twice, with the rest of the file shoved down.

Fabio Ettore · ‎07-29-2008

Hi,

Is the NFS server an SG (Serviceguard) package?

Best regards,
Fabio

WISH? IMPROVEMENT!

Chetan_5 · ‎07-29-2008

Here are the options for the NFS mount
Flags: vers=3,proto=tcp,sec=sys,hard,intr,noac,link,symlink,acl,devs,rsize=32768,wsize=32768,retrans=5,timeo=600
Attr cache: acregmin=3,acregmax=60,acdirmin=30,acdirmax=60

Fabio,

You are correct, the NFS server has the filesystem in a MC/SG package.

Dennis Handly · ‎07-30-2008

>Here are the options for the NFS mount flags: vers=3,proto=tcp,...,rsize=32768,wsize=32768,

The 3 is fine. I'm not sure about rsize and wsize.

How big is the corrupted file in question?
Is this still happening?

Fabio Ettore · ‎07-30-2008

Hi,

I think the problem is related to using TCP rather than UDP as the transport protocol.
See this note in the manual for SG/NFS Toolkit:

http://docs.hp.com/en/B5140-90035/B5140-90035.pdf

Page 10:

• If a server is configured to use NFS over TCP and the client is the same machine
as the server, which results in a loopback NFS mount, the client may hang for
about 5 minutes if the package is moved to another node. The solution is to use
NFS over UDP between NFS-HA-server cross mounts.
• The /etc/rmtab file is not synchronized when an NFS package fails over to the
standby node. This is caused by the design of NFS, which does not keep track of
the state of the rmtab. The man page for rmtab contains a warning that it is not
always totally accurate, so it is also unreliable in a standard NFS server / NFS client
environment.
• AutoFS mounts may fail when mounting file systems exported by an HA-NFS
package soon after that package has been restarted. To avoid these mount failures,
AutoFS clients should wait at least 60 seconds after an HA-NFS package has started
before mounting file systems exported from that package.

Try the following to confirm that this is really the problem:

- switch the NFS package to the second node;
- on the first node, try to mount the NFS filesystem and wait around 8 to 10 minutes without interrupting it. I expect the mount will succeed after that period.

In any case, the solution is to use the mount option proto=udp on the NFS clients.

Please let me know if something is not clear.

HTH.

Best regards,
Fabio

WISH? IMPROVEMENT!

Chetan_5 · ‎07-30-2008

Files are not that big, 50-100KB...

Fabio,

This does not apply in my case. The client is not part of the cluster.

Chetan

Dennis Handly · ‎07-30-2008

>Files are not that big, 50-100KB...

If you can tar up the two files and attach them, I can look at them.

Basically you need two copies: the original file, copied from the server with ftp, and the bad copy as read over NFS. Check the cksums of both first.
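
The cksum check could be scripted along these lines. This is only a sketch: the function name same_cksum is made up for illustration, and it assumes a POSIX shell and the standard cksum utility.

```shell
#!/bin/sh
# same_cksum FILE1 FILE2   (hypothetical helper, not from the thread)
# Prints the CRC and byte count of each file.
# Returns 0 when both agree, 1 when they differ, 2 if a file cannot be read.
same_cksum() {
    # Reading from stdin makes cksum print "CRC SIZE" without the file name,
    # so the two results can be compared as plain strings.
    sum1=$(cksum < "$1") || return 2
    sum2=$(cksum < "$2") || return 2
    echo "$1: $sum1"
    echo "$2: $sum2"
    [ "$sum1" = "$sum2" ]
}
```

For example, with the ftp copy saved under an assumed name such as zmif1704pdt.err.ftp, running `same_cksum zmif1704pdt.err.ftp zmif1704pdt.err && echo same || echo DIFFERENT` shows both checksums and a verdict.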

Chetan_5 · ‎07-31-2008

I will try to attach them. Basically the files look the same, but the bad file has control characters in the last 5-6 records, so the app bombs when it tries to process the file. The cksum is different for the good and the bad file.

CLIENTS:
psapap01:/PRD/mm/iv/out#>cksum zmif1704pdt.err
1668511218 51705 zmif1704pdt.err
psapap01:/PRD/mm/iv/out#>ll zmif1704pdt.err
-rw-rw-rw- 1 vdumpit tech 51705 Jul 30 12:58 zmif1704pdt.err

psapap02:/PRD/mm/iv/out#>cksum zmif1704pdt.err
307133814 51705 zmif1704pdt.err
psapap02:/PRD/mm/iv/out#>ll zmif1704pdt.err
-rw-rw-rw- 1 vdumpit tech 51705 Jul 30 12:58 zmif1704pdt.err

SERVER:
psapcl02:/PRD/mm/iv/out#>cksum zmif1704pdt.err
307133814 51705 zmif1704pdt.err
psapcl02:/PRD/mm/iv/out#>ll zmif1704pdt.err
-rw-rw-rw- 1 vdumpit tech 51705 Jul 30 12:58 zmif1704pdt.err
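
One rough way to confirm the control characters on each host is to count the bytes that are neither printable ASCII nor whitespace. The sketch below assumes a POSIX shell with tr and wc; the function name count_ctrl_bytes is made up for illustration.

```shell
#!/bin/sh
# count_ctrl_bytes FILE   (hypothetical helper, not from the thread)
# Prints the number of bytes in FILE that are neither printable ASCII
# nor whitespace (space, tab, newline, and so on). Returns 2 if FILE
# cannot be read.
count_ctrl_bytes() {
    [ -r "$1" ] || return 2
    # LC_ALL=C makes the character classes byte-oriented; tr deletes all
    # printable and whitespace bytes, and wc counts what is left over.
    LC_ALL=C tr -d '[:print:][:space:]' < "$1" | wc -c | tr -d ' '
}
```

Running `count_ctrl_bytes zmif1704pdt.err` on each client and on the server should print 0 for a clean text file and a non-zero count where the copy is damaged.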

Dennis Handly · ‎07-31-2008

>the bad file has control characters at the last 5-6 records.

Hmm, mine was bad near the start.
You may want to dump the file in hex then compare:
xd -tx4 zmif1704pdt.err > zmif1704pdt.xd
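
As an alternative to comparing the hex dumps by eye, cmp -l can locate the first differing byte, and that offset can be mapped onto the rsize/wsize block it falls in. This is a sketch under assumptions: the function name first_diff_block is made up, the default block size of 32768 is taken from the mount flags posted above, and both files are assumed to be the same size (as they are here).

```shell
#!/bin/sh
# first_diff_block GOOD BAD [BLOCKSIZE]   (hypothetical helper)
# Reports the 0-based offset of the first differing byte and the block
# it falls in. Returns 0 if no byte differs, 1 if one does, 2 on error.
first_diff_block() {
    [ -r "$1" ] && [ -r "$2" ] || return 2
    bs=${3:-32768}
    # cmp -l prints one line per differing byte: "OFFSET OCTAL1 OCTAL2",
    # with OFFSET counted from 1. Only the first line is needed.
    first=$(cmp -l "$1" "$2" 2>/dev/null | awk 'NR == 1 { print $1 }')
    if [ -z "$first" ]; then
        echo "no differing bytes found"
        return 0
    fi
    off=$((first - 1))
    echo "first difference at byte offset $off (block $((off / bs)) of size $bs, $((off % bs)) bytes into the block)"
    return 1
}
```

If the first difference sits exactly on a block boundary, that would point towards a whole block being misplaced, as in the case described earlier in the thread; a difference in the middle of a block would suggest something else.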
