Operating System - HP-UX

Stefan Schulz
Honored Contributor

Performance problem, perhaps nfs

I have an interesting performance problem on my main fileserver. I hope somebody can help me.

The system:
A D380/1 with 384 MB RAM running HP-UX 10.20. It has a RAID 5 system from MTI with 350 GB capacity attached. The operating system is on a Seagate ST39173WC, and there is an additional data disk (also a ST39173WC).

The symptoms:
Several times a day, following no pattern, I get very high loads on this server. The load shown by top goes up to 10 or even higher for several minutes and then drops to its normal value below 1. During these high-load periods, top shows all the nfsd daemons eating up the CPU time.

Since top showed all those nfsds, I had a look at the NFS performance tuning document I found at docs.hp.com, but I can't find any problems with NFS.

Here is what I checked:

It says: check with netstat -m whether "requests for memory denied" is high. This value is zero.

It says: check with vmstat -n whether the us and sy values under cpu are high and the id value is close to zero. Here the us and sy values are close to zero and id is at 95.

It says: check with nfsstat -s whether readlink is of the same magnitude as lookup calls. Readlink is only 2% of lookup.

It says: check with nfsstat -s whether getattr is greater than 60%. But getattr is at only 3%.

It says: check with netstat -s whether the udp socket overflows value is high. This value is also zero.
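
For reference, these are roughly the commands I used for the checks above, run on the server itself (the grep patterns are just my shorthand for picking out the lines mentioned in the tuning document):

netstat -m | grep -i "requests for memory denied"    # mbuf shortages, should be 0
vmstat -n                                            # watch the us/sy/id columns under cpu
nfsstat -s                                           # compare readlink vs. lookup and check the getattr share
netstat -s | grep -i "socket overflows"              # UDP socket overflows, should be 0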

This server is a fileserver only; there is no database or any other application running. Also, normally no user is logged in.

I increased the number of nfsds from 32 to 64, but the symptoms are still the same. As I have only four filesystems exported, this value seems too high, but reducing it doesn't help either. The values I have set in /etc/rc.config.d/nfsconf are:

NFS_CLIENT=1
NFS_SERVER=1
NUM_NFSD=64
NUM_NFSIOD=4
PCNFS_SERVER=0
AUTOMOUNT=0

I have a snapshot of the top output attached. Please, can somebody give me a tip about what is going on with this system? Any hint is greatly appreciated.

Regards Stefan
No Mouse found. System halted. Press Mousebutton to continue.
Michael Tully
Honored Contributor

Re: Performance problem, perhaps nfs

Hi Stefan,

Dumb questions for you...
What is the disk I/O like during this peak?
What does Glance report for disk activity?
Are there any automatic file copies from other servers, cron jobs, etc.?
Check the syslog for any disk-related error messages.

I don't think increasing the number of nfsd processes will help if your CPU usage and load are already high.

Cheers
Michael
Anyone for a Mutiny ?
Stefan Schulz
Honored Contributor

Re: Performance problem, perhaps nfs

Hi Michael,

I do have a lot of disk I/O the whole day. Until now I couldn't find anything unusual during such a peak, but I will have a closer look again.

Glance is not installed on this server, and I would prefer not to install it. All I could do is install the trial version, as I won't get the money to buy Glance.

There are no automatic file copies. But I will check whether the load is that high during the nightly backups.

I'm sorry I didn't mention that I already checked the log files for errors. There are no errors relating to disks or network/NFS.

I tried to check whether this is a disk-I/O-related problem, so I moved and copied several GB of data to and from this server. During the write tests to this server the load did increase a lot, but it didn't reach the high peaks.

Some more info about our environment: this server holds files from Interleaf (documentation), ME10 (CAD) and Promis (ECAD). Only the Interleaf files can reach sizes of up to 100 MB; all other files are usually about 1 MB. The clients are connected to the network at 10 Mbit, while the servers have a 100 Mbit connection.

I will check the disk activity during the next peak with sar -d. What would you consider a high or problematic I/O value?
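
For reference, I plan to sample it roughly like this during the next peak (interval and count picked arbitrarily):

sar -d 5 60    # disk activity every 5 seconds for about 5 minutes
               # watch the %busy and avque columns for the RAID device files

If %busy stays near 100 with a growing avque, I would take that as a sign that the disks themselves are the bottleneck.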

Any more ideas?

Regards Stefan

PS: I will assign points later, when I know how much your answer helped.
No Mouse found. System halted. Press Mousebutton to continue.
Vincent Stedema
Esteemed Contributor

Re: Performance problem, perhaps nfs

Hi Stefan,

How many clients are you servicing with this NFS server? And, do you use any specific parameters for the NFS mounts? You might want to look into the possibility of changing the wsize and rsize parameters, especially with the file sizes you mentioned (1 to 100 MB).

Regards,

Vincent
Michael Tully
Honored Contributor

Re: Performance problem, perhaps nfs

Hi Stefan,

You can install a trial copy of Glance from your application CD set if you have it (provided you have sufficient disk space to do so). This way you can track down what is causing your disk I/O load.

HTH
Michael
Anyone for a Mutiny ?
Stefan Schulz
Honored Contributor

Re: Performance problem, perhaps nfs

Hi Vincent,

At the moment we have about 120 clients connected to this server.
We don't use any specific parameters. The /etc/exports looks like this:

/kedaten -anon=65534,root=ecad_024
/daten -anon=65534,root=dotserv,root=cadserv
/daten2 -anon=65534,root=dotserv,root=cadserv
/daten3 -anon=65534,root=dotserv,root=cadserv

The /etc/fstab on the clients looks like this:

keserv:/daten/dotdaten /dotdaten nfs rw,suid 0 0
keserv:/daten2/caddaten /caddaten nfs rw,suid 0 0

How, and with which values, would you modify the wsize/rsize parameters?

Of course not all clients have all exported filesystems mounted.

Hi Michael,

I know about the trial version of Glance, but disk space is a real problem on this server. Also, I don't like installing/removing trial software on a production server. But I will do it if this is the only way to track down the problem.

Regards Stefan
No Mouse found. System halted. Press Mousebutton to continue.
Vincent Stedema
Esteemed Contributor

Re: Performance problem, perhaps nfs

Hi Stefan,

The rsize and wsize parameters determine which block size is used to read from and write to NFS mounts. I'm not completely sure, but I think the default block size is 8 KB. As your clients load rather large files from the NFS shares, this means that a single read will trigger multiple read requests at the server. The same goes for writes, of course. If, however, you increase the wsize and rsize parameters, fewer requests will be needed to read or write a file.
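
Just as a back-of-the-envelope illustration (assuming the server and clients could actually negotiate a larger transfer size): reading a 100 MB file with an 8 KB rsize means roughly 100 * 1024 / 8 = 12,800 read requests, whereas a 32 KB rsize would cut that to about 3,200.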

On the other hand, increasing wsize and rsize also increases the probability that a single request will fail due to the UDP nature of the NFS mount and the fact that more IP fragments are needed to transfer the read/write request over the network.

I don't know if adjusting these parameters will make any difference in your case. I agree with Michael that 64 nfsds is a rather large number, and it could be that this is causing the high load on your machine.

Regards,

Vincent
Shannon Petry
Honored Contributor

Re: Performance problem, perhaps nfs

Another thing to look at is the clients, and not the server. I would recommend that you try to modify a few clients (easy if you're automounting) and change the read/write buffer size to a small value, i.e.
add "rsize=2048,wsize=2048" to each client's mount options.

While this may not seem like a good way to tackle the problem, the default buffer is huge. This means that if there are any network problems, the retransmits are also huge. It also means that many clients are waiting for their data, instead of an even distribution of smaller amounts of data.

See if it helps!

Regards,
Shannon
Microsoft. When do you want a virus today?
Vincent Fleming
Honored Contributor

Re: Performance problem, perhaps nfs

You might be seeing users bashing the system by doing things like running "find", or several users writing large (100 MB, as you said) files to the server at the same time. Take a close look at the network activity during one of these slowdowns, and again when things are more normal, and see if there is a very large increase in network traffic during the slow periods. If so, you might try adding network interfaces to the machine.
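
A simple way to compare, without installing anything (just a sketch; take one snapshot during a slowdown and one during a normal period, then look at the deltas):

netstat -i     # packet counts per interface on the server
nfsstat -s     # per-call NFS server counters, shows what the clients are asking for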

Your 'top' output shows very high system CPU utilization, too. This makes me think of "find" again. find can do that, as it's searching directories, not transferring lots of actual data.

Good luck
No matter where you go, there you are.
Stefan Schulz
Honored Contributor

Re: Performance problem, perhaps nfs

Hello again,

Sorry it took me so long to get back, but I had to test your suggestions first. Unfortunately, the problem still exists :-(

Vincent and Shannon:

Playing with the rsize and wsize parameters had no effect. Also, as far as I have learned in the meantime, these parameters should match the block size of the exported HFS filesystems, so I put them back to 8 KB.

Vincent:

This server is a fileserver only; there are no users logged in. I couldn't find any processes like find or anything else. I also couldn't find a client doing excessive searches on the mounted directories.

What I did/found in the meantime:

I reduced the number of nfsds back to 32. This should be about the optimal number for this server.

From today on, the network to this server is under constant observation, so that we can analyze it better during/after the next peak.

We are reconfiguring the ECAD software so that an important control file is kept locally. This will be done by the end of the week.

I enabled asynchronous writes on the exported filesystems. This should improve NFS performance, but I'm afraid it may also produce errors. As I enabled this only 2 hours ago, I don't know about the effects yet.
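
For the record, this is roughly what I changed, shown for one of the filesystems (I did the same for the others):

# /etc/exports - added the async option so writes are acknowledged before they reach disk
/daten  -async,anon=65534,root=dotserv,root=cadserv

# then re-export everything
exportfs -a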

Do you have any more suggestions? As this problem still isn't solved, any new hint is appreciated.

Regards Stefan
No Mouse found. System halted. Press Mousebutton to continue.
Bill McNAMARA_1
Honored Contributor

Re: Performance problem, perhaps nfs

Using lanadmin, find out whether there are many TCP errors received on your network interface.

Can you show the output of vgdisplay -v vgdata, strings /etc/lvmtab, and ioscan -fnkC disk, plus any information specific to the array, just to see whether the array itself is configured well?
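
i.e. something along these lines (vgdata is just my guess at your volume group name, substitute your own; lanadmin is menu-driven, so look for the error counters under the LAN interface statistics):

vgdisplay -v vgdata     # logical volume layout of the data volume group
strings /etc/lvmtab     # which physical disks belong to which volume group
ioscan -fnkC disk       # hardware paths and device files of all disks
lanadmin                # interactive; check the interface error counters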

Later,
Bill
It works for me (tm)
Stefan Schulz
Honored Contributor

Re: Performance problem, perhaps nfs

lanadmin shows nearly no errors. Both input and output errors are lower than 0.1% of the packets sent/received.

I have attached the data you were asking for. The RAID system is attached to a separate FWD SCSI interface and equipped with 8 x 50 GB disks, which gives us about 350 GB of capacity. As you can see, we have 3 logical volumes configured.

Syslog and the tools for the RAID don't show any SCSI- or disk-related errors or warnings. The integrity of the RAID is absolutely fine. Also, there is not a single error on the disks in the RAID.

All filesystems are HFS; upgrading to VxFS is not an option at the moment.

What else can I tell you? Is there any important information missing?
No Mouse found. System halted. Press Mousebutton to continue.
Mladen Despic
Honored Contributor

Re: Performance problem, perhaps nfs

Stefan,

How long has this been going on? If the high CPU load just started one day, what changed in your environment at that time? If you are aware of anything like that, it may be a clue.

Or could this be normal for your environment? What are the NFS clients doing? Have you seen the same client activity cause significantly lower CPU utilization by the nfsd processes on this server before, or on a different NFS server?

Mladen
Vincent Fleming
Honored Contributor

Re: Performance problem, perhaps nfs

I recently read that because of a bug in the process management of earlier versions of HP-UX, nfsds should be limited to 4 to 8 processes. Perhaps you're seeing this. Apparently, the problem has to do with the system thrashing processes in and out of the run queue inappropriately. The more nfsds there are, the more spinning the system does. Try lowering your nfsds to 4 and see what happens.
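
If you want to try it, roughly like this (from memory, so double-check the script name on 10.20; note that restarting briefly interrupts NFS service):

# in /etc/rc.config.d/nfsconf
NUM_NFSD=4

# then restart the NFS server daemons
/sbin/init.d/nfs.server stop
/sbin/init.d/nfs.server start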
No matter where you go, there you are.
Stefan Schulz
Honored Contributor

Re: Performance problem, perhaps nfs

Finally I am able to post again (something was wrong with the forum).

First of all: the problem seems to be gone! We have had no more high loads since I enabled asynchronous writes on the exported filesystems. Although I am happy that the problem is gone, this leaves me with two more questions.

Does this tell me where the bottleneck was?

How high is the risk of losing data with asynchronous writes enabled?


Mladen:

This started about two months ago, very slowly. Every week the load peaks became higher and higher. I couldn't detect any problem on the client side. Also, there is no comparable NFS server around.

Vincent:

This is interesting. Could you provide me with a link to more information? I think we have all the necessary patches installed, but you never know. Limiting the nfsds to 8 goes against all the performance documents I have found, so I would like to know more about this problem first. This is our main fileserver, so I try to keep tests on this machine to a minimum.

Thanks to all for your tips and information.

Regards Stefan
No Mouse found. System halted. Press Mousebutton to continue.