
High nfile Utilization

 
SOLVED
Ray White
Advisor

High nfile Utilization

We have a server that periodically shows very high values of this metric (TBL_FILE_TABLE_UTIL), at or near 100%, which is apparently causing issues. See the attached graph.

Is there any way to identify which process(es) are causing this? My Capacity Planning group doesn't have direct access to the server, but we could ask the Admins to look at this if there is any way to "drill down" beyond the global level.

Thanks,

Ray White
6 REPLIES
Alzhy
Honored Contributor
Solution

Re: High nfile Utilization

You can use the lsof utility.

For a large server, you will typically need to raise nfile to some high value. Remember, everything in UNIX is a file or involves a file descriptor, and every connection, whatever the means, involves a file descriptor.

So simply increase your kernel nfile. Study the lsof output on your heaviest days and adjust nfile upwards accordingly.

On our DB server supporting 2,000 users, we have it set to 100,000.
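
For example, a quick baseline check before and after a change might look something like this (a minimal sketch, assuming an HP-UX 11i system with lsof installed; kctune is the 11i v2/v3 tool, older releases use kmtune):

# current nfile setting and how full the file table is
kctune nfile              # or: kmtune -q nfile on older releases
sar -v 5 12               # the file-sz column shows used/maximum entries

# rough count of open file descriptors as seen by lsof
lsof 2>/dev/null | wc -l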
Hakuna Matata.
Ray White
Advisor

Re: High nfile Utilization

Thanks for the rapid response, Mr. Caparroso. We beat you by 50% - we have nfile set to 150,010! I'll check into using lsof to try to track down the root cause.

Ray
TwoProc
Honored Contributor

Re: High nfile Utilization

I can't think of a "direct" way, but you might be able to infer it with PerfView (which you have).

Create application groups so that whole groups of applications are collected individually. You can then draw graphs in PerfView that show not only NFILE (like your graph) but also the metrics from the application groups (via DrillDown). From these you can see whether a relationship can be inferred, i.e. whether the number of open files climbs when activity in a certain application group climbs (across a broad range of metrics, but you can usually spot a match visually).

Then break the programs from a candidate group that looks like it matches the trend down into even more tightly defined sub-groups.

By dividing and conquering into smaller and smaller PerfView groups, you may find the one that is doing the bulk of the file opens.

I say "may" because you'll probably find it's a shadow process running on behalf of database users, and the open files are actually opened by the database itself; still, the cause of the high open-file count could be a certain program accessing loads of data, which CAUSED the database to open lots of files.
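
To make that concrete, the application groups live in the MWA parm file (/var/opt/perf/parm). A hypothetical excerpt (the group names and program patterns here are made up for illustration) might look like:

# /var/opt/perf/parm (excerpt)
application = db_shadow
file = ora*

application = batch_reports
file = rptgen*, extract*
user = batch

After editing the parm file, restart the scope collector (mwa restart scope, or ovpa restart scope on newer agents) so the new groups start collecting.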
We are the people our parents warned us about --Jimmy Buffett
Ray White
Advisor

Re: High nfile Utilization

Thank you, Mr. Joubert. Making changes to the MWA parm file to add applications is somewhat of a major bureaucratic hassle for us, but might be worthwhile for this. I did drill down to process-level I/O data and found some I/O intensive processes, but it seems that there may not be a strict correlation to nfile utilization: some processes may only have a few files open, but do massive I/O on them, while other processes may open large numbers of files but not do anything with them. And MeasureWear process-/application-level data doesn't identify the particular script that ran, only the program name and user ID. We'll see what we can come up with.

Ray
Ray White
Advisor

Re: High nfile Utilization

Oops, I guess I can't edit my post in this forum. Sorry for the typo; MeasureWear would be more of a tailor kind of thing. ;=)
TwoProc
Honored Contributor

Re: High nfile Utilization

OK Ray,

Here's an ugly way to get the top 10 programs that have open files, from a single snapshot in time. It requires the "lsof" tool to be installed. You can easily change it to give you the top 20, top 100, etc. You could also easily change it to ignore anything being run by root, oracle, www, etc., if you happen to know, for example, that it's not one of those processes you're looking for.

This assumes that a search for "REG" in the output of lsof is going to catch all open regular files. I think that's right, but if it's not, and anyone out there knows a better way, PLEASE TELL ME and I'll fix it for Ray.
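
(If the plain grep turns out to be too loose, a tighter filter, assuming TYPE is the fifth column of your lsof output, would be

lsof | awk '$5 == "REG"'

used in place of the lsof | grep REG at the top of the script.)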

Also, from what I've seen from our Perl people, it could probably be done much better in Perl, but this is just something I whipped up to maybe find the problem. Our great group of Perl forumers may feel compelled to contribute something nicer that does the same, which would certainly be welcome.

#!/bin/ksh
# top10_nfile - single lsof snapshot: the ten (program, user, pid)
# combinations holding the most open regular files.
#
# lsof columns: COMMAND PID USER FD TYPE ...; keep command, user, pid.
lsof | grep REG | awk '{ print $1" "$3" "$2 }' \
| sort > /tmp/t10_nfile.dat.$$
# just add "| grep -v root" in the above line
# before the "awk" to skip programs run by root
uniq /tmp/t10_nfile.dat.$$ > /tmp/t10_nfile.uniq.$$
echo nfile_open program user processid
while read prog user procid
do
# anchor the pattern so that pid 123 does not also match pid 1234
echo `grep -c "^$prog $user $procid$" /tmp/t10_nfile.dat.$$` \
$prog $user $procid
done < /tmp/t10_nfile.uniq.$$ | sort -n | tail
rm /tmp/t10_nfile.dat.$$ /tmp/t10_nfile.uniq.$$

We are the people our parents warned us about --Jimmy Buffett