how to use lsof to find files

NDO
Super Advisor

how to use lsof to find files

 

 

Hi

 

My /var is full, but I suspect someone has deleted files while they were still open. Could I please get some help using lsof to find out whether there are any?

13 REPLIES
Patrick Wallek
Honored Contributor

Re: how to use lsof to find files

Try:

 

# lsof +aL1 /var

 

From the lsof man page:

 

"A specification of the form '+aL1 <file_system>' will select unlinked open files on the specified file system."

 

NDO
Super Advisor

Re: how to use lsof to find files

Hi

 

That command produced the following output, but how do I spot unlinked open files? Is it from the

#lsof +aL1 /var | more
COMMAND    PID   USER   FD   TYPE DEVICE  SIZE/OFF NLINK  NODE NAME
sh         160 ora10g    1w   REG 64,0x8  58793984     0 10164 /var (/dev/vg00/lvol8)
sh         160 ora10g    2w   REG 64,0x8  58793984     0 10164 /var (/dev/vg00/lvol8)
sh         163 ora10g    1w   REG 64,0x8  58793984     0 10164 /var (/dev/vg00/lvol8)
sh         163 ora10g    2w   REG 64,0x8  58793984     0 10164 /var (/dev/vg00/lvol8)
sh         232 ora10g    1w   REG 64,0x8  22306816     0 40030 /var (/dev/vg00/lvol8)
sh         232 ora10g    2w   REG 64,0x8  22306816     0 40030 /var (/dev/vg00/lvol8)
sh         252 ora10g    1w   REG 64,0x8  55246848     0 37931 /var (/dev/vg00/lvol8)
sh         252 ora10g    2w   REG 64,0x8  55246848     0 37931 /var (/dev/vg00/lvol8)
sh         401 ora10g    1w   REG 64,0x8  23371776     0 23372 /var (/dev/vg00/lvol8)
sh         401 ora10g    2w   REG 64,0x8  23371776     0 23372 /var (/dev/vg00/lvol8)
sh         403 ora10g    1w   REG 64,0x8  23371776     0 23372 /var (/dev/vg00/lvol8)
sh         403 ora10g    2w   REG 64,0x8  23371776     0 23372 /var (/dev/vg00/lvol8)
sshd       946   root    3u  unix 64,0x8       0t0     0  6614 /var/spool/sockets/pwgr/client945 (0xbbf523c0)
lpsched   1599     lp    3u  unix 64,0x8       0t0     0     6 /var/spool/sockets/pwgr/client1598 (0xbbf52ac0)
swagentd  1644   root    5u   REG 64,0x8        79     0  9404 /var (/dev/vg00/lvol8)
sh        1848 ora10g    1w   REG 64,0x8  32456704     0 36256 /var (/dev/vg00/lvol8)
sh        1848 ora10g    2w   REG 64,0x8  32456704     0 36256 /var (/dev/vg00/lvol8)
sh        1870 ora10g    1w   REG 64,0x8   8699904     0 10455 /var (/dev/vg00/lvol8)
sh        1870 ora10g    2w   REG 64,0x8   8699904     0 10455 /var (/dev/vg00/lvol8)
sh        1874 ora10g    1w   REG 64,0x8   8699904     0 10455 /var (/dev/vg00/lvol8)
sh        1874 ora10g    2w   REG 64,0x8   8699904     0 10455 /var (/dev/vg00/lvol8)
sh        2395 ora10g    1w   REG 64,0x8  45506560     0 15219 /var (/dev/vg00/lvol8)
sh        2395 ora10g    2w   REG 64,0x8  45506560     0 15219 /var (/dev/vg00/lvol8)
sh        2398 ora10g    1w   REG 64,0x8  45506560     0 15219 /var (/dev/vg00/lvol8)
sh        2398 ora10g    2w   REG 64,0x8  45506560     0 15219 /var (/dev/vg00/lvol8)
sh        2547 ora10g    1w   REG 64,0x8  51781632     0 40792 /var (/dev/vg00/lvol8)
sh        2547 ora10g    2w   REG 64,0x8  51781632     0 40792 /var (/dev/vg00/lvol8)
sh        2555 ora10g    1w   REG 64,0x8  51781632     0 40792 /var (/dev/vg00/lvol8)
sh        2555 ora10g    2w   REG 64,0x8  51781632     0 40792 /var (/dev/vg00/lvol8)
sh        2602 ora10g    1w   REG 64,0x8  16236544     0 28060 /var (/dev/vg00/lvol8)
sh        2602 ora10g    2w   REG 64,0x8  16236544     0 28060 /var (/dev/vg00/lvol8)
sh        2606 ora10g    1w   REG 64,0x8  16236544     0 28060 /var (/dev/vg00/lvol8)
sh        2606 ora10g    2w   REG 64,0x8  16236544     0 28060 /var (/dev/vg00/lvol8)
sh        3257 ora10g    1w   REG 64,0x8  31891456     0 10271 /var (/dev/vg00/lvol8)
sh        3257 ora10g    2w   REG 64,0x8  31891456     0 10271 /var (/dev/vg00/lvol8)

 NLINK column?

 

 

Matti_Kurkela
Honored Contributor

Re: how to use lsof to find files

Yes. If a file has been deleted (unlinked) but is still open, its NLINK value will be 0.

 

The "FD" column may be useful too: values "1w" and "2w" refer to file descriptors #1 and #2 of the process, opened for writing, which usually are the standard output and standard error streams of the process.

 

For example, your "sh" process with PID 160 has had its output and error streams redirected to a file in the /var filesystem that had inode number #10164. The file has now been deleted, so its name is no longer known to the system; at this point, it can only be identified by the combination of the filesystem mount point + inode number. The process with PID 163 has its output and error streams directed to the same inode: my first guess would be that these two processes might be related to a cron job that has not finished for some reason. Running "ps -fp 160,163" might produce enlightenment.
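As a rough sketch of the reading described above, the lsof output can be grouped by inode so that each deleted-but-open file shows up once, together with the PIDs still holding it. The function name holders_by_inode is made up for illustration, and the field positions ($2 PID, $5 TYPE, $7 SIZE/OFF, $8 NLINK, $9 NODE) assume the HP-UX column layout shown in this thread; other lsof builds may order columns differently.

```shell
# Summarise `lsof +aL1 <filesystem>` output read from stdin:
# one line per unlinked regular file, with the PIDs that keep it open.
# Assumed columns (as in this thread): $2 PID, $5 TYPE, $7 SIZE/OFF,
# $8 NLINK, $9 NODE. The header line is skipped because its TYPE
# field is not "REG".
holders_by_inode() {
    awk '
    $5 == "REG" && $8 == 0 {
        if (!($9 in size)) order[++n] = $9   # remember first-seen order
        size[$9] = $7
        if (!seen[$9, $2]++)                 # list each PID once per inode
            pids[$9] = (pids[$9] == "" ? $2 : pids[$9] "," $2)
    }
    END {
        for (i = 1; i <= n; i++)
            printf "inode %s size %s pids %s\n", order[i], size[order[i]], pids[order[i]]
    }'
}
```

Typical use would be `lsof +aL1 /var | holders_by_inode`; the comma-separated PID list on each line can be passed straight to `ps -fp`.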

 

Apparently the names of open Unix domain sockets can still be recovered even if the socket has been deleted: note the /var/spool/sockets/pwgr/client* entries in your lsof output. These sockets are for communicating with the pwgrd daemon: it provides a cache for user/group information.

 

Note that all the deleted-but-still-open regular files in your lsof output seem to be standard output & error streams from shell processes run by user "ora10g". You might want to check any cron jobs of the ora10g user.

 

On an Oracle database account, any cron jobs are most commonly related to database backups or other data manipulations, so you might want to contact the responsible DBA. Together with the DBA, you should find out what these shell processes are and whether they can be safely killed off. Then the next step is figuring out the root cause: why are they producing such large outputs (as the SIZE/OFF column suggests)? Is it a so-far-unnoticed consequence of another problem that has already been dealt with (e.g. a problem with a backup tape library that has already been fixed, but caused some database backup jobs to hang), or is some modification needed to avoid this happening again?

MK
NDO
Super Advisor

Re: how to use lsof to find files

Hi

 

Thank you so much for your reply, it shed some light on this. The DBA has now told me that he commented out all the cron entries some time ago, because some of them were scripts that FTP data from this server to another; some of those transfers were failing due to network issues and were also causing /var/spool/mqueue to fill up.

Also from your suggestion:

#ps -fp 160,163
ps: error on write
ps: error on write
     UID   PID  PPID  C    STIME TTY       TIME COMMAND
  ora10g   160     1  0  Sep  2  ?         0:00 sh -c /data8/oradata/files/rms/cronFTP_1830.ksh
  ora10g   163   160  0  Sep  2  ?         0:00 /usr/bin/sh /data8/oradata/files/rms/cronFTP_1830.ksh
mcelrate[301]/ #

Those should never be running now. Can I use "kill <PID>", i.e. kill 160 and kill 163? Would this free up some space on /var?

 

I also have some ftp processes:

 

sh       24235 ora10g    2w   REG 64,0x8   3031040     0  9485 /var (/dev/vg00/lvol8)
ftp      24256 ora10g    0u   REG 64,0x8         0     0 50826 /var (/dev/vg00/lvol8)
ftp      24256 ora10g    2w   REG 64,0x8   1196032     0 21910 /var (/dev/vg00/lvol8)
ftp      24260 ora10g    0u   REG 64,0x8         0     0 50828 /var (/dev/vg00/lvol8)
ftp      24260 ora10g    2w   REG 64,0x8  16236544     0 28060 /var (/dev/vg00/lvol8)
ftp      24267 ora10g    0u   REG 64,0x8         0     0 14797 /var (/dev/vg00/lvol8)
ftp      24267 ora10g    2w   REG 64,0x8  38739968     0 30468 /var (/dev/vg00/lvol8)
ftp      24268 ora10g    0u   REG 64,0x8         0     0 18035 /var (/dev/vg00/lvol8)
ftp      24268 ora10g    2w   REG 64,0x8   3031040     0  9485 /var (/dev/vg00/lvol8)
ftp      24272 ora10g    0u   REG 64,0x8         0     0 23821 /var (/dev/vg00/lvol8)
ftp      24272 ora10g    2w   REG 64,0x8  59826176     0 39401 /var (/dev/vg00/lvol8)
ftp      24279 ora10g    0u   REG 64,0x8         0     0 14243 /var (/dev/vg00/lvol8)
ftp      24279 ora10g    2w   REG 64,0x8   1228800     0 10353 /var (/dev/vg00/lvol8)
ftp      24280 ora10g    0u   REG 64,0x8         0     0 37027 /var (/dev/vg00/lvol8)
ftp      24280 ora10g    2w   REG 64,0x8  36782080     0 11248 /var (/dev/vg00/lvol8)
ftp      24284 ora10g    0u   REG 64,0x8         0     0 39370 /var (/dev/vg00/lvol8)
ftp      24284 ora10g    2w   REG 64,0x8  31891456     0 10271 /var (/dev/vg00/lvol8)
sh       24305 ora10g    1w   REG 64,0x8   2744320     0  9503 /var (/dev/vg00/lvol8)

 

Matti_Kurkela
Honored Contributor

Re: how to use lsof to find files

Note the STIME (start time) field: these processes were started on September 2 (probably year 2013). They have been hanging for a long time.

 

Killing both these processes would allow the inode #10164 to be freed, and that will probably free up around 58793984 bytes of space. However, if other processes are waiting to write to their temporary files and the filesystem is currently 100% full, they might immediately use up some or all of that space.
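To put a number on how much space might come back overall, here is a minimal sketch that adds up SIZE/OFF for the unlinked regular files, counting each inode only once because several processes can hold the same file. It assumes, as the estimate above does, that SIZE/OFF is the file size for these entries and that the columns are laid out as in this thread; the function name reclaimable_bytes is hypothetical.

```shell
# Estimate bytes held by unlinked regular files in `lsof +aL1 <filesystem>`
# output read from stdin. Each inode ($9) is counted once, however many
# processes or file descriptors still reference it.
# Assumed columns: $5 TYPE, $7 SIZE/OFF, $8 NLINK, $9 NODE.
reclaimable_bytes() {
    awk '
    $5 == "REG" && $8 == 0 && !counted[$9]++ { total += $7 }
    END { printf "%.0f\n", total }'
}
```

For example, `lsof +aL1 /var | reclaimable_bytes` prints a single byte count. The space is only released once every process holding a given inode has exited or closed it, so the figure is an upper bound on what one round of kills will recover.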

 

Since you said these scripts are essentially FTP data transfers, I guess these processes can safely be killed.

You should check all the remaining PIDs in your lsof output the same way.

 

(The sshd, lpsched and swagentd daemons should probably be restarted to let them recreate their pwgrd sockets, rather than simply killed off.)

MK
NDO
Super Advisor

Re: how to use lsof to find files

Why do some of these processes apparently not exist?

ftp       5841 ora10g    0u   REG 64,0x8         0     0 50721 /var (/dev/vg00/lvol8)
ftp       5841 ora10g    2w   REG 64,0x8  14761984     0 28575 /var (/dev/vg00/lvol8)
ftp       5861 ora10g    0u   REG 64,0x8         0     0 44754 /var (/dev/vg00/lvol8)
ftp       5861 ora10g    2w   REG 64,0x8   2195456     0 10056 /var (/dev/vg00/lvol8)
ftp       5877 ora10g    0u   REG 64,0x8         0     0 42466 /var (/dev/vg00/lvol8)
ftp       5877 ora10g    2w   REG 64,0x8   2834432     0  9794 /var (/dev/vg00/lvol8)
ftp       5885 ora10g    0u   REG 64,0x8         0     0 36533 /var (/dev/vg00/lvol8)
ftp       5885 ora10g    2w   REG 64,0x8  39297024     0 35672 /var (/dev/vg00/lvol8)
sh        6144 ora10g    1w   REG 64,0x8  45703168     0 10064 /var (/dev/vg00/lvol8)
sh        6144 ora10g    2w   REG 64,0x8  45703168     0 10064 /var (/dev/vg00/lvol8)
sh        6168 ora10g    1w   REG 64,0x8   7233536     0 10517 /var (/dev/vg00/lvol8)
sh        6168 ora10g    2w   REG 64,0x8   7233536     0 10517 /var (/dev/vg00/lvol8)
sh        6171 ora10g    1w   REG 64,0x8   7233536     0 10517 /var (/dev/vg00/lvol8)
sh        6171 ora10g    2w   REG 64,0x8   7233536     0 10517 /var (/dev/vg00/lvol8)
ftp       6240 ora10g    0u   REG 64,0x8         0     0 50912 /var (/dev/vg00/lvol8)
ftp       6240 ora10g    2w   REG 64,0x8   2867200     0  9599 /var (/dev/vg00/lvol8)
ftp       6252 ora10g    0u   REG 64,0x8         0     0 50788 /var (/dev/vg00/lvol8)
ftp       6252 ora10g    2w   REG 64,0x8   6356992     0 25810 /var (/dev/vg00/lvol8)
ftp       6260 ora10g    0u   REG 64,0x8         0     0 46813 /var (/dev/vg00/lvol8)
mcelrate[320]/ #kill 5841
mcelrate[321]/ #kill 5861
kill: 5861: The specified process does not exist.
mcelrate[322]/ #kill 5885
kill: 5885: The specified process does not exist.
mcelrate[323]/ #

 as stated above

 

And if I pick, for example, PID 401:

 

sh         252 ora10g    2w   REG 64,0x8  55246848     0 37931 /var (/dev/vg00/lvol8)
sh         401 ora10g    1w   REG 64,0x8  23371776     0 23372 /var (/dev/vg00/lvol8)
sh         401 ora10g    2w   REG 64,0x8  23371776     0 23372 /var (/dev/vg00/lvol8)

 and run ps -ep 401, it yields:

mcelrate[327]/ #ps -ep 401 | more
ps: error on write
ps: error on write
   PID TTY       TIME COMMAND
     0 ?        262:56 swapper
     1 ?        1200:21 init
     8 ?         0:00 kmemdaemon
     9 ?         0:06 ioconfigd
    10 ?         0:00 ObjectThreadPo
    11 ?         6:46 nfsktcpd
    12 ?         0:30 autofskd
    13 ?         8:11 lvmkd
    14 ?         8:44 lvmkd
    15 ?         8:55 lvmkd
    16 ?         8:34 lvmkd
    17 ?         9:56 lvmkd
    18 ?         8:37 lvmkd
    19 ?        10:21 lvmschedd
    20 ?        225:59 ksyncer_daemon
    21 ?         1:42 lvmdevd
    22 ?         0:00 lvmattachd
    23 ?         0:00 pagetable_init
    24 ?         0:00 supsched
    25 ?         0:00 strmem
    26 ?         0:00 strweld
    27 ?         0:00 strfreebd
     2 ?        13:50 vhand
     3 ?        67:39 statdaemon
     4 ?         6:06 unhashdaemon
    28 ?        13:01 progressdaemon
    29 ?         1:04 ttisr
    30 ?        14:44 ipmid
    37 ?         0:00 eventdaemon
    38 ?        337:56 schedcpu
    39 ?        11:46 pagezerod
    40 ?         0:18 cmcd
    41 ?        39:53 smpsched
    42 ?        39:56 smpsched
    43 ?        39:50 smpsched
Standard input

 ??

 

Dennis Handly
Acclaimed Contributor

Re: how to use lsof to find files

>why some of these processes apparently do not exist?

 

Did the other two die as soon as you killed 5841? If you are killing off a bunch of them, you might want to dump the process hierarchy first:

UNIX95=EXTENDED_PS ps -H -f -p 5841,5861,5885

NDO
Super Advisor

Re: how to use lsof to find files

I did the following:

 

swagentd  1644   root    5u   REG 64,0x8        79     0  9404 /var (/dev/vg00/lvol8)
sh        1848 ora10g    1w   REG 64,0x8  32456704     0 36256 /var (/dev/vg00/lvol8)
sh        1848 ora10g    2w   REG 64,0x8  32456704     0 36256 /var (/dev/vg00/lvol8)
sh        1870 ora10g    1w   REG 64,0x8   8699904     0 10455 /var (/dev/vg00/lvol8)
sh        1870 ora10g    2w   REG 64,0x8   8699904     0 10455 /var (/dev/vg00/lvol8)
sh        1874 ora10g    1w   REG 64,0x8   8699904     0 10455 /var (/dev/vg00/lvol8)
sh        1874 ora10g    2w   REG 64,0x8   8699904     0 10455 /var (/dev/vg00/lvol8)
sh        2398 ora10g    1w   REG 64,0x8  45506560     0 15219 /var (/dev/vg00/lvol8)
sh        2398 ora10g    2w   REG 64,0x8  45506560     0 15219 /var (/dev/vg00/lvol8)
sh        2547 ora10g    1w   REG 64,0x8  51781632     0 40792 /var (/dev/vg00/lvol8)
sh        2547 ora10g    2w   REG 64,0x8  51781632     0 40792 /var (/dev/vg00/lvol8)
sh        2555 ora10g    1w   REG 64,0x8  51781632     0 40792 /var (/dev/vg00/lvol8)
sh        2555 ora10g    2w   REG 64,0x8  51781632     0 40792 /var (/dev/vg00/lvol8)
sh        2602 ora10g    1w   REG 64,0x8  16236544     0 28060 /var (/dev/vg00/lvol8)
sh        2602 ora10g    2w   REG 64,0x8  16236544     0 28060 /var (/dev/vg00/lvol8)
sh        2606 ora10g    1w   REG 64,0x8  16236544     0 28060 /var (/dev/vg00/lvol8)
sh        2606 ora10g    2w   REG 64,0x8  16236544     0 28060 /var (/dev/vg00/lvol8)
sh        3257 ora10g    1w   REG 64,0x8  31891456     0 10271 /var (/dev/vg00/lvol8)
sh        3257 ora10g    2w   REG 64,0x8  31891456     0 10271 /var (/dev/vg00/lvol8)
sh        3273 ora10g    1w   REG 64,0x8  44695552     0 36778 /var (/dev/vg00/lvol8)
sh        3273 ora10g    2w   REG 64,0x8  44695552     0 36778 /var (/dev/vg00/lvol8)
sh        3417 ora10g    1w   REG 64,0x8   2932736     0 10347 /var (/dev/vg00/lvol8)
sh        3417 ora10g    2w   REG 64,0x8   2932736     0 10347 /var (/dev/vg00/lvol8)
sh        3420 ora10g    1w   REG 64,0x8   2932736     0 10347 /var (/dev/vg00/lvol8)
sh        3420 ora10g    2w   REG 64,0x8   2932736     0 10347 /var (/dev/vg00/lvol8)
mcelrate[349]/ #UNIX95=EXTENDED_PS ps -H -f -p 1848,1870,1874
ps: error on write
ps: error on write
UID        PID  PPID  C    STIME TTY          TIME CMD
ora10g    1848     1  0  Oct  9  ?           00:00 /usr/bin/sh /data8/oradata/files/rms/cronFTP_1830.ksh
ora10g    1870     1  0  Nov 13  ?           00:00 sh -c /data8/oradata/files/rms/cronFTP_1830.ksh
ora10g    1874  1870  0  Nov 13  ?           00:00   /usr/bin/sh /data8/oradata/files/rms/cronFTP_1830.ksh
mcelrate[350]/ #kill 1848,1870,1874
sh: 1848,1870,1874: Specify a process identifier or a %job number.
mcelrate[351]/ #

I am not sure what "Specify a process identifier or a %job number." means.

Would it be possible to get help with a script that kills all the processes listed in the lsof output? It looks like no space is being recovered after killing some of the processes.

Dennis Handly
Acclaimed Contributor

Re: how to use lsof to find files

>#kill 1848,1870,1874

> I am not sure  of the "Specify a process identifier or a %job number.

 

ps -p takes a comma-separated PID list, but kill wants the PIDs as separate, space-separated arguments.
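If the same list is going to be reviewed with ps and then handed to kill, a small helper can do the conversion. This is only a sketch; the name pids_for_kill is invented here.

```shell
# Convert a ps-style comma-separated PID list ("1848,1870,1874")
# into the space-separated form that kill expects.
pids_for_kill() {
    printf '%s\n' "$1" | tr ',' ' '
}

# Usage (left unquoted on purpose so the shell splits the PIDs
# into separate arguments):
#   kill $(pids_for_kill 1848,1870,1874)
```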

 

>Would  it be possible to have  help on a script that would kill all the output of the lsof command?

 

It would be trivial to write this as an awk script, but it could be dangerous.

I suppose the first cut would be to run ps on the PIDs.

 

 

 

Dennis Handly
Acclaimed Contributor

Re: how to use lsof to find files

>Would  it be possible to have  help on a script that would kill all the output of the lsof command?

 

This will select the PIDs:

UNIX95=EXTENDED_PS ps -H -f -p $(lsof +aL1 /var | awk '
BEGIN {
   large_file = 1000000   # minimum size in bytes to flag
   regular_file = "REG"
   pid_list = ""
   getline                # skip the lsof header line
}
# unlinked (NLINK 0) regular files of at least large_file bytes
$8 == 0 && $5 == regular_file && $7 >= large_file {
   if (seen[$2]++) next   # list each PID only once
   if (pid_list == "")
      pid_list = $2
   else
      pid_list = pid_list "," $2
}
END {
   print pid_list
}')

 

Kill them:

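kill $(lsof +aL1 /var | awk '
BEGIN {
   large_file = 1000000   # minimum size in bytes to flag
   regular_file = "REG"
   pid_list = ""
   getline                # skip the lsof header line
}
# unlinked (NLINK 0) regular files of at least large_file bytes
$8 == 0 && $5 == regular_file && $7 >= large_file {
   if (seen[$2]++) next   # pass each PID to kill only once
   if (pid_list == "")
      pid_list = $2
   else
      pid_list = pid_list " " $2
}
END {
   print pid_list
}')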

NDO
Super Advisor

Re: how to use lsof to find files

Thank you. As you said it is "dangerous", I am a bit reluctant.

NDO
Super Advisor

Re: how to use lsof to find files

All the processes seen by "ps" are the same script (a crontab entry):

mcelrate[390]/ #UNIX95=EXTENDED_PS ps -H -f -p 7121,7122,7127,7142
ps: error on write
ps: error on write
UID        PID  PPID  C    STIME TTY          TIME CMD
ora10g    7142 11688  0  Sep 13  ?           25:44 /usr/bin/sh /data8/oradata/files/rms/cronFTP_1830.ksh
ora10g    7122 21593  0  Sep 13  ?           25:42 /usr/bin/sh /data8/oradata/files/rms/cronFTP_1830.ksh
ora10g    7121     1  0  Sep 27  ?           00:00 sh -c /data8/oradata/files/rms/cronFTP_1830.ksh
ora10g    7127  7121  0  Sep 27  ?           00:00   /usr/bin/sh /data8/oradata/files/rms/cronFTP_1830.ksh

 

Dennis Handly
Acclaimed Contributor

Re: how to use lsof to find files

>as you said "dangerous", I am a bit reluctant.

 

Dangerous in that if you don't review the output from the first command, you might kill too many processes with the second.

You could put a check on the USER column.
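One way such a USER check might look, as a sketch rather than a tested HP-UX recipe: the selection is restricted to a single account before anything is passed to ps or kill. The function name unlinked_pids_for_user and its two arguments are assumptions made for this example; $3 is taken to be the USER column, as in the lsof header shown earlier in the thread, and the 1000000-byte threshold is carried over from the scripts above.

```shell
# Print the PIDs of one user's processes that hold large unlinked
# regular files, reading `lsof +aL1 <filesystem>` output from stdin.
#   $1: user name to match against the USER column ($3)
#   $2: separator between PIDs ("," for ps -p, " " for kill)
# Assumed columns: $2 PID, $3 USER, $5 TYPE, $7 SIZE/OFF, $8 NLINK.
unlinked_pids_for_user() {
    awk -v user="$1" -v sep="$2" -v min_size=1000000 '
    $3 == user && $5 == "REG" && $8 == 0 && $7 >= min_size && !seen[$2]++ {
        list = (list == "" ? $2 : list sep $2)
    }
    END { print list }'
}

# Review first, kill only after checking the ps output:
#   UNIX95=EXTENDED_PS ps -H -f -p $(lsof +aL1 /var | unlinked_pids_for_user ora10g ",")
#   kill $(lsof +aL1 /var | unlinked_pids_for_user ora10g " ")
```

With the user fixed to ora10g, daemons owned by other accounts are never selected, even if they happen to hold a large deleted file.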

 

>all the processes seen by "ps" are the same script (crontab entry)

 

These are all old.  One tree has been orphaned.

Can you add these PIDs to the list you pass to ps: 11688 and 21593?