Operating System - HP-UX

statistics by file extension (sic) required

 
SOLVED
Derek Brown
Frequent Advisor

Hi

I wonder if you guys could help me please ?

My boss has given me a task that I thought would be straightforward but is proving too difficult for me. I know Unix doesn't use file extensions as such, but files are often created with names ending in .dbf, .doc, .txt, etc. He wants a scan of a full server showing, for each distinct file extension, the capacity used by files with that extension (in GB preferably) and the number of files that have it.

I came across quite a good example when I googled for it, but the example was for a Mac, and the sed command looks like it might be slightly different on the Mac, as it gives an error when run on HP-UX. This is the syntax I found:

find / -fstype local -type f 2>/dev/null | tr '[:upper:]' '[:lower:]' | sed -Ee 's/^.*\/\.?//' -e 's/.*(\.[^.]*)/\1/' -e 's/^[^.]*$/NONE/' | sort | uniq -c | sort +0nr

The full item can be seen at http://ask.metafilter.com/21222/Most-idespread-file-format

I would much appreciate your help. Thanks.
3 REPLIES
TTr
Honored Contributor

Re: statistics by file extension (sic) required

Someone might already have a script that does this but...

You can easily adjust the given command to work on HP-UX.

find /etc -type f | sed -e 's?/.*/??' | grep "\." | sed -e 's/.*\.//' | sort | uniq -c |sort +0nr |more

(The two seds and the grep can probably be simplified into one sed, but I don't have my reference book with me right now. I am sure someone will correct me.)

The capacity, as you say, is more involved. Probably start with something like this:
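For what it is worth, one possible single-sed form is sketched below. It is only a sketch and has not been tried on HP-UX: the function names ext_only and ext_count are made up for illustration, it assumes every path contains at least one slash (true for find /etc), and it uses sort -rn in place of the older sort +0nr spelling.

```shell
# Reduce each path on stdin to its extension (the text after the last dot
# of the file name). Names without a dot are dropped, as the grep did.
ext_only() {
    sed -n 's|^.*/[^/]*\.\([^./]*\)$|\1|p'
}

# Count occurrences per extension, most frequent first.
ext_count() {
    ext_only | sort | uniq -c | sort -rn
}

# Usage (not run here):
#   find /etc -type f | ext_count | more
```

The -n flag with the trailing p prints only the lines where the substitution succeeded, which is what replaces the separate grep.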

find /etc -type f -exec ll {} \; | awk '{print $5" "$9}' | sed -e 's?/.*/??' > /size-name

You now have a long listing of the size and the name of each file.

From here on you have at least two options.

1) Use an Excel spreadsheet to sort the filenames and sum up the sizes

2) Write a script to grep each extension and use the "bc" or "dc" calculators to add up the size of each file
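Option 2 could also be done in a single awk pass instead of grep plus bc. A minimal sketch, assuming the /size-name layout produced above (size, one space, then the file name with the directory already stripped) and an awk that supports match(), sub() and tolower(). The function name sum_by_ext and the NONE label are illustrative only.

```shell
# Read "size name" lines on stdin; print count, total in GB, and extension.
sum_by_ext() {
    awk '{
        size = $1
        name = $0
        sub(/^[^ ]+ /, "", name)        # drop the size column, keep the name
        ext = "NONE"                    # label for names with no extension
        if (match(name, /\.[^.]+$/))    # last dot followed by at least one char
            ext = tolower(substr(name, RSTART + 1))
        count[ext]++
        bytes[ext] += size
    }
    END {
        for (e in count)
            printf "%d %.3f %s\n", count[e], bytes[e] / (1024 * 1024 * 1024), e
    }' | sort -k3
}

# Usage (not run here):
#   sum_by_ext < /size-name
```

Lower-casing the extension means .TXT and .txt are summed together. Dividing by 1024*1024*1024 gives the GB figure that was asked for.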



Hein van den Heuvel
Honored Contributor
Solution

Re: statistics by file extension (sic) required

I know I replied to a similar question (for OpenVMS) here before but cannot readily find it. Anyway... here is a perl 'one-liner':

find . | perl -ne 'chomp; $t=(/[^.\/]\.(\w+)$/)?$1:"?"; $c{$t}++; $s{$t}+=-s $_ }{ for (sort keys %c) { printf "%6d %5.1fmb %s\n", $c{$_}, $s{$_}/1048576, $_}'

But it is better written as a (perl) script...

---- by_file_extention.pl -------
use strict;
my ($extention, %size, %count);
while (<>) {
chomp;
$extention=lc((/[^.\/]\.(\w+)$/)?$1:"? no extention");
$count{$extention}++;
$size{$extention}+=-s;
}
for (sort keys %count) {
printf "%6d %5.1fmb %s\n", $count{$_}, $size{$_}/(1024*1024), $_;
}

For the usage example I just used a local directory structure. And I report MB, not GB. Easy edit for the label and an extra *1024.
And for the 'unknown' extension you may want to code up a -d (directory) test and call it that. Left as an exercise!

find . | perl by_file_extention.pl

50 3.1mb ? no extention
5 0.0mb awk
5 0.0mb c
1 0.0mb dos
2 0.0mb el
2 1.1mb exe
:
Hein van den Heuvel
Honored Contributor

Re: statistics by file extension (sic) required

Here is a version which adds directory logic, and a grand total.

Note, those elements are labeled with ~~ to make them sort towards the end.

Example:

$ find /opt | perl by_file_extension.pl
find: ... some messages for STDERR open ...
25 0.1 0
:
1318 0.0 pl
1834 0.0 pm
568 0.0 png
:
5 0.0 zip
4743 0.0 ~~ Directory
3869 1.0 ~~ No extension
51994 3.7 ~~~ Grand Total ~~~


Updated source

----------- by_file_extension.pl -----

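use strict;
my ($extension, %size, %count);
while (<>) {                      # one path per line, as printed by find
    chomp;
    # Key: lower-cased extension, or an artificial "~~" label so it sorts last
    $extension = (/[^.\/]\.(\w+)$/) ? lc($1) : (-d) ? "~~ Directory" : "~~ No extension";
    $count{$extension}++;
    $size{$extension} += -s;      # -s with no argument sizes the file named in $_
    $count{"~~~ Grand Total ~~~"}++;
    $size{"~~~ Grand Total ~~~"} += -s;
}
for (sort keys %count) {
    # count, size in GB (2**30 bytes), extension
    printf "%8d %7.1f %s\n", $count{$_}, $size{$_}/(2**30), $_;
}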

So I match the file names found by find with:
/[^.\/]\.(\w+)$/

So it looks for:
[^.\/] = NOT (a dot or a slash), so a 'hidden' file name such as .profile is not treated as an extension.
\. = a dot
(\w+) = 1 or more 'word' characters (a-z, A-Z, 0-9, _) ... and remember them in $1
$ = at the end of the line.

If it matches, then use the lower case of $1 (the word) as a key in an associative array.
So .EXE is counted with .exe

If it does not match, then check whether it is a directory ( -d ) and pick an artificial extension name based on the result.
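To see the matching rule on a few sample names without perl, here is a rough shell approximation. It is a sketch only: the function name classify_ext and the "(none)" label are made up, \w is approximated by [[:alnum:]_] in a plain sed expression, and the directory test is left out.

```shell
# Print the lower-cased extension for each path on stdin, or "(none)" when
# the pattern does not match (hidden files, no dot, nothing after the dot).
classify_ext() {
    while IFS= read -r path; do
        ext=$(printf '%s\n' "$path" |
              sed -n 's|^.*[^./]\.\([[:alnum:]_][[:alnum:]_]*\)$|\1|p' |
              tr '[:upper:]' '[:lower:]')
        printf '%s -> %s\n' "$path" "${ext:-(none)}"
    done
}

# Usage (not run here):
#   find . | classify_ext
```

With this sketch ./bin/SETUP.EXE comes out as exe and ./src/a.tar.gz as gz, while ./.profile and ./notes both come out as (none), because there is no character other than a dot or a slash in front of a final dot.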

Enjoy!
Hein.