Operating System - OpenVMS

SOLVED
Jon Pinkley
Honored Contributor

VMS utility to determine file size distribution?

Does anyone know of a VMS utility that provides information about file size distribution?

There have been several threads asking about what cluster size to initialize a volume with, and knowing the file size distribution will help choose an appropriate size.

It's relatively easy to determine the mean file size on a volume, but determining the median size is much harder.
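
(The mean is a one-liner; for example, a grand-total listing reports total files and total allocated blocks, and dividing the two gives the mean:

$ directory/grand_total/size=allocated disk:[000000...]*.*;*

The median, though, needs the whole distribution.)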

I tried several Google searches, but I haven't been successful. I thought there was a DECUS utility, but I can't remember for sure whether it exists, or if it does, what it was named.

The DFG/DFO defrag utility gives a histogram of free space extents, but not file sizes. Perhaps that is what I remember. The command to get that is:

$ defrag show /histogram

I did find a UNIX utility, fsstats. Does anyone know of something similar for VMS?

Description of fsstats
http://www.pdsi-scidac.org/fsstats/download.html

Sample output from fsstats
http://www.pdsi-scidac.org/fsstats/files/sample_output

Somewhat interesting paper on file size distributions on a university UNIX system (1984 vs. 2005), and on how different block sizes affect file system usage. Note that it is UNIX-centric, and the UNIX and VMS file systems and caches behave differently.
http://www.cs.vu.nl/~ast/publications/osr-jan-2006.pdf
it depends
12 REPLIES
Jim Hintze
Advisor

Re: VMS utility to determine file size distribution?

I expect that cluster size has more to do with the geometry of the media than RMS allocation settings. The cluster size should be a divisor that produces a quotient with no remainder when the track size is used as the dividend.

CA Performance Advisor:

$ advis coll repo disk sys$sysdevice

Disk Analysis _$1$DGA4244: (USTPROD_SYS) Page 3
Summary of Allocated Space PSDC V3.1-0805
Thursday 27-MAY-2010 07:24


Space Allocated per Header No. Headers Cum % Headers
-------------------------- ----------- -------------

>= 64, < 128 20442 80.7
>= 128, < 192 1096 85.0
>= 192, < 320 544 87.2
>= 320, < 640 682 89.9
>= 640, < 1280 450 91.6
>= 1280, < 1920 289 92.8
>= 1920, < 3200 292 93.9
>= 3200, < 6400 332 95.2
>= 6400, < 12800 907 98.8
>= 12800, < 19200 103 99.2
>= 19200, < 32000 48 99.4
>= 32000, < 64000 24 99.5
>= 64000, < 128000 37 99.7
>= 128000, < 192000 41 99.8
>= 192000, < 320000 24 99.9
>= 320000, < 640000 2 99.9
>= 640000, < 1280000 1 99.9
>= 1280000, < 1920000 0 99.9
>= 1920000, < 3200000 4 99.9
>= 3200000 16 100.0
Fekko Stubbe
Valued Contributor
Solution

Re: VMS utility to determine file size distribution?

Hi Jon,
I have a program, written in Fortran, that displays the consequences of changing the cluster size. It reports the space lost for various cluster sizes.
See the example below:
$clus dsa0:
Maximum indexfile
Fileheaders total 1047510 used 39975 free 1007535
Current size indexfile = 157410 blocks
Fileheaders total 157142 used 39975 free 117167
Current clustersize = 3
Disk DSA0: has volume SYS_ITV002 Max. Files 1047510 ( 4.35%)
#blocks used 8884089 #blocks allocated 9205746 Waste = 3.62%
Lost blocks = 269391
Found 45559 files, total nblock = 8884089 average size = 195
Found 9 files >= 65536 total size 3755249
Cluster size 3 tot_blocks = 8936364 waste = 0.59%
Cluster size 1 tot_blocks = 8884093 waste = 0.00%
Cluster size 2 tot_blocks = 8910406 waste = 0.30%
Cluster size 3 tot_blocks = 8936364 waste = 0.59%
Cluster size 4 tot_blocks = 8965407 waste = 0.92%
Cluster size 8 tot_blocks = 9084149 waste = 2.25%
Cluster size 16 tot_blocks = 9346377 waste = 5.20%
Cluster size 18 tot_blocks = 9418634 waste = 6.02%
Cluster size 32 tot_blocks = 9916961 waste = 11.63%
Cluster size 35 tot_blocks = 10030451 waste = 12.90%
Cluster size 64 tot_blocks = 11150289 waste = 25.51%
Cluster size 70 tot_blocks = 11388084 waste = 28.19%
Cluster size 144 tot_blocks = 14473241 waste = 62.91%
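
(In effect, each file's allocation is rounded up to a multiple of the candidate cluster size and the round-ups are summed. A minimal Perl sketch of that core calculation, for anyone who wants to reproduce the totals:

use POSIX qw(ceil);
# Total blocks consumed if every file is rounded up to a multiple of $cluster.
sub total_with_cluster {
    my ($cluster, @sizes) = @_;    # @sizes = per-file allocated block counts
    my $total = 0;
    $total += ceil($_ / $cluster) * $cluster for @sizes;
    return $total;
}
)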

It can also report counts by file size.
Is this something you need?

Fekko
Hein van den Heuvel
Honored Contributor

Re: VMS utility to determine file size distribution?


Jon, I've written such tools but I can't readily find them. Mostly I just found that the simple algorithms are best.

The maximum waste is simply the number of files times (cluster size - 1).

The average waste is simply the number of files times half the cluster size.

Now if you had a dominant file allocation size, say 100,000 files out of the 300,000 files are always 1234 blocks, then you would know that, right? So calculate the exact waste for those files and exclude them from the rest.
Or just pick the cluster size to match that dominant size.
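
In code, those two rules of thumb are just (a minimal Perl sketch with made-up figures, not one of the lost tools):

# Rough waste bounds for a volume, per the rules of thumb above:
#   worst case: every file loses (cluster - 1) blocks to round-up
#   average case: every file loses about cluster/2 blocks
my ($files, $cluster) = (300_000, 16);    # example figures only
my $max_waste = $files * ($cluster - 1);  # 4500000 blocks
my $avg_waste = $files * $cluster / 2;    # 2400000 blocks
printf "max %d blocks, avg %d blocks\n", $max_waste, $avg_waste;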

Jim>> I expect that cluster size has more to do with the geometry of the media than RMS allocation settings. The cluster size should be a divisor that produces a quotient with no remainder when the track size is used as the dividend.

The notion of exploiting disk geometry went out with the horse and buggy.
Disks have had variable geometry for decades (more sectors on the outer bands than the inner bands).
And mostly folks do not talk to disks directly anyway, but through smart controllers which slice and dice allocation over spindles as they see fit.

OpenVMS just aims for about 1M (1024*1024) clusters per disk, figuring that will allow for up to a million (single-extent) files if need be, with reasonable waste for most usages.

The single suggestion I end up with is to pick a power of 2: make it large when you anticipate few files (thousands), and smallish, like 16, when expecting many files (hundreds of thousands).

This will help align with the typical RMS sequential file defaults; the storage folks like it, and the XFC cache likes it.
It is tricky to get it all to work together, but with cluster sizes like 16 or 256 you have the best odds.

Mark Hopkins once produced a histogram of MB/sec through the caches for varying IO sizes. It had significant peaks at 4, 8, 16, 32 and 64, with the peaks at 16 and 32 being the highest.

Hope this helps some,
Hein


Jim Hintze
Advisor

Re: VMS utility to determine file size distribution?

Oh, I thought he was talking about an RM03 on an 11/780. :-)
Hoff
Honored Contributor

Re: VMS utility to determine file size distribution?

Here's some very quick DCL:

http://labs.hoffmanlabs.com/node/1582

Not pretty, but it gives you a powers-of-10 distribution.

That result could be switched into a percentage graph or various other displays with minimal effort; this stuff isn't rocket science.

FWIW... Disk geometry is increasingly fictional on current-generation hard disks; VMS eliminated its dependence on geometry a while back. VMS also expects the older 512-byte sector size. Current-generation IDEMA-compliant hard disk drives are now arriving with 4 KiB sectors; when HP might add that support, and when it might start deploying SSD hardware, is not something I'm aware of.
Hein van den Heuvel
Honored Contributor

Re: VMS utility to determine file size distribution?

Jim, nice comeback! Made me smile!


Jon,
I rolled some quick Perl (see below), which may be good enough.
It may well have an off-by-one error (or two), but it will help you decide what you need.

I feed it with DFU SEARCH output, or DIR/SIZE=ALL output.

Here is what it gives for the EISNER scratch disk ( DRA2 ) as example:

64416 files with some allocation, Total allocation = 9146582, Average size = 141.

Zone Limit Count.
0 1 4586
1 2 24275
2 7 24790
3 20 5496
4 54 2783
5 148 1288
6 403 676
7 1096 287
8 2980 112
9 8103 69
10 22026 26
11 59874 19
12 162754 8
13 442413 1

21 1318815734 0


Cluster Waste Simple
1 0 0
2 32892 32208
3 68845 64416
4 100098 96624
5 134978 128832
6 162088 161040
7 191782 193248
8 224946 225456
9 256105 257664
10 293788 289872
11 334703 322080
12 368014 354288
13 415022 386496
14 463018 418704
15 514318 450912
16 559786 483120
:
126 7156558 4026000
127 7217495 4058208
128 7282346 4090416

As you can see, the simple average algorithm drifts away from reality a lot once the cluster size becomes larger than most files. The other simple calculation, the maximum, then comes closer: files * (cluster_size minus fudge), where fudge would be the mean file allocation.

Hein

# ------ [hein]file_and_cluster_sizes.pl ----
# File size distribution and cluster size effects.
# Feed this with DFU SEARCH output or DIR/SIZE=ALL. Lines should look like:
# x$y:[p.q.r]a.b;n n/m

$min_cluster = 1;
$max_cluster = 128;

while (<>) {
    next unless /;\d+\s+(\d+)\//;        # first number of the n/m size pair
    next unless $1;                      # skip files with no allocation
    $allocation = $1;
    $total_allocation += $1;
    $files++;
    $zone[ int(log($allocation)) ]++;    # bucket by natural-log "zone"
    for ($i = $min_cluster; $i <= $max_cluster; $i++) {
        $used = $allocation % $i;
        $waste[$i] += ($used) ? $i - $used : 0;    # round-up loss at this cluster size
    }
}
$avg = int($total_allocation/$files);
printf "$files files with some allocation, Total allocation = $total_allocation, Average size = $avg.\n";
print "\n Zone Limit Count.\n";
for ($i = 0; exp($i) < 2**31; $i++) {    # VMS will do 2GB files soon
    printf "%5d%12d%12d\n", $i, int(exp($i)), $zone[$i];
}
print "\nCluster Waste Simple\n";
for ($i = $min_cluster; $i <= $max_cluster; $i++) {
    printf "%5d%12d%12d\n", $i, $waste[$i], int($files * ($i-1) / 2);
}
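
A typical way to feed it might be something like:

$ pipe directory /size=all /nohead /notrail disk:[000000...]*.*;* | perl file_and_cluster_sizes.pl

(DFU SEARCH output piped the same way should also match the pattern.)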
John Gillings
Honored Contributor

Re: VMS utility to determine file size distribution?

Jon,

Rather than try to build a utility to do this, I'd start with some raw data and play with it in a spreadsheet.

$ PIPE DIRECTORY/NOHEAD/NOTRAIL/SIZE=ALL disk:[000000...]*.*;* | search sys$pipe "/" > rawdata.dat

now edit rawdata.dat, change all "/" into "," and collapse out spaces. You now have a CSV file with file sizes in the first column and allocations in the second.
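
If you'd rather not edit by hand, a rough Perl one-liner (a sketch along these lines, untested) does the same conversion in one pass, turning each line into name,size,allocation:

$ pipe perl -pe "s|\s+(\d+)/(\d+)|,$1,$2|" rawdata.dat > rawdata.csv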

Read the data into your favourite spreadsheet and generate histograms of sizes and allocations. You can also experiment with different cluster sizes by generating columns with projected minimum allocations and comparing column totals.

You have many more options for manipulating and visualising the data than you could easily build into a utility.
A crucible of informative mistakes
Jon Pinkley
Honored Contributor

Re: VMS utility to determine file size distribution?

Jim,

Thanks, the Performance Advisor report is probably what I remember. We had the DEC Performance Advisor, but when it was sold to CA we didn't continue with the maintenance, and many of the features of PSDC were VMS version dependent, so it is no longer in our startup.

Since the disk analysis reporting feature just looks at the INDEXF.SYS and BITMAP.SYS files, I decided to try it, and the PSDC V2.2-51 PSDC$DSKANL still works (with 1995 limitations) on VMS 8.3, although it throws a warning message. Since it predated extended bitmaps, perhaps that is what the buffer overflow warning is about.

OT$ anal/image/sel=(id,link,build) sys$system:psdc$dskanl.exe
SYS$COMMON:[SYSEXE]PSDC$DSKANL.EXE;2
"PSDC V2.2-51"
11-OCT-1995 11:47:58.55
""
OT$ advise collect report disk disk$user1 /out=scr:t.t
%PSDC-W-GETMSGWARN, $GETMSG System Service Warning
-SYSTEM-S-BUFFEROVF, output buffer overflow
OT$

I have attached a portion of the output from the above command for anyone that is interested.

This utility is quite efficient, as it gets its info by scanning indexf.sys and bitmap.sys directly (it does not have to traverse all directory files).

So I can use this, but it isn't something that we can expect ITRC users to be able to run.

BTW, are you the same Jim Hintze that presented the paper about Disk Fragmentation at the spring 1981 DECUS symposium? (I am guessing you are since you know about RM03 drives)

On the Fragmentation of Disk, Jim Hintze, Eric Deaton,
Weeg Computing Center, pp. 1321-1325 Proceedings of the
Digital Equipment Computer Users Society Spring 1981,
Volume 7, Number 4.

---------------------

Fekko,

If this is something that you can share, then yes, I would be interested. Especially if it can be released like DIX and ACX, so ITRC folks could use it to gather information. I didn't see it on the http://www.oooovms.dyndns.org/ site, but perhaps I didn't know where to look.

I assume that it is getting its file info by going directly to INDEXF.SYS, and not by traversing all directories on the disk. If so, this may be the fastest freely available tool for file size distribution analysis.

---------------------

Hein,

I agree with your recommendations here and in http://forums.itrc.hp.com/service/forums/questionanswer.do?threadId=1431785

Specifically, a cluster size that is a power of 2, and using different disks for small volatile files and for large, infrequently extended files. Also, pre-extending the large RMS indexed files makes a lot of sense, especially if you can do so with CBT (contiguous best try) extensions. Just curious: do your tools allow you to extend a file while it is open (with update sharing allowed) by another process? I assume your tool uses the $EXTEND service.

One reason I like powers of 2 for cluster sizes is that it prevents files from growing when backing up from one disk to another, even if BACKUP/TRUNCATE is not used: an allocation that is a multiple of a larger power of 2 is automatically a multiple of any smaller power of 2, so nothing needs rounding up when the target's cluster size is the same or a smaller power of 2.

Being nitpicky with your first comment: since VMS 7.2 the file allocation bitmap is no longer limited to 255 blocks (255 blocks * 512 bytes/block * 8 bits/byte = 1044480 bits, which limited a pre-7.2 disk to 1044480 clusters); it can now be up to 65535 * 512 * 8 = 268431360 bits, i.e. that many clusters per disk. Since your comment was in a "performance" related thread, I will concede that using extended bitmaps is not for enhanced performance (it can cause slow CBT creation/extension on fragmented disks); it is to allow for better space utilization of large disks.

The Perl tool you provided is good because it allows a subset of the files on the disk to be analyzed. The same is true of Hoff's DCL and of John Gillings's suggestion. For a disk with many files, though, these would probably be quite a bit slower than a tool that goes directly to INDEXF.SYS for the info.

---------------------

Hoff,

Interesting method to get powers of ten without a log function.

$ sizelen = f$length(size)
$ count_'sizelen' = count_'sizelen' + 1

Your submission has the advantage of working standalone on a plain vanilla VMS system that can't have any software installed.

---------------------

John Gillings,

For detailed analysis, loading into a spreadsheet is a good method. There may be issues with some spreadsheets not being able to deal with the number of records on a disk with many files.

---------------------

All,

Just for completeness, I did find another utility that will report file size distribution. Executive Software (makers of the Diskeeper defrag utility) has a "Disk Analysis Utility" available for download, "free to qualified System Managers". I saw the pointer to this in their online whitepaper "Fragmentation: the Condition, the Cause, the Cure" http://www.diskeeper.com/fragbook/FRAGBOOK.HTM (an infomercial for Diskeeper), where you can see an example of the output. There is a link to a page to download the "OpenVMS Fragmentation Analysis Utility", but I didn't download it: you must supply contact info, and I didn't want to be on their mailing list.

http://www.diskeeper.com/trialware/trialwareproducts.aspx ! link to page with download form (requires info I didn't want to provide)

Jon
it depends
Fekko Stubbe
Valued Contributor

Re: VMS utility to determine file size distribution?

Jon,

I will make a kit and place it on oooovms.dyndns.org. Give me a week or so.
I will ask Ian Miller to place an announcement on openvms.org.

Fekko