Operating System - HP-UX

millions of files per directory

 
SOLVED
Joe Odman
Occasional Advisor

millions of files per directory

Is there any filesystem that supports millions of files per directory? These are very small files. Assuming that the filesystem must continue to support the large numbers, how can I alleviate the obvious performance issues? Does HP, Veritas, Pillar, NetApp, EMC, or anyone have a solution?
14 REPLIES
lawrenzo
Trusted Contributor

Re: millions of files per directory

The problem with having millions of files in a directory is that commands run against it (e.g. ls or find) will take an excessive amount of time, and a wildcard search such as ls -l * may fail outright because the expanded argument list exhausts memory.

Other things like backups or filesystem synchronisation may also be prolonged because of the number of files that have to be opened and written.

As far as I am aware, the inode limit determines how many files can be in a filesystem, since each file that is added takes an entry in the inode table.

hello
lawrenzo
Trusted Contributor

Re: millions of files per directory

What application will be writing these files or accessing them? Is there a tuning document for the application?
hello
Joe Odman
Occasional Advisor

Re: millions of files per directory

Assume that the application cannot change. The filesystem itself must change.
Wouter Jagers
Honored Contributor

Re: millions of files per directory

I have seen the side-effects Lawrenzo is talking about first-hand, and I can tell you it is far from easy to work on a problem within such directories.

Best thing to do (if possible) is to create some sort of hashing algorithm to put these millions of files in a tree of subdirectories.

For example, given a bunch of files ranging from 'a000000' to 'c999999' you could start by having subdirectories 'a', 'b' and 'c' (each good for one million files). Within these directories you could then have subdirectories '000' to '999', each holding one thousand files.

A simplified example, of course, but I'd try implementing something like this in order to avoid the complications described above.
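
Something along these lines, just as a rough Python sketch (the function names, the 'root' argument and the exact bucketing rule are only illustrative assumptions):

```python
import os

def hashed_path(root, filename, width=3):
    """Map a flat filename such as 'a123456' onto a two-level
    subdirectory tree: <root>/a/123/a123456."""
    prefix = filename[0]            # 'a', 'b' or 'c' bucket
    bucket = filename[1:1 + width]  # next three digits -> '000'..'999'
    return os.path.join(root, prefix, bucket, filename)

def store(root, filename, data):
    """Create the bucket directories on demand and write the file there."""
    path = hashed_path(root, filename)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as fh:
        fh.write(data)

# Example: 'b042731' ends up in <root>/b/042/b042731,
# so no single directory ever holds more than about a thousand entries.
```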

Cheers,
Wout
an engineer's aim in a discussion is not to persuade, but to clarify.
lawrenzo
Trusted Contributor

Re: millions of files per directory

You will have to set the inode limit, or at least check the current value and determine whether it will be reached.


Set up a script, as mentioned, to move the files into subdirectories, or the environment will become unmanageable.
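
A rough sketch of such a move script, assuming Python is available (the paths, the bucket width and the dry_run flag are illustrative assumptions; try it on a copy of the data first):

```python
import os
import shutil

def migrate_flat_dir(src, dst, width=3, dry_run=True):
    """Move files from one huge flat directory 'src' into a hashed
    tree under 'dst' (same bucketing idea as described above).
    With dry_run=True it only prints what it would do."""
    with os.scandir(src) as entries:
        for entry in entries:
            if not entry.is_file():
                continue
            name = entry.name
            bucket = os.path.join(dst, name[0], name[1:1 + width])
            if dry_run:
                print(entry.path, "->", os.path.join(bucket, name))
                continue
            os.makedirs(bucket, exist_ok=True)
            shutil.move(entry.path, os.path.join(bucket, name))

# migrate_flat_dir("/data/flat", "/data/tree", dry_run=True)
```

os.scandir streams the directory entries instead of building one huge list in memory, which matters when the directory really does hold millions of names.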

HTH

Chris
hello
James R. Ferguson
Acclaimed Contributor
Solution

Re: millions of files per directory

Hi Joe:

Divide and conquer. That said, by using a current VxFS (JFS) release (e.g. 4.1 or later) with the latest disk layout version, and choosing mount options that meet your needs while offering the best performance, you can probably achieve some gains.

Have a look at this white paper on JFS performance and tuning:

http://docs.hp.com/en/5576/JFS_Tuning.pdf

Another good source of mount options as they relate to performance for VxFS filesystems is the manpage for 'mount_vxfs'. You might find, for instance, that mounting with 'noatime' helps speed up your filesystem searches if that is their predominant activity.

http://docs.hp.com/en/B2355-60105/mount_vxfs.1M.html

Regards!

...JRF...
Joe Odman
Occasional Advisor

Re: millions of files per directory

This is the most appropriate response so far. We are on JFS layout 4 now, but will look at migrating to layout 5. We may also test noatime. The NetApp WAFL filesystem claims to be better still, but this is not quantified. I also found that ReiserFS version 4 (SUSE) handles a million files in a directory efficiently. Still looking for a better HP-UX solution.
A. Clay Stephenson
Acclaimed Contributor

Re: millions of files per directory

I would take a baseball bat to whatever developer or vendor came up with this scheme, but if you are looking for solutions that insist the application be maintained as-is, I would say the only viable alternative is a solid-state disk. Directory searches are linear, so finding a given file still requires n/2 accesses on average (roughly 500,000 directory entries scanned per lookup when n is a million); at least with a solid-state disk (backed up automatically and transparently to conventional disks), these searches will be as fast as possible.

You are essentially using the directory as a database - something that it was never intended to do.
If it ain't broke, I can fix that.
Thomas J. Harrold
Trusted Contributor

Re: millions of files per directory

I had investigated "content" storage appliances, such as EMC's Celerra, several years ago. They used a different approach, called "node-based" storage, where the solution is actually a group of small servers, each with its own local storage, plus a set of master nodes used to organize the storage. EMC claimed that this system could handle millions of files.

Could I ask why? Could you work out a front-end access script/program that would sort and store the files in separate directories instead?

-tjh
I learn something new everyday. (usually because I break something new everyday)
Bill Hassell
Honored Contributor

Re: millions of files per directory

> I had investigated "content" storage appliances, such as EMC's Celerra, several years ago. They used a different approach, called "node-based" storage, where the solution is actually a group of small servers, each with its own local storage, plus a set of master nodes used to organize the storage. EMC claimed that this system could handle millions of files.

The product is called Centera and it is the first successful commercial product using CAS (content-addressable storage). Millions of files are trivial -- we have dozens of terabytes of small files on Centeras. NOTE: there is no directory structure at all, so you need a database to keep track of the special name for each file. Also, performance is limited; it is really designed for low-volume access such as data archiving -- somewhere between disk arrays and tape silos.

As Clay mentioned, you have a developer problem that sounds suspiciously like a way to avoid buying a real database program (i.e., every part number in the company has a small file with data in it). VxFS does not have any practical limitation on inodes since they are built dynamically. As long as you have space, you can add more files.

But don't do searches (i.e., ls * or find, etc.) and expect anything faster than minute-long responses. Once the developer tries to make this work, you'll probably get Version 2 of the software, where a small database tracks all the files with binary searches, or maybe a hash algorithm...hmmm, who knows, you may end up with Version 3, which is a real database.
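
Just to illustrate the idea, such a "Version 2" index could be as small as a Python script using the standard sqlite3 module (the table name, columns and hashed-path layout shown here are assumptions, not anything the developer has committed to):

```python
import sqlite3

def open_index(db_path):
    """Open (or create) a tiny index mapping logical names to on-disk paths."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        " name TEXT PRIMARY KEY,"   # logical file name the application uses
        " path TEXT NOT NULL)"      # actual hashed location on disk
    )
    return con

def register(con, name, path):
    """Record (or update) where a given logical file actually lives."""
    con.execute("INSERT OR REPLACE INTO files (name, path) VALUES (?, ?)",
                (name, path))
    con.commit()

def lookup(con, name):
    """Indexed lookup instead of a linear scan of a huge directory."""
    row = con.execute("SELECT path FROM files WHERE name = ?",
                      (name,)).fetchone()
    return row[0] if row else None

# con = open_index("/data/file_index.db")
# register(con, "a123456", "/data/tree/a/123/a123456")
# print(lookup(con, "a123456"))
```

The lookup then goes through the database's own index rather than walking the directory, so it stays fast however many files there are.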

Note: if you want the best possible performance with millions of files, you MUST install 11.31 as there are specific enhancements for massive directories -- and no, it still won't perform like a real database.


Bill Hassell, sysadmin
dirk dierickx
Honored Contributor

Re: millions of files per directory

Well, we can all agree that the application is a prime example of bad design.

Perhaps ReiserFS can handle these things (but that is not available on HP-UX); in any case, it is not only the filesystem itself that has to be efficient here.

Try as hard as you can to get this stuff into a database; that is what databases are made for and what they do best. They will outperform any filesystem for sure.
Joe Odman
Occasional Advisor

Re: millions of files per directory

Thank you all for your responses. I will look into Celerra.

Bill, I assume you are referring to layout 5 of VxFS 3.5? We are installing it for testing now.

I have also found XFS, developed by SGI and ported to SUSE Linux. XFS can also support a large number of files in a directory. We will be testing this as well.

Thomas J. Harrold
Trusted Contributor

Re: millions of files per directory

Bill is correct, it's "Centera", not "Celerra". Celerra is an NFS head for an EMC storage array. I've worked with both, and always got the names confused. :)

-tjh
I learn something new everyday. (usually because I break something new everyday)
Bill Hassell
Honored Contributor

Re: millions of files per directory

That's correct. The version 5 directory structure (and the associated subsystem code) is much more efficient with directory scans. Note that VxFS (any version) can have essentially unlimited files. How you manipulate that structure is where you'll see the really bad performance. If you know the exact full pathname of a file, you'll get great performance; an ls or find will be painfully slow on layout versions 3 and 4.


Bill Hassell, sysadmin