Re: Derive a unique number from a file name ?

Thomas Ritter · ‎08-03-2007

I want to generate a unique number from a file name. This number will then be used as a primary key for storing the file name and other attributes into an indexed file. I’m familiar with the checksum verb and wonder if something similar is available or could be written just for a file name. The file name will always be unique, that is no versions are involved. What methods are available for deriving a unique file number from the name?

Richard Whalen · ‎08-03-2007

I think that there are two options - find a hashing function with a low probability of collisions, or use the already unique file-id that VMS has given the file.

A file name may be short enough that something like MD5 doesn't work well at providing unique hashes. (I don't know the science behind these functions to be certain.)

The VMS file id will be unique on a per disk basis, so if the files are spread across a number of disks, then you'll need something else as well.

Robert Gezelter · ‎08-03-2007

Thomas,

Short of keeping a persistent table of filenames that are assigned numbers, the simple answer is NO.

Filenames (including directory and device names) can be rather lengthy. Any scheme that takes a VERY long string and reduces it to a fixed length number will have some probability of producing identical fixed length numbers from different strings. Generically, these indexing techniques are referred to as "hashing" (the functions used to collapse arbitrary, or somewhat arbitrary strings) to an index if fixed range are referred to as "hashing functions". The classic reference for this is Donald Knuth's "Art of Computer Programming: Volume 3: Sorting and Searching", Chapter 6, in particular Section 6.4 (admittedly this is a classic reference, it is copyright 1973).

If it is possible to identify some limits on the filenames that turn the hash function into a strict compression function, then the above warning about collisions does not apply.

The same restriction applies to the cryptographic hashes (e.g., SHA-1, MD-5). It is effectively an implication of basic Information Theory. Collapsing information in such a way that it cannot be reconstituted means that there will be collisions that without the original data are not resolvable.

If the filenames can be restricted, such that a compression scheme works appropriately, rather than a hashing scheme, that is, of course, a far different situation. However, I would recommend great caution. It is not difficult to imagine how such a scheme could be violated unknowingly in the future, with serious repercussions.

If I have been unclear, or can be of further assistance, please let me know.

- Bob Gezelter, http://www.rlgsc.com

Steven Schweda · ‎08-04-2007

The file name is _already_ a unique number.
Even on ODS5, where a complete file
specification can be up to 4095 characters
long, that's less than 32K bits. Of course,
if you were looking for a number more like
32 bits than 32K bits, then, for the usual
meaning of "unique", you're probably out of
luck. Unless you wish to _rely_ on luck to
avoid hash collisions.

> [...] no versions are involved [...]

Adding another 16 bits would be the least of
your problems.

If you're willing to relax the uniqueness
requirement, then one of the popular hash
functions could probably be made to serve,
but then you do need to handle the hash
collision cases.

There's an old saying about trying to put 32K
pounds of something into a 32-pound sack, but
I can't quite remember the details.

Hoff · ‎08-04-2007

What particular problem or application issue might you be addressing here?

I can think of various possibilities here, and the answers can differ based on the particular local requirements.

What OpenVMS version and what platform?

As for the literal question, there are various approaches toward this hashing.

There's the cryptographic-based MD5 approach. This is comparatively secure against tampering.

There's the GNU perfect hash around. gperf.

There are (many) other hash functions around.

CHECKSUM is intended to spot "innocent" file corruptions and file differences. It's not intended as a hashing function. The OpenVMS V8.2 and later (this is why I ask for version info!) CHECKSUM does feature MD5, FWIW. MD5 also has some issues with OpenVMS VAX (this is why I ask that platform info be included) and particularly with the VCG code generator used by the C compilers there. There are workarounds.

It also appears the design here is heading toward the creation of a relational database here, and I'd investigate MySQL or (if you can locate or can port it) ProgreSQL, or one of the commercial database packages. Rolling your own private database built on RMS -- having seen the results in any number of projects over the years -- starts out looking quite simple and a good and fast approach, and it often seems to degenerate into real work and real problems.

Jon Pinkley · ‎08-04-2007

Thomas,

If I understand your question correctly, you are still going to be storing the complete filename and attributes in the record. Is the hash being done to allow faster lookups, because the buckets will be able to hold more of the keys?

As others have noted, there is no way to guarantee a unique hash if the input to the hash has a larger range than can be represented by the number of bits in the hash. The normal method of dealing with this is to allow for duplicate values of the hash. As long as you use prologue 3 indexed files, you can use a duplicate primary key. Then you would hash the filename read the first record with that key. If the filename stored in that record matches what you hashed, you are done. Otherwise, read the next record, if the key value has changed, then the filename does not exist in your database; if the key value is the same, you are reading another filename that hashed to the same value, if it matches, you have found your filename. Keep reading/checking until the key value changes, or the filename is found.

Some things to consider:

How many records are you planning to store? This will affect the length of the hash you choose. Choose something large enough that collisions will be infrequent, you don't want to have hashes that are the same for 20 files. And expect some duplicates even if you use a hash that is much larger than the number of files. Google "birthday problem" for the reason.

I would also suggest normalizing the name before it is hashed. By this I mean make it something that will be as unique as is possible. You will need to decide how you want to handle rooted file specifications, for example ones link sys$manager:systartup_vms.com

I would not use physical device names; instead I would use what gets returned by f$getdvi(dev,"LOGVOLNAM"). And I would probably use a name similar to what lib$fid_to_name returns, i.e. what you see in the output of show device/files.

But since only you know what your intentions are, those are decisions you will need to make.

Jon

it depends

John Gillings · ‎08-05-2007

Thomas,

Do you mean unique number for a "FILE NAME" or a unique number for a specific file?

If it's the latter, then why not use the FID? It's guaranteed to be unique on the particular device. Indeed, it's used for almost exactly the same purpose you're describing (ie: an index into INDEXF.SYS to store attributes of the file).

The FID is comprised of three 16 bit numbers. As long as you're not using bound volume sets, you can ignore the 3rd number.

Find the FID with

fid=F$FILE_ATTRIBUTES(file,"FID")

to find a file given a FID, use

file=F$FID_TO_NAME(device,fid)

(V8.2 or higher?)

If you want a 32 bit number try:

$ fid=F$FILE(file,"FID")-"("-")"
$ unq==F$INTEGER(F$ELEMENT(1,",",fid))*32768+F$INTEGER(F$ELEMENT(0,",",fid))

and back the other way:

$ dev=F$PARSE(yourdev,,,"DEVICE")
$ s=F$INTEGER(unq)/32768
$ f=F$INTEGER(unq)-s*32768
$ name==F$FID_TO_NAME(dev,"''f',''s'")

A crucible of informative mistakes

Robert Gezelter · ‎08-05-2007

John,

With all due respect, this is one of the few times that I will take fairly strong exception.

The File ID is only guaranteed so long as the volume is intact, or restored in certain, somewhat limited cases.

If the intent is to create a uniform manner to connect data structures to files, a hash of the filename (not to be confused with a cryptographic checksum, which is also, regrettably, referred to as a "hash") is the only way to do it.

There are issues to be sorted out with regards to how one handles logical names. I also strongly recommend that those reading this thread review my earlier citation to Knuth Volume 3.

- Bob Gezelter, http://www.rlgsc.com

Thomas Ritter · ‎08-05-2007

I wrote a simple SFTP GET and PUT procedure from VMS to Linux. The VMS gets the files from the Linux drop box. I have to keep track of the files which have been copied. At each iteration of the DCL a directory listing , ls â l, is performed on the Linux and then selected files are copied to VMS. I need to ensure I do not copy files already copied. Using a sequential history file, I write the file name in the first field and time stamp in the next, and then I keep the ls â l listing in the last field. Checking for file copy history is performed using $search/key=(pos,len) â filenameâ and checking $severity. This works quite well at the moment with about 300 files. But over time I could be dealing with 10,000 files. I want to be able to perform a keyed lookup from DCL. Something like $ read/key=123456, where 123456 is say the uniquely derived file number and if the record is not found, copy the file otherwise skip. The file names are not more 80 characters long. I could use the file name as the key but better would be using a unique file number. Or maybe it makes no difference ? So in a nutshell I want to be check, with the least overhead, if a file has already been successfully copied from Linux to VMS. Lots of these files are gigabytes in size. So keeping files indefinitely on disk is not a long term option.

Hoff · ‎08-05-2007

If the filenames are always unique, use the filename as is. 80 characters isn't worth hashing. Why obfuscate?

800,000 characters, plus overhead, for the log.

I'd investigate other approaches, including NFS over a VPN. Or I'd look at ODBC or such, and connecting the environments as a database. "Store and forward" file transfer is a tried and true technique, though there are other approaches that might potentially be brought to bear.

I would probably look to add a date into each record, and particularly some way to be able to flush out old entries.

The FID is a bad idea for tracking stuff, unless you (also) want to have a rebuild tool. FIDs do change when volumes get reloaded. And FIDs don't help with transfered files for this case (with the added details from Mr Ritter), since it's the name that's key here (pun intended); these files are arriving from a Linux box.

I'd probably connect the transfer between the two nodes with a database. It might look like overkill, yes, but you're (already) building your own transactional software here...

John Gillings · ‎08-05-2007

Thomas,

Ah, that's a much simpler problem ;-)

>Lots of these files are gigabytes in size.
>So keeping files indefinitely on disk is
>not a long term option.

Maybe you can't keep the *contents* of the files long term, but there's nothing to stop you from keeping a directory entry for a file of the same name (with no contents).

That simplifies your logic significantly. No need to think about anything other than just comparing the two lists of names. Since you're not using version numbers, you could use a convention that (say) ;1 is a "real" file with read data, and ;2 or higher is an empty shell. Another possibility is to have 2 directories in a search list. One for real files, and one for the empty shells.

In effect you're using the (or a) directory as your index file. Dates are maintained for you automatically as creation and/or modification dates. You can even delete the contents of the file, while preserving the directory entry:

$ COPY/OVERLAY NL: BIG_FILE.DAT
$ SET FILE/END BIG_FILE.DAT

(file is now empty but creation date is unchanged).

A crucible of informative mistakes

Robert Gezelter · ‎08-05-2007

Thomas,

Thank you for the clarification as to precisely what is intended. It does help.

Offhand, two approaches are logical, there is a tradeoff between short term implementation effort and long term efficiency.

One method, like that used in the MX mail processing package, is to use the file system itself to store the information. In your case, files with name identical to the names of the files on the Linux box could be created in a series of sub-directories (to enable searching, the sub-directories are all part of a single logical name searchlist). Which sub-directory the file resides in depends on something reasonably uniform (e.g., a uniformly distributed character in the filename). Each of these files is very short, say one line. The creation date of each file is the time of the successful transfer.

The second approach uses the 80 character filename as the primary key of an RMS indexed file. Individual records can be created, modified, and deleted from DCL using the READ/WRITE commands.

If the code is written well, one can even migrate from one approach to another with little pain.

Of course, in both cases, the efficiency will likely benefit from periodic reorganization of the directories or indexed files, depending on the approach.

I hope that the above is helpful. If I can be of additional assistance, please let me know.

- Bob Gezelter, http://www.rlgsc.com

Paul Beaudoin · ‎08-12-2007

I'm suprised no one mentioned CRC Cyclical Redundancy Check). It is native to VMS in all versionsand produces a 32 bit number unique to any length string input which sounds like what was asked for. I used this to solve a similar problem once and though statistically it will eventually duplicate a number for different strings, it is a very low possibility (< 1 in 2.1 billion) If interested I can post the implementing (macro) code.

Regards

Paul

Robert Gezelter · ‎08-13-2007

Paul,

With all due respect, I must DISAGREE with your observation about CRC-32. While CRC codes are very good at identifying if a string (buffer) has been altered, they are far from unique (higher math omitted in the interest of brevity). What they do provide is guarantees that a certain number of altered bits will be detected.

This is far from uniqueness (as can easily be implied by the use of SHA-1 et al for validation rather than CRC).

For most lookup purposes, simple xors of substrings divided by a modulus (see my previous Knuth citation) are quite effective.

- Bob Gezelter, http://www.rlgsc.com

Jan van den Ende · ‎08-13-2007

Paul,

in this case there really is only one correct answer, and Bob just gave that.

Mathematics is a very fine art, but it can also be a harsh master. There is just NO way that any hashing guarantees freeness of duplicates, and CRC is "just" that. It only works because any two reductions giving the same answer is exceedingly small. But multiplied by gig number of cases, the chance of =any= two colliding is equivalent to the "same birthday" problem: the chance of a collision a sample of N cases increases with the square of N.
The CRC-32 has a uniqueness in the order 2**32, or 10**10.
The chance of "a" collision will surpass 50% already at "only" 10**5, or 100K instances...
Far fatter chance than winning big in the lotteries. But it happens with every draw...

hth

Proost.

Have one on me.

jpe

Don't rust yours pelled jacker to fine doll missed aches.

Thomas Ritter · ‎08-13-2007

Paul, I would like to look at the code. Might be good enough for my application.

Graham Burley · ‎08-13-2007

Might be worth taking a look at the FTP_NEW DCL procedure available on the OpenVMS Freeware V8 and elsewhere, although it doesn't support sftp (only ftp). It uses an indexed file to track the remote files and their status, including if the file has been actioned (e.g. copied), using the remote filename as the key.

Paul Beaudoin · ‎08-13-2007

Gents,

Thanks for the education. I came across CRC first as a result of networking background (not too suprisingly) and then used it (successfully) to provide fast lookup in RMS files. I always understood that duplication (or collision) was a certainty at some point but couldn't calculate where that was likely to happen. as little as 100k items? Really? Hmmm maybe my app was more good luck that good design....?

Here is the exmple code. I cut it out of the system but left the subroutine intact so you should be able to place this in a library, link aganst it and use it as is. I also left the original comments in for some guidance (though you can ignore most of them) and to help you understand how you may want to modify it. Warning: As in the comments, the complete input string is not crc'd.

Coments welcome. As old as this is improvements always possible

Paul

Robert Gezelter · ‎08-13-2007

Thomas,

CRC will be far less efficient than simply breaking the filename into 32 bit (4 character) segments, exclusively or'ing the segments, and then dividing the result by an appropriately scaled prime number.

The family of CRC functions is designed to detect changes in bits. They are not designed to produce uniformly distributed mappings.

Hashing functions are designed to produce uniformly distributed results. Hash functions have been used for this purpose for decades.

As Jan noted, the math is the math.

- Bob Gezelter, http://www.rlgsc.com

Jon Pinkley · ‎08-14-2007

If you are interested in the math, here's an interesting page.

http://en.wikipedia.org/wiki/Birthday_attack

It gives the number of random 32 bit numbers you would need to generate to get more than a 50% chance of getting a duplicate as approximately 77000. But by the time you get to 200000, you have greater than 99% change of a collision. In other words, when you have generated 5% of the 2^32 random values, you have more than a 99% probability of having generated at least one duplicate value.

Therefore, you must be prepared to deal with duplicates when using hashes.

it depends

Jan van den Ende · ‎08-14-2007

Jon,

>>>
But by the time you get to 200000, you have greater than 99% change of a collision. In other words, when you have generated 5% of the 2^32 random values, you have more than a 99% probability of having generated at least one duplicate value.
<<<
Well, in _MY_ arithmetic 5% of 2**32 would be about 1000 times your 200000, so it is not nearly as bad as you make it, but your 50% value of 77000 _IS_ correct (like I wrote, in the order of 10**5).

I feel like simple arithmetics deserves this correction.

fwiw

Proost.

Have one on me.

jpe

Don't rust yours pelled jacker to fine doll missed aches.

Jan van den Ende · ‎08-15-2007

Bweuh!!!!

So my last response HAD its point, but the reason is the other way around! The 99% chance for about 200K cases is correct, but that is _NOT_ 5% of the pick, 99% is reached already at 0.005 % !!

Should teach me not to react to only one part of the picture! :-(

Proost.

Have one on me.

jpe

Don't rust yours pelled jacker to fine doll missed aches.

Jon Pinkley · ‎08-15-2007

Jan,

Thanks for pointing out my flawed calculation.

The point the we both were trying to make is that you don't need to get to a very large number of items before the probability of duplication becomes quite high, and the increase with number â drawnâ is not linear.

The reason is that with each "draw", it gets harder and harder not to get a duplicate, because the unused pool is getting smaller and smaller. Jan said this in a previous note.

P.S. If someone is going to test this using a computer simulation, you must not use a pseudo random generator that has a cycle close to 2^32. That's like giving someone a deck of 52 cards, shuffling them well, and then asking them to take one card at a time without replacement until they get a duplicate. (if the deck was not flawed, they never will get a duplicate). That's different than if they pull one, add it back, reshuffle, etc. You need to use a generator with a much higher cycle, (like near 2^48 or 2^64) and use only 32 bits of each "sample").

Or even better, use the hashes of filenames like the ones you plan to use.

Jon

it depends

Robert Gezelter · ‎08-15-2007

Jon,

Your comments are PRECISELY the reason that my original response on this topic referenced Knuth's book.

There is an extensive literature on this subject. It is far safer to spend some time reviewing the literature than it is to re-discover all of the issues from scratch.

- Bob Gezelter, http://www.rlgsc.com

Jon Pinkley · ‎08-15-2007

Bob>>>"It is far safer to spend some time reviewing the literature than it is to re-discover all of the issues from scratch."<<<

Bob, this is extremely good advice, and applies in the general case as well as this specific one.

There really is no longer a good excuse not to do some preliminary searches for prior experience, since it is now so easy to do with search engines. There are surprisingly few problems that haven't had previous work done on them. It must be human nature to think you have discovered some new problem, which has no prior art. I just read an article that Hoff referenced on his reading list, http://64.223.189.234/node/450 in which he references http://www.shirky.com/writings/group_enemy.html

One observation made in that article is that people keep "discovering" the same things, even though someone has written about it.

This was true in DEC as well, where many good ideas from TOPS-10 and TOPS-20 were lost (ignored?) when VMS was being developed But perhaps that was more political than technical, as the two groups were "competing".

And, I agree with you about Knuth. His books are Classics, although I don't consider them "light reading" (not that you claimed they were).

it depends

Categories

Company

Local Language

Forums

Discussions

Forums

Discussions

Forums

Discussions

Forums

Discussions

Forums

Discussions

Discussions

Forums

Forums

Discussions

Forums

Discussions

Forums

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Community

Resources

Other HPE Sites

Discussions

Forums

Blogs

Re: Derive a unique number from a file name ?

Derive a unique number from a file name ?