08-03-2007 09:19 PM
Derive a unique number from a file name ?
08-03-2007 10:04 PM
A file name may be short enough that something like MD5 doesn't work well at providing unique hashes. (I don't know the science behind these functions to be certain.)
The VMS file id will be unique on a per disk basis, so if the files are spread across a number of disks, then you'll need something else as well.
08-03-2007 10:38 PM
Short of keeping a persistent table of filenames that are assigned numbers, the simple answer is NO.
Filenames (including directory and device names) can be rather lengthy. Any scheme that takes a VERY long string and reduces it to a fixed length number will have some probability of producing identical fixed length numbers from different strings. Generically, these indexing techniques are referred to as "hashing"; the functions used to collapse arbitrary (or somewhat arbitrary) strings to an index of fixed range are referred to as "hashing functions". The classic reference for this is Donald Knuth's "Art of Computer Programming: Volume 3: Sorting and Searching", Chapter 6, in particular Section 6.4 (admittedly this is a classic reference; it is copyright 1973).
If it is possible to identify some limits on the filenames that turn the hash function into a strict compression function, then the above warning about collisions does not apply.
The same restriction applies to the cryptographic hashes (e.g., SHA-1, MD5). It is effectively an implication of basic Information Theory. Collapsing information in such a way that it cannot be reconstituted means that there will be collisions that, without the original data, are not resolvable.
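The counting argument above can be made concrete with a deliberately small hash. The following is only an illustrative sketch: the 16-bit truncation of MD5 and the FILE_n.DAT names are assumptions chosen to make a collision quick to find, not anything proposed in the thread.

```python
import hashlib


def truncated_hash(name: str, bits: int = 16) -> int:
    """Return the top `bits` bits of the MD5 digest of `name` as an integer."""
    digest = hashlib.md5(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") >> (128 - bits)


def first_collision(bits: int = 16):
    """Hash hypothetical names FILE_0.DAT, FILE_1.DAT, ... until two share a value."""
    seen = {}
    # Pigeonhole: 2**bits + 1 distinct inputs cannot all map to distinct values.
    for i in range(2 ** bits + 1):
        name = f"FILE_{i}.DAT"
        h = truncated_hash(name, bits)
        if h in seen:
            return seen[h], name, h
        seen[h] = name
    raise AssertionError("unreachable: more inputs than possible hash values")


if __name__ == "__main__":
    a, b, h = first_collision()
    print(f"{a} and {b} both hash to {h}")
```

The same argument holds for a 32-bit or 128-bit result; only the number of inputs needed to force a collision changes.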
If the filenames can be restricted, such that a compression scheme works appropriately, rather than a hashing scheme, that is, of course, a far different situation. However, I would recommend great caution. It is not difficult to imagine how such a scheme could be violated unknowingly in the future, with serious repercussions.
If I have been unclear, or can be of further assistance, please let me know.
- Bob Gezelter, http://www.rlgsc.com
08-04-2007 03:25 AM
Even on ODS5, where a complete file
specification can be up to 4095 characters
long, that's less than 32K bits. Of course,
if you were looking for a number more like
32 bits than 32K bits, then, for the usual
meaning of "unique", you're probably out of
luck. Unless you wish to _rely_ on luck to
avoid hash collisions.
> [...] no versions are involved [...]
Adding another 16 bits would be the least of
your problems.
If you're willing to relax the uniqueness
requirement, then one of the popular hash
functions could probably be made to serve,
but then you do need to handle the hash
collision cases.
There's an old saying about trying to put 32K
pounds of something into a 32-pound sack, but
I can't quite remember the details.
08-04-2007 03:53 AM
I can think of various possibilities here, and the answers can differ based on the particular local requirements.
What OpenVMS version and what platform?
As for the literal question, there are various approaches toward this hashing.
There's the cryptographic-based MD5 approach. This is comparatively secure against tampering.
There's the GNU perfect hash function generator, gperf.
There are (many) other hash functions around.
CHECKSUM is intended to spot "innocent" file corruptions and file differences. It's not intended as a hashing function. The OpenVMS V8.2 and later (this is why I ask for version info!) CHECKSUM does feature MD5, FWIW. MD5 also has some issues with OpenVMS VAX (this is why I ask that platform info be included) and particularly with the VCG code generator used by the C compilers there. There are workarounds.
It also appears the design is heading toward the creation of a relational database, and I'd investigate MySQL or (if you can locate or can port it) PostgreSQL, or one of the commercial database packages. Rolling your own private database built on RMS -- having seen the results in any number of projects over the years -- starts out looking quite simple and a good and fast approach, and it often seems to degenerate into real work and real problems.
08-04-2007 10:38 AM
If I understand your question correctly, you are still going to be storing the complete filename and attributes in the record. Is the hash being done to allow faster lookups, because the buckets will be able to hold more of the keys?
As others have noted, there is no way to guarantee a unique hash if the input to the hash has a larger range than can be represented by the number of bits in the hash. The normal method of dealing with this is to allow for duplicate values of the hash. As long as you use prologue 3 indexed files, you can use a duplicate primary key. You would then hash the filename and read the first record with that key. If the filename stored in that record matches what you hashed, you are done. Otherwise, read the next record. If the key value has changed, the filename does not exist in your database. If the key value is the same, you are reading another filename that hashed to the same value; if it matches, you have found your filename. Keep reading and checking until the key value changes or the filename is found.
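The lookup procedure described above can be sketched in a few lines. This is not RMS: it is a minimal in-memory stand-in where a sorted list plays the role of an indexed file with duplicate keys allowed. The class name, the `hash_fn` parameter, and the use of `zlib.crc32` as the default hash are all assumptions made for illustration.

```python
import bisect
import zlib


def name_hash(filename: str) -> int:
    # Stand-in hash (CRC-32 from zlib); any fixed-width hash could be used here.
    return zlib.crc32(filename.encode("utf-8"))


class HashedIndex:
    """Records kept in hash-key order; duplicate keys are allowed."""

    def __init__(self, hash_fn=name_hash):
        self._hash_fn = hash_fn
        self._keys = []     # sorted hash keys, possibly with duplicates
        self._records = []  # (filename, attributes), parallel to _keys

    def insert(self, filename, attributes):
        key = self._hash_fn(filename)
        pos = bisect.bisect_right(self._keys, key)
        self._keys.insert(pos, key)
        self._records.insert(pos, (filename, attributes))

    def lookup(self, filename):
        key = self._hash_fn(filename)
        pos = bisect.bisect_left(self._keys, key)  # first record with this key
        while pos < len(self._keys) and self._keys[pos] == key:
            stored_name, attributes = self._records[pos]
            if stored_name == filename:
                return attributes
            pos += 1  # same key, different name: a collision, keep reading
        return None  # key changed or end of index: the name is not stored


if __name__ == "__main__":
    index = HashedIndex(hash_fn=len)  # deliberately weak hash to force collisions
    index.insert("A.DAT", {"size": 1})
    index.insert("B.DAT", {"size": 2})
    print(index.lookup("B.DAT"), index.lookup("Z.DAT"))
```

Because the full filename is stored in each record and compared on every candidate, a collision costs an extra read but never returns the wrong record.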
Some things to consider:
How many records are you planning to store? This will affect the length of the hash you choose. Choose something large enough that collisions will be infrequent; you don't want to have hashes that are the same for 20 files. And expect some duplicates even if you use a hash that is much larger than the number of files. Google "birthday problem" for the reason.
I would also suggest normalizing the name before it is hashed. By this I mean make it something that will be as unique as is possible. You will need to decide how you want to handle rooted file specifications, for example ones like sys$manager:systartup_vms.com
I would not use physical device names; instead I would use what gets returned by f$getdvi(dev,"LOGVOLNAM"). And I would probably use a name similar to what lib$fid_to_name returns, i.e. what you see in the output of show device/files.
But since only you know what your intentions are, those are decisions you will need to make.
Jon
08-05-2007 11:25 AM
Do you mean unique number for a "FILE NAME" or a unique number for a specific file?
If it's the latter, then why not use the FID? It's guaranteed to be unique on the particular device. Indeed, it's used for almost exactly the same purpose you're describing (ie: an index into INDEXF.SYS to store attributes of the file).
The FID is comprised of three 16 bit numbers. As long as you're not using bound volume sets, you can ignore the 3rd number.
Find the FID with
fid=F$FILE_ATTRIBUTES(file,"FID")
to find a file given a FID, use
file=F$FID_TO_NAME(device,fid)
(V8.2 or higher?)
If you want a 32 bit number try:
$ fid=F$FILE(file,"FID")-"("-")"
$ unq==F$INTEGER(F$ELEMENT(1,",",fid))*32768+F$INTEGER(F$ELEMENT(0,",",fid))
and back the other way:
$ dev=F$PARSE(yourdev,,,"DEVICE")
$ s=F$INTEGER(unq)/32768
$ f=F$INTEGER(unq)-s*32768
$ name==F$FID_TO_NAME(dev,"''f',''s'")
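One caveat on the arithmetic above: with a multiplier of 32768 (2**15), the round trip only recovers the original pair while the file number stays below 32768. The sketch below models the same packing in Python rather than DCL, taking the post's "16 bit numbers" at face value; the function names are hypothetical. Packing with 65536 keeps the two fields from overlapping, though in DCL itself the result could go negative for sequence numbers of 32768 and above, since DCL integers are signed 32-bit.

```python
def pack_fid(file_number: int, sequence: int) -> int:
    """Pack two 16-bit values into one 32-bit value without overlap."""
    if not (0 <= file_number < 2 ** 16 and 0 <= sequence < 2 ** 16):
        raise ValueError("this sketch assumes both values fit in 16 bits")
    return (sequence << 16) | file_number


def unpack_fid(packed: int):
    """Inverse of pack_fid: return (file_number, sequence)."""
    return packed & 0xFFFF, packed >> 16


def pack_fid_32768(file_number: int, sequence: int) -> int:
    """Same arithmetic as the DCL above: sequence * 32768 + file_number."""
    return sequence * 32768 + file_number


def unpack_fid_32768(packed: int):
    """Same arithmetic as the DCL above, returning (file_number, sequence)."""
    s = packed // 32768
    return packed - s * 32768, s


if __name__ == "__main__":
    print(unpack_fid(pack_fid(40000, 5)))              # (40000, 5)
    print(unpack_fid_32768(pack_fid_32768(40000, 5)))  # (7232, 6)
```

With the 32768 multiplier, file number 40000 spills into the sequence field, so a different (and possibly valid) FID comes back.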
08-05-2007 11:53 AM
With all due respect, this is one of the few times that I will take fairly strong exception.
The File ID is only guaranteed so long as the volume is intact, or restored in certain, somewhat limited cases.
If the intent is to create a uniform manner to connect data structures to files, a hash of the filename (not to be confused with a cryptographic checksum, which is also, regrettably, referred to as a "hash") is the only way to do it.
There are issues to be sorted out with regards to how one handles logical names. I also strongly recommend that those reading this thread review my earlier citation to Knuth Volume 3.
- Bob Gezelter, http://www.rlgsc.com
08-05-2007 12:28 PM
08-05-2007 01:44 PM
800,000 characters, plus overhead, for the log.
I'd investigate other approaches, including NFS over a VPN. Or I'd look at ODBC or such, and connecting the environments as a database. "Store and forward" file transfer is a tried and true technique, though there are other approaches that might potentially be brought to bear.
I would probably look to add a date into each record, and particularly some way to be able to flush out old entries.
The FID is a bad idea for tracking stuff, unless you (also) want to have a rebuild tool. FIDs do change when volumes get reloaded. And FIDs don't help with transferred files for this case (with the added details from Mr Ritter), since it's the name that's key here (pun intended); these files are arriving from a Linux box.
I'd probably connect the transfer between the two nodes with a database. It might look like overkill, yes, but you're (already) building your own transactional software here...
08-05-2007 05:43 PM
Ah, that's a much simpler problem ;-)
>Lots of these files are gigabytes in size.
>So keeping files indefinitely on disk is
>not a long term option.
Maybe you can't keep the *contents* of the files long term, but there's nothing to stop you from keeping a directory entry for a file of the same name (with no contents).
That simplifies your logic significantly. No need to think about anything other than just comparing the two lists of names. Since you're not using version numbers, you could use a convention that (say) ;1 is a "real" file with real data, and ;2 or higher is an empty shell. Another possibility is to have 2 directories in a search list: one for real files, and one for the empty shells.
In effect you're using the (or a) directory as your index file. Dates are maintained for you automatically as creation and/or modification dates. You can even delete the contents of the file, while preserving the directory entry:
$ COPY/OVERLAY NL: BIG_FILE.DAT
$ SET FILE/END BIG_FILE.DAT
(file is now empty but creation date is unchanged).
08-05-2007 09:51 PM
Thank you for the clarification as to precisely what is intended. It does help.
Offhand, two approaches are logical; there is a tradeoff between short term implementation effort and long term efficiency.
One method, like that used in the MX mail processing package, is to use the file system itself to store the information. In your case, files with names identical to the names of the files on the Linux box could be created in a series of sub-directories (to enable searching, the sub-directories are all part of a single logical name searchlist). Which sub-directory the file resides in depends on something reasonably uniform (e.g., a uniformly distributed character in the filename). Each of these files is very short, say one line. The creation date of each file is the time of the successful transfer.
The second approach uses the 80 character filename as the primary key of an RMS indexed file. Individual records can be created, modified, and deleted from DCL using the READ/WRITE commands.
If the code is written well, one can even migrate from one approach to another with little pain.
Of course, in both cases, the efficiency will likely benefit from periodic reorganization of the directories or indexed files, depending on the approach.
I hope that the above is helpful. If I can be of additional assistance, please let me know.
- Bob Gezelter, http://www.rlgsc.com
08-12-2007 11:43 PM
Regards
Paul
08-13-2007 01:25 AM
With all due respect, I must DISAGREE with your observation about CRC-32. While CRC codes are very good at identifying if a string (buffer) has been altered, they are far from unique (higher math omitted in the interest of brevity). What they do provide is guarantees that a certain number of altered bits will be detected.
This is far from uniqueness (as can easily be implied by the use of SHA-1 et al for validation rather than CRC).
For most lookup purposes, simple xors of substrings divided by a modulus (see my previous Knuth citation) are quite effective.
- Bob Gezelter, http://www.rlgsc.com
08-13-2007 02:36 AM
In this case there really is only one correct answer, and Bob just gave that.
Mathematics is a very fine art, but it can also be a harsh master. There is just NO way that any hashing guarantees freedom from duplicates, and CRC is "just" that. It only works because the chance of any two reductions giving the same answer is exceedingly small. But multiplied by a large number of cases, the chance of =any= two colliding is equivalent to the "same birthday" problem: the chance of a collision in a sample of N cases increases with the square of N.
The CRC-32 has a uniqueness on the order of 2**32, or about 4 * 10**9.
The chance of "a" collision will surpass 50% already at "only" 10**5, or 100K instances...
Far fatter chance than winning big in the lotteries. But it happens with every draw...
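For anyone who wants to check the order of magnitude, the usual birthday-problem approximation is P(collision) = 1 - exp(-n(n-1)/(2N)) for n items drawn uniformly from N possible values. The sketch below assumes the hash values behave like uniform random 32-bit numbers, which is an idealization; the function names are made up for this example.

```python
import math


def collision_probability(n: int, bits: int = 32) -> float:
    """Approximate chance that n uniform random `bits`-bit values contain a duplicate."""
    space = 2 ** bits
    return -math.expm1(-n * (n - 1) / (2 * space))


def items_for_probability(p: float, bits: int = 32) -> int:
    """Approximate number of items at which the collision chance reaches p."""
    space = 2 ** bits
    return math.ceil(math.sqrt(2 * space * math.log(1 / (1 - p))))


if __name__ == "__main__":
    for n in (10_000, 77_163, 100_000, 200_000):
        print(n, round(collision_probability(n), 4))
    print("items for a 50% chance:", items_for_probability(0.5))
```

Under that idealization the 50% point lands near 77,000 items for a 32-bit value, consistent with the "order of 10**5" estimate. Real filenames and a real hash may do worse if the hash does not distribute them evenly.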
hth
Proost.
Have one on me.
jpe
08-13-2007 09:50 AM
08-13-2007 12:36 PM
08-13-2007 09:21 PM
Thanks for the education. I came across CRC first as a result of a networking background (not too surprisingly) and then used it (successfully) to provide fast lookup in RMS files. I always understood that duplication (or collision) was a certainty at some point but couldn't calculate where that was likely to happen. As little as 100k items? Really? Hmmm, maybe my app was more good luck than good design....?
Here is the example code. I cut it out of the system but left the subroutine intact so you should be able to place this in a library, link against it and use it as is. I also left the original comments in for some guidance (though you can ignore most of them) and to help you understand how you may want to modify it. Warning: As in the comments, the complete input string is not crc'd.
Comments welcome. As old as this is, improvements are always possible.
Paul
08-13-2007 09:25 PM
CRC will be far less efficient than simply breaking the filename into 32 bit (4 character) segments, exclusively or'ing the segments, and then dividing the result by an appropriately scaled prime number.
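A rough sketch of the XOR-and-divide scheme just described follows. Several details are assumptions not given in the post: "dividing" is read as taking the remainder, the final short segment is padded with zero bytes, segments are interpreted little-endian, and 1009 is a hypothetical prime table size.

```python
def xor_fold_hash(filename: str, modulus: int = 1009) -> int:
    """XOR together 4-byte segments of the name, then reduce modulo a prime."""
    data = filename.encode("utf-8")
    acc = 0
    for i in range(0, len(data), 4):
        # Pad the final short segment with zero bytes (an assumption of this sketch).
        segment = data[i:i + 4].ljust(4, b"\0")
        acc ^= int.from_bytes(segment, "little")
    return acc % modulus


if __name__ == "__main__":
    for name in ("LOGIN.COM", "SYSTARTUP_VMS.COM", "BIG_FILE.DAT"):
        print(name, xor_fold_hash(name))
```

One property worth knowing before relying on it: XOR is order-insensitive, so names whose 4-character segments are merely rearranged (for example ABCDEFGH and EFGHABCD) always land on the same value. Knuth's Section 6.4, cited earlier in the thread, discusses refinements.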
The family of CRC functions is designed to detect changes in bits. They are not designed to produce uniformly distributed mappings.
Hashing functions are designed to produce uniformly distributed results. Hash functions have been used for this purpose for decades.
As Jan noted, the math is the math.
- Bob Gezelter, http://www.rlgsc.com
08-14-2007 10:05 PM
http://en.wikipedia.org/wiki/Birthday_attack
It gives the number of random 32 bit numbers you would need to generate to get more than a 50% chance of getting a duplicate as approximately 77000. But by the time you get to 200000, you have greater than 99% chance of a collision. In other words, when you have generated 5% of the 2^32 random values, you have more than a 99% probability of having generated at least one duplicate value.
Therefore, you must be prepared to deal with duplicates when using hashes.
08-14-2007 10:26 PM
>>>
But by the time you get to 200000, you have greater than 99% chance of a collision. In other words, when you have generated 5% of the 2^32 random values, you have more than a 99% probability of having generated at least one duplicate value.
<<<
Well, in _MY_ arithmetic 5% of 2**32 would be about 1000 times your 200000, so it is not nearly as bad as you make it, but your 50% value of 77000 _IS_ correct (like I wrote, on the order of 10**5).
I feel like simple arithmetic deserves this correction.
fwiw
Proost.
Have one on me.
jpe
08-15-2007 01:29 AM
So my last response HAD its point, but the reason is the other way around! The 99% chance for about 200K cases is correct, but that is _NOT_ 5% of the pick, 99% is reached already at 0.005 % !!
Should teach me not to react to only one part of the picture! :-(
Proost.
Have one on me.
jpe
08-15-2007 04:38 AM
Thanks for pointing out my flawed calculation.
The point that we both were trying to make is that you don't need to get to a very large number of items before the probability of duplication becomes quite high, and the increase with the number "drawn" is not linear.
The reason is that with each "draw", it gets harder and harder not to get a duplicate, because the unused pool is getting smaller and smaller. Jan said this in a previous note.
P.S. If someone is going to test this using a computer simulation, you must not use a pseudo random generator that has a cycle close to 2^32. That's like giving someone a deck of 52 cards, shuffling them well, and then asking them to take one card at a time without replacement until they get a duplicate. (If the deck was not flawed, they never will get a duplicate.) That's different than if they pull one, add it back, reshuffle, etc. You need to use a generator with a much longer cycle (like near 2^48 or 2^64) and use only 32 bits of each "sample".
Or even better, use the hashes of filenames like the ones you plan to use.
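A minimal version of such a simulation is sketched below. It relies on Python's `random` module, whose Mersenne Twister generator has a period of 2**19937-1, far beyond 2^32, so the "deck without replacement" problem does not arise. The function name and the trial count are arbitrary choices for this example.

```python
import random


def draws_until_duplicate(bits: int = 32, seed=None) -> int:
    """Draw random `bits`-bit values until one repeats; return the number of draws."""
    rng = random.Random(seed)
    seen = set()
    while True:
        value = rng.getrandbits(bits)
        if value in seen:
            return len(seen) + 1  # count the draw that produced the duplicate
        seen.add(value)


if __name__ == "__main__":
    trials = [draws_until_duplicate(32, seed=s) for s in range(20)]
    print("draws until first duplicate, per trial:", sorted(trials))
    print("median:", sorted(trials)[len(trials) // 2])
```

To test real data instead, replace the `getrandbits` call with the hash of each filename from an actual listing; that measures the hash function as well as the arithmetic.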
Jon
08-15-2007 05:54 AM
Your comments are PRECISELY the reason that my original response on this topic referenced Knuth's book.
There is an extensive literature on this subject. It is far safer to spend some time reviewing the literature than it is to re-discover all of the issues from scratch.
- Bob Gezelter, http://www.rlgsc.com
08-15-2007 10:42 AM
Bob, this is extremely good advice, and applies in the general case as well as this specific one.
There really is no longer a good excuse not to do some preliminary searches for prior experience, since it is now so easy to do with search engines. There are surprisingly few problems that haven't had previous work done on them. It must be human nature to think you have discovered some new problem, which has no prior art. I just read an article that Hoff referenced on his reading list, http://64.223.189.234/node/450 in which he references http://www.shirky.com/writings/group_enemy.html
One observation made in that article is that people keep "discovering" the same things, even though someone has written about it.
This was true in DEC as well, where many good ideas from TOPS-10 and TOPS-20 were lost (ignored?) when VMS was being developed. But perhaps that was more political than technical, as the two groups were "competing".
And, I agree with you about Knuth. His books are Classics, although I don't consider them "light reading" (not that you claimed they were).