Operating System - OpenVMS
1753516 Members
5261 Online
108795 Solutions
New Discussion юеВ

Re: Derive a unique number from a file name ?

 
Jan van den Ende
Honored Contributor

Re: Derive a unique number from a file name ?

Jon,

>>>
But by the time you get to 200000, you have greater than 99% change of a collision. In other words, when you have generated 5% of the 2^32 random values, you have more than a 99% probability of having generated at least one duplicate value.
<<<
Well, in _MY_ arithmetic 5% of 2**32 would be about 1000 times your 200000, so it is not nearly as bad as you make it, but your 50% value of 77000 _IS_ correct (like I wrote, in the order of 10**5).

I feel like simple arithmetics deserves this correction.

fwiw

Proost.

Have one on me.

jpe
Don't rust yours pelled jacker to fine doll missed aches.
Jan van den Ende
Honored Contributor

Re: Derive a unique number from a file name ?

Bweuh!!!!

So my last response HAD its point, but the reason is the other way around! The 99% chance for about 200K cases is correct, but that is _NOT_ 5% of the pick, 99% is reached already at 0.005 % !!

Should teach me not to react to only one part of the picture! :-(

Proost.

Have one on me.

jpe
Don't rust yours pelled jacker to fine doll missed aches.
Jon Pinkley
Honored Contributor

Re: Derive a unique number from a file name ?

Jan,

Thanks for pointing out my flawed calculation.

The point the we both were trying to make is that you don't need to get to a very large number of items before the probability of duplication becomes quite high, and the increase with number ├в drawn├в is not linear.

The reason is that with each "draw", it gets harder and harder not to get a duplicate, because the unused pool is getting smaller and smaller. Jan said this in a previous note.

P.S. If someone is going to test this using a computer simulation, you must not use a pseudo random generator that has a cycle close to 2^32. That's like giving someone a deck of 52 cards, shuffling them well, and then asking them to take one card at a time without replacement until they get a duplicate. (if the deck was not flawed, they never will get a duplicate). That's different than if they pull one, add it back, reshuffle, etc. You need to use a generator with a much higher cycle, (like near 2^48 or 2^64) and use only 32 bits of each "sample").

Or even better, use the hashes of filenames like the ones you plan to use.

Jon
it depends
Robert Gezelter
Honored Contributor

Re: Derive a unique number from a file name ?

Jon,

Your comments are PRECISELY the reason that my original response on this topic referenced Knuth's book.

There is an extensive literature on this subject. It is far safer to spend some time reviewing the literature than it is to re-discover all of the issues from scratch.

- Bob Gezelter, http://www.rlgsc.com
Jon Pinkley
Honored Contributor

Re: Derive a unique number from a file name ?

Bob>>>"It is far safer to spend some time reviewing the literature than it is to re-discover all of the issues from scratch."<<<

Bob, this is extremely good advice, and applies in the general case as well as this specific one.

There really is no longer a good excuse not to do some preliminary searches for prior experience, since it is now so easy to do with search engines. There are surprisingly few problems that haven't had previous work done on them. It must be human nature to think you have discovered some new problem, which has no prior art. I just read an article that Hoff referenced on his reading list, http://64.223.189.234/node/450 in which he references http://www.shirky.com/writings/group_enemy.html

One observation made in that article is that people keep "discovering" the same things, even though someone has written about it.

This was true in DEC as well, where many good ideas from TOPS-10 and TOPS-20 were lost (ignored?) when VMS was being developed But perhaps that was more political than technical, as the two groups were "competing".

And, I agree with you about Knuth. His books are Classics, although I don't consider them "light reading" (not that you claimed they were).
it depends
Robert Gezelter
Honored Contributor

Re: Derive a unique number from a file name ?

Jon,

I heartily concur. RTFT (Read The Fine Textbook) ought to be enshrined next to RTFM.

There are certainly any number of extremely competent algorithms and compilers texts describing the issues involved in hashed lookups and collisions. My standing recommendation is to find the text used in a first or second term graduate course algorithms course in a Computer Science department.

My reference to Knuth was mainly that his books are one of the oldest and most complete references.

- Bob Gezelter, http://www.rlgsc.com