
select random lines from a very large file

 
Mike Patterson
Frequent Advisor

I've been using RANDOM to generate random line numbers for a file with several thousand lines, where it works fine. Now I want to do the same thing on a file with 2 million lines, but there the script is generating both positive and negative numbers. I'm guessing the limitation is that RANDOM only goes up to 32767. Any ideas on sampling random lines from the large file?

Code snippet:

LOG=/tmp/mylog
total=`wc -l "$LOG" | awk '{print $1}'`
# echo "total: $total"

x=0
while [ $x -lt 50 ]
do

# Get a random number:
rand=$(((($RANDOM*$total)/32767)+1))
# echo "Random line number is $rand"

x=`echo "$x + 1" | bc`

done
James R. Ferguson
Acclaimed Contributor

Re: select random lines from a very large file

Hi Mike:

You could do:

# rand=$(perl -le '$x=1;$y=2_000_000;$n=$x+int rand($y-$x+1);print $n')

...which will return a random integer between 'x' and 'y' or in this case between 1 and 2,000,000 inclusive.
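As for the negative numbers: if the shell does its arithmetic in signed 32-bit integers (an assumption about the shell in use), then $RANDOM*$total can reach 32767 * 2,000,000 = 65,534,000,000, far above the 2,147,483,647 limit, so the product wraps around. If you would rather stay in the shell, a minimal sketch is to combine two RANDOM draws into one 30-bit value and reduce it modulo the line count. The function name rand_line is hypothetical, and the sketch assumes a shell that provides RANDOM (bash or ksh) and a line count of at most 2^30:

```shell
#!/bin/bash
# Hypothetical helper: set the variable "rand" to an integer between
# 1 and $1 inclusive. Assumes $1 is at most 2^30 (1073741824).
# The result is returned in a variable rather than via $(...) so that
# every call draws from the same RANDOM sequence in the current shell.
rand_line() {
    # Two 15-bit draws combined into one 30-bit value (0..1073741823),
    # which stays inside signed 32-bit arithmetic.
    r=$(( (RANDOM << 15) | RANDOM ))
    # Modulo introduces a slight bias; acceptable for spot checks.
    rand=$(( r % $1 + 1 ))
}

total=2000000
rand_line "$total"
echo "Random line number is $rand"
```

The modulo step is not perfectly uniform, so the Perl one-liner is still the better choice when an even distribution matters.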

Regards!

...JRF...
Mike Patterson
Frequent Advisor

Re: select random lines from a very large file

JRF -

Your perl command works well on the command line, but when I paste it into my script, I get this:

Undefined subroutine &main::RAND called at

I've tried a few variations (including the full path to perl), but the error persists.

- Mike P
James R. Ferguson
Acclaimed Contributor

Re: select random lines from a very large file

Hi Mike:

> Undefined subroutine &main::RAND called at

...means you didn't copy-and-paste because 'rand' isn't uppercase in Perl :-)

Regards!

...JRF...
Mike Patterson
Frequent Advisor

Re: select random lines from a very large file

James -

Right on. After pasting in the perl line, I had done a global replacement of rand with RAND (because I like upper-case variables). Of course, it changed the perl rand to RAND as well.

Thanks, I've got this running in my script and randomly selecting lines in a 2 million line file.

- Mike
James R. Ferguson
Acclaimed Contributor
Solution

Re: select random lines from a very large file

Hi (again) Mike:

> Thanks

If you are happy with the answers you received, please don't forget to assign points:

http://h30499.www3.hp.com/t5/help/faqpage/faq-category-id/kudos#kudos


Regards!

...JRF...

Mike Patterson
Frequent Advisor

Re: select random lines from a very large file

James -

I generally wait a bit for multiple answers before assigning points.

But, I'm quite happy to have my script running (it's doing a cksum on randomly selected identical files in two locations). So, you've got 10 points from me.

Thanks again,

Mike
Dennis Handly
Acclaimed Contributor

Re: select random lines from a very large file

>Any ideas on getting random lines sampled from the large file?

How often do you do this? Do you care about performance?
Perhaps you should select files based on the next full line after a random byte offset.
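A rough sketch of the byte-offset idea, assuming the file is smaller than 2^30 bytes and that tail supports the POSIX "-c +N" form. The function name random_line_by_offset is hypothetical. Note the trade-offs: the first line of the file is never picked, blank lines are skipped, and a line that follows a long line is more likely to be chosen than one that follows a short line.

```shell
#!/bin/bash
# Hypothetical helper: print the first complete line that starts after a
# random byte offset in file $1. Assumes the file is under 2^30 bytes.
random_line_by_offset() {
    file=$1
    size=$(( $(wc -c < "$file") ))
    [ "$size" -gt 0 ] || return 1
    tries=0
    while [ "$tries" -lt 20 ]; do
        # 30-bit random value built from two 15-bit RANDOM draws.
        r=$(( (RANDOM << 15) | RANDOM ))
        offset=$(( r % size + 1 ))
        # Output starts at byte $offset; the first line is usually a
        # fragment, so take the second line and stop reading.
        line=$(tail -c +"$offset" "$file" | sed -n '2{p;q;}')
        if [ -n "$line" ]; then
            printf '%s\n' "$line"
            return 0
        fi
        # The offset fell in the last line (or before a blank line): retry.
        tries=$(( tries + 1 ))
    done
    return 1
}

LOG=/tmp/mylog   # path taken from the original post
if [ -r "$LOG" ]; then
    random_line_by_offset "$LOG"
fi
```

This avoids counting the lines with wc -l first, which may matter if the sampling runs often on a file this size.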

>x=`echo "$x + 1" | bc`

You can use: (( x += 1 ))
Or, if you declare x as an integer with: typeset -i x
then a plain assignment does the arithmetic: x=$x+1
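For illustration, the loop from the original post with the counter done in shell arithmetic, so bc is not forked on every pass. This is only a sketch of the loop skeleton; the body is left as a placeholder comment:

```shell
#!/bin/bash
# Loop 50 times, incrementing the counter with shell arithmetic
# instead of piping an expression through bc.
x=0
while [ "$x" -lt 50 ]
do
    # ... pick and process a random line here ...
    (( x += 1 ))
done
echo "iterations: $x"
```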