Operating System - HP-UX

Henk Geurts
Esteemed Contributor

selecting lines from huge files

Hi all

I have got files (>1.000.000 lines) with lines like:
10000000000000666447024 1887282889 2000828080826 W+000000000,00UR

Now I have to select all lines containing
certain numbers in characters 3 to 17.

My file containing these numbers is 1.400.000 lines long,
looking like:
..
000000001853208
000000001853210
000000001853211
000000001853214
..

I am looking for an efficient and quick way.
(I tried using for loops/while loops, but that was not effective.)

Any solutions? Perl? awk?
Oviwan
Honored Contributor

Re: selecting lines from huge files

Hi

Where should this number be?
Should characters 3 through 17 of each line, inclusive, be checked?

Regards
James R. Ferguson
Acclaimed Contributor

Re: selecting lines from huge files

Hi Henk:

This is similar to your previous query:

http://forums12.itrc.hp.com/service/forums/questionanswer.do?threadId=1270250

That said, one way (using Perl) would be, by example:

# perl -ne '$region=substr($_,2,15);print if ($region==1853208 or $region==1853210)' file

When using Perl (in lieu of 'awk'), things are zero-relative. Hence, character #2 in Perl would be character-3 in 'awk'.

If you post more specific match requirements we might compose a better approach.

Regards!

...JRF...
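
For illustration, characters 3 to 17 correspond to a zero-relative offset of 2 and a length of 15; applied to the sample line from the question:

# perl -le 'print substr("10000000000000666447024", 2, 15)'
000000000000666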
Henk Geurts
Esteemed Contributor

Re: selecting lines from huge files

Thanks.
Once again, I should make myself more clear.

I attached a short version of the "number" file (K_NO).
I would like each line of this file to be checked against each line of the other file. When it matches characters 3-17 of that other file -> print the complete line of that other file.
James R. Ferguson
Acclaimed Contributor

Re: selecting lines from huge files

Hi (again) Henk:

OK, here's another approach that adapts to your use of a second file to define the patterns to match:

# cat ./match.pl
#!/usr/bin/perl
use strict;
use warnings;
my @tokens;
my @strings;
die "Usage: $0 tokenfile file ...\n" unless @ARGV > 0;
my $tokenf = shift;
open( FH, "<", $tokenf ) or die "Can't open '$tokenf': $!\n";
chomp( @tokens = <FH> );
close FH;
push @strings, $_ for @tokens;
while (<>) {
    for my $match (@strings) {
        if (m/^.{2}$match/) {    #...adjust as needed
            print "$_";
            last;
        }
    }
}
1;


...run as:

# ./match.pl file_of_tokens file

That is, the "file_of_tokens" is your attachment of strings to be matched in "file".

Once again, you say position-3 and I counted that as position-2 (zero-relative), so you may need to adjust the code above as annotated.

Regards!

...JRF...
Dennis Handly
Acclaimed Contributor

Re: selecting lines from huge files

With such large files, you don't want to use "grep -f" nor for/while loops.

With such large files, you could consider sorting both files then doing a "merge" to do the selection. This would mean you would have to change your selection file to get the keys in the same columns.

Some other threads about large number of records:
http://forums.itrc.hp.com/service/forums/questionanswer.do?threadId=1110743
http://forums.itrc.hp.com/service/forums/questionanswer.do?threadId=1136435
http://forums.itrc.hp.com/service/forums/questionanswer.do?threadId=1165850

Or write a customized program to do what JRF's Perl script does.
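
For illustration, the sort/merge idea might look like the sketch below, built from the standard awk, sort(1) and join(1) tools (the names K_NO, datafile and matched are placeholders; it also assumes the data fields are separated by single blanks, since join re-delimits its output that way):

# prepend the key (columns 3-17) to each data line, sort both files
# on the key, then join them and strip the key off again
awk '{ print substr($0, 3, 15), $0 }' datafile | sort -k1,1 > data.sorted
sort K_NO > keys.sorted
join keys.sorted data.sorted | sed 's/^[^ ]* //' > matched

After the sorts, the join itself is a single linear pass, which avoids the nested-loop blowup.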

Ken Martin_3
Regular Advisor

Re: selecting lines from huge files


Give grep a try:

# skey = desired match, columns 3 to 17

skey="000000001853208"

# cat inputfile to grep:
# ^       beginning of line
# ..      any first two characters
# ${skey} what we are really looking for

cat inputfile |
grep "^..${skey}" > outputfile

Dennis Handly
Acclaimed Contributor

Re: selecting lines from huge files

>Ken: Give grep a try

If you read Henk's comments about 1 million lines and 1.4 million selections, and my reply and the URLs I provided, you really don't want to use grep -f. That's on the order of 1E12 compares (1,000,000 lines x 1,400,000 patterns = 1.4E12).
Ken Martin_3
Regular Advisor

Re: selecting lines from huge files

Dennis,

Yes, I see your point.

Now, thinking back, I too had problems reading very large files, but I can't remember how I solved them.

Thanks
Fredrik.eriksson
Valued Contributor

Re: selecting lines from huge files

Hi,

Maybe "comm" could do this for you. But I'm guessing since you want to search and match specific types of lines you'll probably need to do some sort of regular expression.

sed and awk can do this, as can grep/egrep, but they're all quite "slow" at it when the files are this large.

If the differences between the files keep the output small, I would do something like this:
# comm -2 File1 File2 | egrep "[0]+[0-9]+[[3-9]|1[0-7]]$"

The regexp searches for anything that starts with one or more zeros, then one or more numeric values between 0-9. The last part is the magic, where it searches for a value between 3-17 (by saying that either 3-9 or 10-17 is okay). I haven't tested this, so I'm not sure it works :P Please correct me if I missed something.

Best regards
Fredrik Eriksson
Dennis Handly
Acclaimed Contributor

Re: selecting lines from huge files

>Fredrik: Maybe "comm" could do this for you.

Yes, if the files are sorted and have the same contents; neither is the case here.

>match specific types of lines you'll probably need to do some sort of regular expression.

These are unique keys. Unless you mean to use the RE to just shift the key position.

>awk can do this but ... quite "slow" in doing it when the files are so large.

You are confused. If you sort the two input files and reformat the records, it would be a simple linear pass.
I'm not sure how good awk's associative arrays are, but they may also work.

>value between 3-17

That was columns 3 through 17.
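
For illustration, the associative-array idea might look like the sketch below (again, K_NO, datafile and matched are placeholders, and the key is assumed to be the 15 characters in columns 3-17). The key file is loaded into an array in one pass; each data line then needs only a single lookup, with no sorting at all:

# first pass (NR == FNR) reads the key file into an array;
# second pass prints any data line whose columns 3-17 are in the array
awk 'NR == FNR { keys[$1] = 1; next }
     substr($0, 3, 15) in keys' K_NO datafile > matched

Whether this fits in memory depends on how awk's arrays cope with 1.4 million entries; the sort/merge sketch above trades that memory for the cost of the sorts.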