Re: Check input file rows present or not present in output file

sathis kumar · ‎07-27-2009

Hello,

I have an input file with 5 lines as below:
B 412648-B21 20090701 TRUDIN QBL1
B 412648-B21 20090701 WAECDF QBL1
B 412648-B21 20090701 ZARDDF QBL1
B 412648-B21 20090701 ZARDDP QBL1
B 412648-B21 20090701 ZAUDDF QBL1

I have an o/p file with 5 lines as below:
B 412648-B21 20090701 TRUDIN +0000000000000255.00 F 20090701 01 A QBL1
B 412648-B21 20090701 WAECDF +0000000000000195.00 F 20090701 01 A QBL1
B 412648-B21 20090701 ZARDDF +0000000000000000.00 N A QBL1
B 412648-B21 20090701 ZARDDP +0000000000001710.00 F 20090701 01 A QBL1
B 412648-B21 20090701 ZAUDDF +0000000000000245.00 F 20090701 01 A QBL1

I need to find out whether the lines (data) in the input file are present in the output file. Could you please let me know how this can be done without hurting performance (the check should not take long to run)?

Note: only first 1-32 characters of the line (input file) needs to be checked with output file

Regards,
Sathish

James R. Ferguson · ‎07-27-2009

Hi Sathish:

> only first 1-32 characters of the line (input file) needs to be checked with output file

And in your sample input that would span the beginning of the line through the "L" character in the last field. Is that correct?

What if your output file had a record like:

B 412648-B21 20090701 ZAUDDF +0000000000000245.00 F 20090701 01 A XXX2

Would that be considered a match or not? (I would think not).

Regards!

...JRF...

James R. Ferguson · ‎07-27-2009

Hi (again):

By the way, Sathish:

You have unevaluated answers to many of your questions, including, but not limited to, your most recent two:

http://forums11.itrc.hp.com/service/forums/questionanswer.do?threadId=1357438

http://forums11.itrc.hp.com/service/forums/questionanswer.do?threadId=1304828

It would be appropriate to follow these guidelines:

http://forums11.itrc.hp.com/service/forums/helptips.do?#28

Regards!

...JRF...

sathis kumar · ‎07-27-2009

I am sorry for stating the field position incorrectly.

The comparison should run to the end of the 4th field in each line, i.e.:
B 412648-B21 20090701 TRUDIN
B 412648-B21 20090701 WAECDF
B 412648-B21 20090701 ZARDDF
B 412648-B21 20090701 ZARDDP
B 412648-B21 20090701 ZAUDDF

-sathish

James R. Ferguson · ‎07-27-2009

Hi:

You could do something like this.

Snip the last field from your input file:

# awk '{$NF="";print}' inputfile > tokenfile

Then use the file of tokens to match your output:

# grep -Ff tokenfile outputfile
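
As a hedged illustration of the two steps above, the sketch below builds small sample files in a scratch directory and runs the same awk and grep commands against them. The scratch directory, the variable names `work` and `matched`, and the XXXXXX record are assumptions made up for the demo. Note that `$NF=""` leaves a trailing space on each token, which here lines up with the space before the amount field in the output file.

```shell
#!/bin/sh
# Scratch directory for the demo files (hypothetical location).
work=$(mktemp -d)

# Input file: four key fields plus a trailing code.
cat > "$work/inputfile" <<'EOF'
B 412648-B21 20090701 TRUDIN QBL1
B 412648-B21 20090701 WAECDF QBL1
B 412648-B21 20090701 ZARDDF QBL1
EOF

# Output file: ZARDDF is deliberately absent, XXXXXX is an extra record.
cat > "$work/outputfile" <<'EOF'
B 412648-B21 20090701 TRUDIN +0000000000000255.00 F 20090701 01 A QBL1
B 412648-B21 20090701 WAECDF +0000000000000195.00 F 20090701 01 A QBL1
B 412648-B21 20090701 XXXXXX +0000000000000245.00 F 20090701 01 A QBL1
EOF

# Step 1: blank the last field so only the key part of each line remains.
awk '{$NF=""; print}' "$work/inputfile" > "$work/tokenfile"

# Step 2: print the output-file lines that contain any token as a fixed string.
matched=$(grep -Ff "$work/tokenfile" "$work/outputfile")
printf '%s\n' "$matched"
```

With this sample data the TRUDIN and WAECDF records are printed. The XXXXXX record is not, because no token matches it.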

Regards!

...JRF...

sathis kumar · ‎07-27-2009

Thanks James.

I need to print the missing lines that are available in input file but not in output file.

Could you please let me know how can this be done (with the faster execution time) ?

-sathish

James R. Ferguson · ‎07-27-2009

Hi (again) Sathish:

> I need to print the missing lines that are available in input file but not in output file.

# grep -v -Ff tokenfile outputfile

> Could you please let me know how can this be done (with the faster execution time) ?

'grep' is going to be quite fast unless you have very large input files (with tokens to be matched).

Regards!

...JRF...

sathis kumar · ‎07-27-2009

Thanks for your help.

I would need to use an input file with 50,000 no of lines. Just wanted to check if there are any other options (apart from grep) can be used to make it much faster?

James R. Ferguson · ‎07-27-2009

Hi Sathish:

> I would need to use an input file with 50,000 no of lines. Just wanted to check if there are any other options (apart from grep) can be used to make it much faster?

The question really becomes how often are you doing these comparisons? Are you really matching 50,000 tokens to N-many lines?

I might guess that given definitive information about the _scale_ of the input and output we might craft a faster solution than the 'grep' offering I have made.

Regards!

...JRF...

Dennis Handly · ‎07-27-2009

>I would need to use an input file with 50,000 # of lines.

For a large number of lines, sorting the two files may make it faster. But since your two files don't match exactly, you can't directly use comm(1). You would have to use awk, perl or a program.
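
One possible workaround, sketched here only as an illustration, is to cut both files down to the common key first and then hand the key lists to comm(1). This assumes the key is the first four whitespace-separated fields, as clarified earlier in the thread. The file names, the `work` directory and the `only_in_input` variable are hypothetical. It reports which keys are missing but does not carry the rest of the record along.

```shell
#!/bin/sh
# Scratch directory for the demo files (hypothetical location).
work=$(mktemp -d)

cat > "$work/inputfile" <<'EOF'
B 412648-B21 20090701 TRUDIN QBL1
B 412648-B21 20090701 WAECDF QBL1
B 412648-B21 20090701 ZARDDF QBL1
EOF

cat > "$work/outputfile" <<'EOF'
B 412648-B21 20090701 TRUDIN +0000000000000255.00 F 20090701 01 A QBL1
B 412648-B21 20090701 ZARDDF +0000000000000000.00 N A QBL1
EOF

# Reduce every line in both files to its 4-field key and sort the result,
# because comm expects sorted input.
awk '{print $1, $2, $3, $4}' "$work/inputfile"  | sort > "$work/in.keys"
awk '{print $1, $2, $3, $4}' "$work/outputfile" | sort > "$work/out.keys"

# comm -23 keeps lines found only in the first file:
# input keys that never appear in the output file.
only_in_input=$(comm -23 "$work/in.keys" "$work/out.keys")
printf '%s\n' "$only_in_input"
```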

If the input file fits in memory, you could read it into a map/hash, then compare the output file.

Note: With the grep -v or the map solution, you can only determine if lines in the output file aren't in the input file. But not easily if the lines in the input file are missing in output.
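
A minimal sketch of the map/hash idea, assuming the key is the first four whitespace-separated fields and that the input file fits in memory. The file names, the `work` directory, the `pending` array and the `key()` helper are illustrative assumptions. One way to get the other direction as well is to delete each key as it is seen in the output file: whatever remains in the hash at the end is an input line that never showed up in the output. A `for (k in ...)` loop returns keys in unspecified order, so the result is sorted afterwards.

```shell
#!/bin/sh
# Scratch directory for the demo files (hypothetical location).
work=$(mktemp -d)

cat > "$work/inputfile" <<'EOF'
B 412648-B21 20090701 TRUDIN QBL1
B 412648-B21 20090701 WAECDF QBL1
B 412648-B21 20090701 ZARDDF QBL1
EOF

cat > "$work/outputfile" <<'EOF'
B 412648-B21 20090701 TRUDIN +0000000000000255.00 F 20090701 01 A QBL1
B 412648-B21 20090701 ZARDDF +0000000000000000.00 N A QBL1
EOF

# First file (NR == FNR): remember every input line under its 4-field key.
# Second file: delete each key that shows up in the output file.
# END: whatever is left was never seen in the output file.
missing=$(awk '
    function key() { return $1 " " $2 " " $3 " " $4 }
    NR == FNR { pending[key()] = $0; next }
    { delete pending[key()] }
    END { for (k in pending) print pending[k] }
' "$work/inputfile" "$work/outputfile" | sort)

printf '%s\n' "$missing"
```

This makes a single pass over each file, so the cost grows roughly with the total number of lines rather than with tokens times lines.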

sathis kumar · ‎07-29-2009

Thanks for all your help.

Please find below the exact requirement that we have:

My i/p file looks like:
B L1983A B1N 20090701 HUECDP QBLH
B L1983A B1N 20090701 HUHFDP QBL1

My o/p file looks like:
B L1983A B1N 20090701 HUECDP +0000000000000000.00 F 20090701 01 A QBL1
B L1983A B1N 20090701 HUHFDP +0000000000000000.00 F 20090701 01 A QBL1

1) I need to compare the lines in the i/p file (characters 1-38) with the o/p file. Where they match, I need to replace the last field value in the o/p line with the corresponding one from the i/p file.

i.e. the above output should change to:
B L1983A B1N 20090701 HUECDP +0000000000000000.00 F 20090701 01 A QBLH
B L1983A B1N 20090701 HUHFDP +0000000000000000.00 F 20090701 01 A QBL1

Note that the last field in the first line changed from QBL1 to QBLH (the same as the one in the i/p file).

2) If some lines present in the i/p file are missing in the o/p file, then those lines need to be captured in a new file.

Note: We might need to do the testing with 5,000, 10,000, 20,000 and even 50,000 lines too. Hence need to check the performance of the script execution also.

Dennis Handly · ‎07-29-2009

>Hence need to check the performance of the script execution also.

Are your files sorted? If not, do you care if the output is sorted?

A close upper bound on the time would be to sort both files.

sathis kumar · ‎07-29-2009

Yes, the files are sorted

Dennis Handly · ‎07-29-2009

>the files are sorted

Then this is a simple no brainer and the performance is linear. Just do a "simple merge" and compare the records.
Probably easy to do in C or perl. Only a little harder in awk, since two input and two output files.

Hein van den Heuvel · ‎07-29-2009

If the records are sorted and in the same order, there is not even a need to do the compare. You could just use:

$ awk '{new = $NF; getline < "b.txt"; regexp = $NF "$"; sub(regexp,new); print}' a.txt
B L1983A B1N 20090701 HUECDP +0000000000000000.00 F 20090701 01 A QBLH
B L1983A B1N 20090701 HUHFDP +0000000000000000.00 F 20090701 01 A QBL1

That remembers the last field of the line from the first file, reads the other file, replaces its last field with the one from the first and prints.

If the lines are sorted but potentially NOT equal, then you will need to add some code to read ahead in whichever file has fallen behind until it has caught up.

if file a is
10
12

and file b is
10
11
12

then the program has to skip that line 11 from b.

if file a is
10
11
13
and file b is
10
12
13
then the program needs to skip a line from each file before processing.

Below is an example of how to solve such a problem in awk.
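
The catch-up logic can be sketched on the second pair of toy files above (a = 10, 11, 13 and b = 10, 12, 13). This is only an illustration, assuming both files are sorted on the key and hold unique keys. The `work` directory, the `bfile` variable, the `next_b()` helper and the report wording are made up for the demo. File a is the main input, file b is read with getline, and whichever side holds the smaller key is the one that gets skipped.

```shell
#!/bin/sh
# Scratch directory for the demo files (hypothetical location).
work=$(mktemp -d)
printf '10\n11\n13\n' > "$work/a.txt"
printf '10\n12\n13\n' > "$work/b.txt"

report=$(awk -v bfile="$work/b.txt" '
    # Read the next record of b into bkey; set b_eof once b is exhausted.
    function next_b() { if ((getline bkey < bfile) <= 0) b_eof = 1 }
    BEGIN { next_b() }
    {
        # b has fallen behind: skip its records until it catches up.
        while (!b_eof && bkey < $0) { print "skip b " bkey; next_b() }
        if (!b_eof && bkey == $0) {
            print "match " $0
            next_b()
        } else {
            # No partner in b for this record of a.
            print "skip a " $0
        }
    }
    END {
        # Anything still unread in b has no partner in a.
        while (!b_eof) { print "skip b " bkey; next_b() }
    }
' "$work/a.txt")

printf '%s\n' "$report"
```

For these two files the report reads: match 10, skip a 11, skip b 12, match 13.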

Note, I used 28 instead of 38 in the example, because that's how the data showed up in the forum, and while you indicated 4 fields, you actually showed 5, so that's not to be trusted either.

Also please note how you wasted James's time by being imprecise initially.
You did NOT just need to find matching lines... for which GREP is perfect, but you also needed data from EACH provided file, for which GREP is useless.

hope this helps,
Hein.

-------------- update.awk ----------------
BEGIN { a_skip = b_skip = c_lines = 0 }
{
    # Key and last field of the current record from file a (stdin).
    a_match = substr($0, 1, 28)
    a_last = $NF
    # Advance whichever file has fallen behind until the keys agree.
    while (a_match != b_match) {
        if (a_match > b_match) {
            b_skip++
            if ((getline < "b.txt") != 1) { exit }
            b_match = substr($0, 1, 28)
            b_last = $NF
            c = $0
        }
        if (a_match < b_match) {
            a_skip++
            if (getline != 1) { exit }
            a_match = substr($0, 1, 28)
            a_last = $NF
        }
    }
    # Replace the last field of the b record with the one from a.
    regexp = b_last "$"
    sub(regexp, a_last, c)
    print c
    c_lines++
    b_skip--    # the matched b record was read, not skipped
}
END {
    print c_lines " printed to C. " a_skip, " skipped from a, ", b_skip " from b." > "/dev/stderr"
}
-------------- sample execution ----------

/cygdrive/c/temp
$ awk -f update.awk < a.txt > c.txt
2 printed to C. 1 skipped from a, 0 from b.

/cygdrive/c/temp
$ cat c.txt
B L1983A B1N 20090701 HUECDP +0000000000000000.00 F 20090701 01 A QBLH
B L1983A B1N 20090701 HUHFDP +0000000000000000.00 F 20090701 01 A QBL1

Dennis Handly · ‎07-29-2009

>Hein: and in the same order

Sathis said lines could be missing.

>for which GREP is perfect

grep might be terrible for 50 K records.

Here is my awk merge example with checking:

awk -v file=i_file -v err_file=err.out '
BEGIN { save = ""; EOF = 0 }
{
    # Fetch the next i_file record unless one is already pending in save.
    if (save == "") {
        if (EOF || (getline save < file) <= 0) {
            print "Missing in I file:", $0 > err_file
            EOF = 1
            save = ""
            next
        }
    }
    # i_file records that sort before this o_file record have no partner.
    while (substr(save, 1, 28) < substr($0, 1, 28)) {
        print "Missing in O file:", save > err_file
        if ((getline save < file) <= 0) {
            print "Missing in I file:", $0 > err_file
            EOF = 1
            save = ""
            next
        }
    }

    # Keys match: take the last field from the i_file record.
    if (substr(save, 1, 28) == substr($0, 1, 28)) {
        $NF = substr(save, 30)
        print $0
        save = ""
        next
    }
    print "Missing in I file:", $0 > err_file
}
END {
    # Report the pending record and whatever is left unread in i_file.
    if (save != "")
        print "Missing in O file:", save > err_file
    while ((getline save < file) > 0) {
        print "Missing in O file:", save > err_file
    }
} ' o_file
