Operating System - HP-UX

Swetha reddy
Occasional Contributor

Need awk script for below requirements

1|222|22|45|99
1|221|33|33|88
6|333|21|65|12

I need to check for duplicate rows (keys: 1st column, 5th column).

Output:

Number of rows=3
Number of duplicate rows= 1

Note:

The code should be able to handle millions of records.
RAC_1
Honored Contributor

Re: Need awk script for below requirements

I don't follow. What exactly do you want? If two rows have the same value in the first column, should they be treated as duplicate entries?
There is no substitute to HARDWORK
Peter Godron
Honored Contributor

Re: Need awk script for below requirements

Swetha,
cut -d'|' -f1 data.lis | uniq -c
would return:
2 1
1 6

which translates into:
2 records with a key of value 1
1 record with a key of 6

Victor Fridyev
Honored Contributor

Re: Need awk script for below requirements

Hi,

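If you really need to count duplicates in the first column, the correct code is
cut -d'|' -f1 data.lis | sort | uniq -c
(uniq only collapses adjacent lines, so the input has to be sorted first.)

If you need to check for duplicates across more than one column, then

awk -F'|' '{printf("%s|%s\n",$1,$5)}' data.lis | sort | uniq -c

Note that the field separator must be quoted, otherwise the shell treats the | as a pipe. The two fields are also printed with a | between them so that, for example, 1 and 19 cannot be confused with 11 and 9.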
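If sorting millions of records is a concern, the counting can also be done inside awk itself. The following is only a minimal sketch, assuming the key is the pair of columns 1 and 5. The sample rows and the variable names (sample, result, count, part) are made up for illustration, and every distinct key is held in memory.

```shell
# Made-up sample rows: rows 1 and 2 share the key (1, 99);
# rows 3 and 4 have keys (1, 19) and (11, 9), which must NOT collide.
sample='1|222|22|45|99
1|221|33|33|99
1|333|21|65|19
11|444|21|65|9'

# count[$1, $5] joins the two fields with SUBSEP, so the keys stay distinct.
# Only keys that occur more than once are printed, in "count key" form.
result=$(printf '%s\n' "$sample" | awk -F'|' '
    { count[$1, $5]++ }
    END {
        for (k in count)
            if (count[k] > 1) {
                split(k, part, SUBSEP)
                print count[k], part[1] "|" part[2]
            }
    }')

echo "$result"
```

With more than one repeated key, the order of the output lines is not defined by awk, so pipe the result through sort if the order matters.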

HTH
Entities are not to be multiplied beyond necessity - RTFM
Peter Godron
Honored Contributor

Re: Need awk script for below requirements

Victor,
I assumed the files were already sorted, based on a previous thread by the same poster:
http://forums1.itrc.hp.com/service/forums/questionanswer.do?threadId=1018063
Ninad_1
Honored Contributor

Re: Need awk script for below requirements

Swetha,

It is not very clear what you treat as a duplicate when you say "keys 1 column ,5 th column". Do you mean
the 1st and 5th columns, OR
the 1st to 5th columns (a row is a duplicate only when all fields are the same), OR
the 1st column only?

If it's the 1st column only, then the simplest thing I can think of is

norows=$(wc -l t1.dat | awk '{print $1}')
duplicate=$(echo "$norows - $(cut -f 1 -d "|" t1.dat | sort -u | wc -l | awk '{print $1}')" | bc)
echo "Number of rows=$norows"
echo "Number of duplicate rows=$duplicate"
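The same two numbers can also be obtained in a single awk pass, which reads the file only once. This is just a sketch, assuming the key is the 1st column only. The data is the sample from the question, and the variable names (sample, result, seen, dup) are illustrative.

```shell
# Sample data from the original question.
sample='1|222|22|45|99
1|221|33|33|88
6|333|21|65|12'

# seen[$1]++ is zero (false) the first time a key appears and non-zero
# afterwards, so dup is incremented once for every repeated key.
result=$(printf '%s\n' "$sample" | awk -F'|' '
    seen[$1]++ { dup++ }
    END {
        print "Number of rows=" NR
        print "Number of duplicate rows=" (dup + 0)
    }')

echo "$result"
```

For a real file, replace the printf pipeline with the file name as the last argument to awk. The seen array holds one entry per distinct key, so memory use grows with the number of distinct keys, not with the number of rows.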

Regards,
Ninad
Peter Nikitka
Honored Contributor

Re: Need awk script for below requirements

Hi,

Since your request is ambiguous, I'll make some assumptions:
- 'duplicate' means duplicate in col1 OR col5
- a value that occurs more than twice counts once for each extra occurrence
- data are in input file /tmp/data

sort -t'|' -k1n,1 -k5n /tmp/data |
awk -F'|' 'NR==1 {c1=$1;c5=$5; next}
{if($1==c1) dup1++; else c1=$1
if ($5==c5) dup5++; else c5=$5}
END {print "Number of rows",NR;print "Number of duplicate rows",dup1+dup5}'

The first line is handled specially; otherwise it would be reported as a duplicate if its first column were empty.
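A variant without the sort step is sketched below, under the same assumptions (a duplicate in col1 OR col5, and every extra occurrence is counted). It keeps the values in two awk arrays, so repeated col5 values are found even when they are not on adjacent lines. The array names seen1 and seen5 and the inline sample data are illustrative only.

```shell
# Sample data from the original question.
sample='1|222|22|45|99
1|221|33|33|88
6|333|21|65|12'

# One array per key column; a row can add to both counters
# if its col1 and its col5 have each been seen before.
result=$(printf '%s\n' "$sample" | awk -F'|' '
    seen1[$1]++ { dup1++ }
    seen5[$5]++ { dup5++ }
    END {
        print "Number of rows", NR
        print "Number of duplicate rows", dup1 + dup5
    }')

echo "$result"
```

No special case is needed for the first line here, because an empty first column is simply stored as an empty key. The trade-off is memory: both arrays grow with the number of distinct values.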

Kind regards, Peter
The Universe is a pretty big place, it's bigger than anything anyone has ever dreamed of before. So if it's just us, seems like an awful waste of space, right? Jodie Foster in "Contact"