Operating System - HP-UX

Re: how to take out duplicate ones and keep the sequences in the file

 
SOLVED
Hanry Zhou
Super Advisor

how to take out duplicate ones and keep the sequences in the file

I have a file that contains multiple items (words), one item per line. Some of the items are duplicated; I want to remove the repeated ones and keep only one unique copy of each in the file. I also want to preserve the original sequence of the items (which means I cannot use sort -u). How do I achieve that using ksh?

Thanks in advance
12 REPLIES
Robert-Jan Goossens
Honored Contributor

Re: how to take out duplicate ones and keep the sequences in the file

Hi Hanry,

What about the "uniq" command?

Robert-Jan
Ivan Ferreira
Honored Contributor

Re: how to take out duplicate ones and keep the sequences in the file

A script like this could do the job:

OLDFILE=/tmp/original_file
NEWFILE=/tmp/new_file

> $NEWFILE   # start with an empty file
while read LINE
do
    EXISTS=`grep -w "$LINE" $NEWFILE | wc -l`
    if [ $EXISTS -eq 0 ]
    then
        # The word is not in the new file yet
        echo "$LINE" >> $NEWFILE
    fi
done < $OLDFILE
Por que hacerlo dificil si es posible hacerlo facil? - Why do it the hard way, when you can do it the easy way?
Ivan Ferreira
Honored Contributor

Re: how to take out duplicate ones and keep the sequences in the file

The problem with uniq is that the lines must be sorted (duplicates adjacent), and he does not want to change the order.
Por que hacerlo dificil si es posible hacerlo facil? - Why do it the hard way, when you can do it the easy way?
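A quick demonstration of the point above (the file name is just for illustration):

```shell
# Sample file with a non-adjacent duplicate
printf 'dog\ncat\ndog\n' > /tmp/words.txt

# uniq only collapses *adjacent* repeats, so the second "dog" survives
uniq /tmp/words.txt
# dog
# cat
# dog

# sort -u does remove it, but the original order is lost
sort -u /tmp/words.txt
# cat
# dog
```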
James R. Ferguson
Acclaimed Contributor

Re: how to take out duplicate ones and keep the sequences in the file

Hi Hanry:

# perl -ne 'push @list,$_ unless $found{$_}++;END{print for (@list)}' file

Regards!

...JRF...
A. Clay Stephenson
Acclaimed Contributor

Re: how to take out duplicate ones and keep the sequences in the file

You have a rather annoying practice of not knowing how to do something and then specifying that it be done not only in the shell but in a particular shell. This task would be MUCH prettier and more elegant in Perl, but we can leverage sort, uniq, and grep -q to do what you want in the Korn shell.

----------------------------------------
#!/usr/bin/ksh


TDIR=${TMPDIR:-/var/tmp}
UNIQUES=${TDIR}/F${$}.uniq
DUPS=${TDIR}/F${$}.dup
TFILE=${TDIR}/F${$}.tmp

trap 'eval rm -r ${UNIQUES} ${DUPS} ${TFILE}' 0 1 2 3 15

# Copy stdin to a temp file
rm -f ${TFILE} ${DUPS}
while read X
do
    echo "${X}" >> ${TFILE}
done
# Sort temp file and find unique words
sort ${TFILE} | uniq -u > ${UNIQUES}
echo "\c" > ${DUPS} # null file
# Now read temp file; if word is unique echo it
cat ${TFILE} | while read X
do
    grep -q "${X}" ${UNIQUES}
    STAT=${?}
    if [[ ${STAT} -eq 0 ]]
    then
        echo "${X}"
    else
        # not found in Unique file; see if it is in dups
        grep -q "${X}" ${DUPS}
        STAT=${?}
        if [[ ${STAT} -ne 0 ]]
        then # not already written; echo to stdout and insert in dups file
            echo "${X}"
            echo "${X}" >> ${DUPS}
        fi
    fi
done
exit 0
-----------------------------------------

Use it like this:
removedups.sh < infile > outfile

What it does is first copy each line of stdin to a temporary file. Next, that temporary file is sorted and passed to uniq -u to create a second temporary file containing only unique lines. Now we reread the temporary file and use grep -q to determine if the line is unique. If so, we echo it to stdout. If not, we need to determine whether this is the first time that duplicate word has been echoed. We use grep to examine a third temporary file to see if the word is found; if not, we echo the line to stdout and also append it to the third temporary file. When finished, a trap removes all the temporary files, your duplicates have been removed, and the original order has been preserved.

NOTE: This still should have been done in Perl.
If it ain't broke, I can fix that.
Hanry Zhou
Super Advisor

Re: how to take out duplicate ones and keep the sequences in the file

uniq oldfile > newfile

that is it
James R. Ferguson
Acclaimed Contributor

Re: how to take out duplicate ones and keep the sequences in the file

Hanry:

> uniq oldfile > newfile

that is it

*NO* it's not, unless the input file is sorted.

...JRF...
Hanry Zhou
Super Advisor

Re: how to take out duplicate ones and keep the sequences in the file

I don't need the items in the file to be sorted, so the uniq command should work.
Hanry Zhou
Super Advisor

Re: how to take out duplicate ones and keep the sequences in the file

Hi James,

You are right. I just found out why I cannot use "uniq". Thanks.
Sandman!
Honored Contributor
Solution

Re: how to take out duplicate ones and keep the sequences in the file

For uniq(1) to work the repeated lines need to be adjacent, and sorting the file first to make them adjacent would not preserve the original order of the items. See the man page of uniq(1) for details. The awk construct below might work so give it a try:

# awk '{x[$1]++;if(x[$1]==1) print $1}' inputfile

~cheers
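A quick sanity check of the awk construct with made-up data (one word per line, as in the original question):

```shell
# Sample input with duplicates out of order
printf 'red\nblue\nred\ngreen\nblue\n' > /tmp/items.txt

# x[$1] counts occurrences of the first field; a line is printed
# only on its first occurrence, so the original order is preserved
awk '{x[$1]++;if(x[$1]==1) print $1}' /tmp/items.txt
# red
# blue
# green
```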
drb_1
Occasional Advisor

Re: how to take out duplicate ones and keep the sequences in the file

Though I personally prefer a 1-line perl for such, I was intrigued to discover how easily this could be done in shell.

cat test.words |
grep -n '.*' |
sort -u -t: -k2 |
sort -t: -1n |
cut -d: -f2- > test.words.sansdupes

1. Prefix a line number and : to each line
2. Sort by remainder of line and remove dupes.
3. Sort by line number
4. Remove line number
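The four steps can be traced with a tiny sample. This sketch uses the standard sort -k syntax for step 3, and relies on the sort being stable (as GNU sort's is) so that the first occurrence of each duplicate is the one kept; the file name is an assumption:

```shell
printf 'one\ntwo\none\nthree\n' > /tmp/test.words

grep -n '.*' /tmp/test.words |  # 1. prefix each line with "NUMBER:"
  sort -u -t: -k2 |             # 2. sort on the text part, dropping duplicates
  sort -t: -n -k1,1 |           # 3. re-sort numerically on the line number
  cut -d: -f2-                  # 4. strip the "NUMBER:" prefix
# one
# two
# three
```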

Interesting,
Dennis Handly
Acclaimed Contributor

Re: how to take out duplicate ones and keep the sequences in the file

>drb: 1. Prefix a line number and : to each line

Yes, that's how I would do it. Except you can refine your steps:
$ nl -ba -s: -nrz test.words | sort -t: -u -k2,2 | sort -t: -n -k1,1 |
cut -d: -f2- > test.words.sansdupes

I'm not sure why you had sort -1n? It worked but you would be hard pressed to prove it was legal from sort(1).

The problem with Ivan's and Clay's solutions is that they will be really slow if there are lots of lines, because each line is searched against all the others.

>Clay: # Copy stdin to a temp file

This can be done with cat - > file

>echo "\c" > ${DUPS} # null file

This can be done with just: > ${DUPS}

> grep -q "${X}" ${UNIQUES}

The only advantage over Ivan's is that the uniques file is smaller.

Sandman's solution trades off memory for speed, so would be good for small files.