topic Re: how to take out duplicate ones and keep the sequences in the file in Operating System - HP-UX

how to take out duplicate ones and keep the sequences in the file

Hanry Zhou — Tue, 22 May 2007 14:23:17 GMT

I have a file that contains multiple items (words), one item per line. Some of the items are duplicated. I want to remove the repeated ones and leave one unique copy of each in the file. I also want to keep the original sequence of these items (which means I cannot use sort -u). How do I achieve that using ksh?

Thanks in advance

Re: how to take out duplicate ones and keep the sequences in the file

Robert-Jan Goossens — Tue, 22 May 2007 14:41:01 GMT

Hi Hanry,

What about the "uniq" command?

Robert-Jan

Re: how to take out duplicate ones and keep the sequences in the file

Ivan Ferreira — Tue, 22 May 2007 14:41:31 GMT

A script like this could do the job:

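OLDFILE=/tmp/original_file
NEWFILE=/tmp/new_file

# Start from an empty output file so a rerun does not see stale entries
: > "$NEWFILE"
# Read one line at a time so lines are not split on whitespace
while IFS= read -r LINE
do
# -F fixed string, -x whole line, -q quiet: status 0 if already present
grep -Fxq -e "$LINE" "$NEWFILE"
if [ $? -ne 0 ]
then
# The word is not in the new file yet
echo "$LINE" >> "$NEWFILE"
fi
done < "$OLDFILE"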

Re: how to take out duplicate ones and keep the sequences in the file

Ivan Ferreira — Tue, 22 May 2007 14:44:01 GMT

The problem with uniq is that the words must be sorted (duplicates adjacent to each other), and he does not want to change the order.
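
To make the adjacency point concrete, here is a small sketch. The sample words and the function name sample_words are made up for illustration; only standard printf, uniq and sort are used.

```shell
# Hypothetical sample data: "apple" and "banana" repeat,
# but not always on adjacent lines.
sample_words() {
    printf '%s\n' apple banana apple apple cherry banana
}

# uniq only collapses runs of identical adjacent lines, so this prints:
# apple, banana, apple, cherry, banana (one per line)
sample_words | uniq

# Sorting first makes duplicates adjacent, but the original order is lost:
# apple, banana, cherry (one per line)
sample_words | sort | uniq
```

In this sample the sorted result happens to look tidy, but in general the first-seen order of the input is gone, which is exactly what the original question rules out.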

Re: how to take out duplicate ones and keep the sequences in the file

James R. Ferguson — Tue, 22 May 2007 14:55:40 GMT

Hi Hanry:

# perl -ne 'push @list,$_ unless $found{$_}++;END{print for (@list)}' file

Regards!

...JRF...

Re: how to take out duplicate ones and keep the sequences in the file

A. Clay Stephenson — Tue, 22 May 2007 14:57:08 GMT

You have a rather annoying practice of not knowing how to do something and then specifying that it be done not only in the shell but in a particular shell. This task would be MUCH prettier and more elegant in Perl, but we can leverage sort, uniq, and grep -q to do what you want in the Korn shell.

----------------------------------------
#!/usr/bin/ksh

TDIR=${TMPDIR:-/var/tmp}
UNIQUES=${TDIR}/F${$}.uniq
DUPS=${TDIR}/F${$}.dup
TFILE=${TDIR}/F${$}.tmp

# Remove the temp files on exit or signal; -f keeps rm quiet if one is missing
trap 'rm -f ${UNIQUES} ${DUPS} ${TFILE}' 0 1 2 3 15

# Copy stdin to a temp file
rm -f ${TFILE} ${DUPS}
while read X
do
echo "${X}" >> ${TFILE}
done
# Sort temp file and find unique words
sort ${TFILE} | uniq -u > ${UNIQUES}
echo "\c" > ${DUPS} # null file
# Now read temp file; if word is unique echo it
cat ${TFILE} | while read X
do
# -F fixed string, -x whole line: avoid regex and substring matches
grep -q -F -x -e "${X}" ${UNIQUES}
STAT=${?}
if [[ ${STAT} -eq 0 ]]
then
echo "${X}"
else
# not found in Unique file; see if it is in dups
grep -q -F -x -e "${X}" ${DUPS}
STAT=${?}
if [[ ${STAT} -ne 0 ]]
then # not already written; echo to stdout and insert in dups file
echo "${X}"
echo "${X}" >> ${DUPS}
fi
fi
done
exit 0
-----------------------------------------

Use it like this:
removedups.sh < infile > outfile

What it does is first copy each line of stdin to a temporary file. Next, that temporary file is sorted and passed to uniq -u to create a second temporary file containing only the unique lines. Now we reread the first temporary file and use grep -q to determine if each line is unique. If so, we echo it to stdout. If not, we need to determine whether this is the first time that the duplicated word has been echoed. We use grep to examine a third temporary file to see if the word is found; if not, we echo the line to stdout and also append it to the third temporary file. When finished, a trap removes all the temporary files; your duplicates have been removed and the original order has been preserved.

NOTE: This still should have been done in Perl.

Re: how to take out duplicate ones and keep the sequences in the file

Hanry Zhou — Tue, 22 May 2007 15:23:49 GMT

uniq oldfile > newfile

that is it

Re: how to take out duplicate ones and keep the sequences in the file

James R. Ferguson — Tue, 22 May 2007 15:28:11 GMT

Hanry:

> uniq oldfile > newfile
> that is it

*NO* it's not, unless the input file is sorted.

...JRF...

Re: how to take out duplicate ones and keep the sequences in the file

Hanry Zhou — Tue, 22 May 2007 15:48:43 GMT

I don't need these items in the file to be sorted, so uniq command should work.

Re: how to take out duplicate ones and keep the sequences in the file

Hanry Zhou — Tue, 22 May 2007 15:50:52 GMT

Hi James,

You are right. I just found out why I cannot use "uniq". Thanks.

Re: how to take out duplicate ones and keep the sequences in the file

Sandman! — Tue, 22 May 2007 15:55:29 GMT

For uniq(1) to work, the repeated lines need to be adjacent. Sorting the file first to make them adjacent will not preserve the original order of the items in the input file. See the man page of uniq(1) for details. The awk construct below might work, so give it a try:

# awk '{x[$1]++;if(x[$1]==1) print $1}' inputfile

~cheers
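
A note on the awk line above: it keys on $1 and prints $1, which is fine for one word per line, but lines with several fields would be compared and printed by their first field only. A minimal sketch of a whole-line variant, assuming a POSIX awk; the function name dedupe_keep_order and the array name seen are illustrative.

```shell
# dedupe_keep_order [FILE...] : print each distinct line once, in first-seen order.
# (hypothetical helper name; reads stdin when no file is given)
dedupe_keep_order() {
    # seen[$0]++ is 0 (false) the first time a whole line appears, so
    # !seen[$0]++ is true only then and awk's default action prints the line.
    awk '!seen[$0]++' "$@"
}

# Example with made-up input; prints "red fox", "blue jay", "red hen".
# With $1 as the key, "red hen" would have been dropped as a duplicate of "red".
printf '%s\n' 'red fox' 'blue jay' 'red fox' 'red hen' | dedupe_keep_order
```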

Re: how to take out duplicate ones and keep the sequences in the file

drb_1 — Wed, 23 May 2007 08:17:22 GMT

Though I personally prefer a 1-line perl for such, I was intrigued to discover how easily this could be done in shell.

cat test.words |
grep -n '.*' |
sort -u -t: -k2 |
sort -t: -1n |
cut -d: -f2- > test.words.sansdupes

1. Prefix a line number and : to each line
2. Sort by remainder of line and remove dupes.
3. Sort by line number
4. Remove line number

Interesting,
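
The four steps above can be sketched as a small function. One assumption on my part: as far as I know, which line sort -u keeps out of a set of equal keys is not specified, so this variant does not rely on it. It sorts by text and then by line number, and lets awk keep the first line of each group. The function name dedupe_via_sort is hypothetical, and LC_ALL=C is my addition to keep the comparison byte-wise.

```shell
# dedupe_via_sort [FILE...] : number, sort, drop repeats, restore order, strip numbers.
# (hypothetical helper name; reads stdin when no file is given)
dedupe_via_sort() {
    awk '{ print NR ":" $0 }' "$@" |       # 1. prefix "lineno:" to each line
    LC_ALL=C sort -t: -k2 -k1,1n |         # 2. group equal text, lowest line number first
    awk '{ text = substr($0, index($0, ":") + 1) }
         text != prev || NR == 1 { print; prev = text }' |   # keep first line of each group
    LC_ALL=C sort -t: -k1,1n |             # 3. back to original line order
    cut -d: -f2-                           # 4. remove the line number
}

# Example with made-up input; prints b, a, c (one per line)
printf '%s\n' b a b c a | dedupe_via_sort
```

Because the text key runs from the second field to the end of the line, lines that themselves contain ":" should still be compared as a whole.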

Re: how to take out duplicate ones and keep the sequences in the file

Dennis Handly — Thu, 24 May 2007 02:09:51 GMT

>drb: 1. Prefix a line number and : to each line

Yes, that's how I would do it. Except you can refine your steps:
$ nl -ba -s: -nrz test.words | sort -t: -u -k2,2 | sort -t: -n -k1,1 |
cut -d: -f2- > test.words.sansdupes

I'm not sure why you had sort -1n? It worked but you would be hard pressed to prove it was legal from sort(1).

The problem with Ivan's and Clay's solutions is that they will be really slow if there are lots of lines, because each line is searched for against all the others.

>Clay: # Copy stdin to a temp file

This can be done with cat - > file

>echo "\c" > ${DUPS} # null file

This can be done with just: > ${DUPS}

> grep -q "${X}" ${UNIQUES}

The only advantage over Ivan's is that the uniques file is smaller.

Sandman's solution trades off memory for speed, so would be good for small files.