Operating System - HP-UX

Re: how to take out duplicate ones and keep the sequences in the file

 
SOLVED
Hanry Zhou
Super Advisor

how to take out duplicate ones and keep the sequences in the file

I have a file that contains multiple items (words), one item per line. Some of the items are duplicated; I want to remove the repeated ones and keep only one unique copy of each in the file. I also want to preserve the original sequence of the items (which means I cannot use sort -u). How do I achieve that using ksh?

Thanks in advance
12 REPLIES
Robert-Jan Goossens
Honored Contributor

Re: how to take out duplicate ones and keep the sequences in the file

Hi Hanry,

What about the "uniq" command?

Robert-Jan
Ivan Ferreira
Honored Contributor

Re: how to take out duplicate ones and keep the sequences in the file

A script like this could do the job:

OLDFILE=/tmp/original_file
NEWFILE=/tmp/new_file

> $NEWFILE   # start with an empty file
while read LINE
do
    EXISTS=`grep -w "$LINE" $NEWFILE | wc -l`
    if [ $EXISTS -eq 0 ]
    then
        # The word is not in the new file yet
        echo "$LINE" >> $NEWFILE
    fi
done < $OLDFILE
Por que hacerlo dificil si es posible hacerlo facil? - Why do it the hard way, when you can do it the easy way?
Ivan Ferreira
Honored Contributor

Re: how to take out duplicate ones and keep the sequences in the file

The problem with uniq is that the lines must be sorted (duplicates adjacent), and he does not want to change the order.
Por que hacerlo dificil si es posible hacerlo facil? - Why do it the hard way, when you can do it the easy way?
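A quick demonstration of the point above (the file name is just for illustration):

```shell
# Sample file with a non-adjacent duplicate
printf 'dog\ncat\ndog\n' > /tmp/words.txt

# uniq only collapses *adjacent* repeats, so the second "dog" survives
uniq /tmp/words.txt
# dog
# cat
# dog

# sort -u does remove it, but the original order is lost
sort -u /tmp/words.txt
# cat
# dog
```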
James R. Ferguson
Acclaimed Contributor

Re: how to take out duplicate ones and keep the sequences in the file

Hi Hanry:

# perl -ne 'push @list,$_ unless $found{$_}++;END{print for (@list)}' file

Regards!

...JRF...
A. Clay Stephenson
Acclaimed Contributor

Re: how to take out duplicate ones and keep the sequences in the file

You have a rather annoying practice of not knowing how to do something and then specifying that it be done not only in the shell but in a particular shell. This task would be MUCH prettier and more elegant in Perl, but we can leverage sort, uniq, and grep -q to do what you want in the Korn shell.

----------------------------------------
#!/usr/bin/ksh


TDIR=${TMPDIR:-/var/tmp}
UNIQUES=${TDIR}/F${$}.uniq
DUPS=${TDIR}/F${$}.dup
TFILE=${TDIR}/F${$}.tmp

trap 'eval rm -r ${UNIQUES} ${DUPS} ${TFILE}' 0 1 2 3 15

# Copy stdin to a temp file
rm -f ${TFILE} ${DUPS}
while read X
do
    echo "${X}" >> ${TFILE}
done
# Sort temp file and find unique words
sort ${TFILE} | uniq -u > ${UNIQUES}
echo "\c" > ${DUPS} # null file
# Now read temp file; if word is unique echo it
cat ${TFILE} | while read X
do
    grep -q "${X}" ${UNIQUES}
    STAT=${?}
    if [[ ${STAT} -eq 0 ]]
    then
        echo "${X}"
    else
        # not found in Unique file; see if it is in dups
        grep -q "${X}" ${DUPS}
        STAT=${?}
        if [[ ${STAT} -ne 0 ]]
        then # not already written; echo to stdout and insert in dups file
            echo "${X}"
            echo "${X}" >> ${DUPS}
        fi
    fi
done
exit 0
-----------------------------------------

Use it like this:
removedups.sh < infile > outfile

What it does is first copy each line of stdin to a temporary file. Next, that temporary file is sorted and passed to uniq -u to create a second temporary file containing only unique lines. Now we reread the temporary file and use grep -q to determine if the line is unique. If so, we echo it to stdout. If not, we need to determine whether this is the first time that duplicate word has been echoed. We use grep to examine a third temporary file to see if the word is found; if not, we echo the line to stdout and also append it to the third temporary file. When finished, a trap removes all the temporary files, your duplicates have been removed, and the original order has been preserved.

NOTE: This still should have been done in Perl.
If it ain't broke, I can fix that.
Hanry Zhou
Super Advisor

Re: how to take out duplicate ones and keep the sequences in the file

uniq oldfile > newfile

that is it
James R. Ferguson
Acclaimed Contributor

Re: how to take out duplicate ones and keep the sequences in the file

Hanry:

> uniq oldfile > newfile

that is it

*NO* it's not, unless the input file is sorted.

...JRF...
Hanry Zhou
Super Advisor

Re: how to take out duplicate ones and keep the sequences in the file

I don't need the items in the file to be sorted, so the uniq command should work.
Hanry Zhou
Super Advisor

Re: how to take out duplicate ones and keep the sequences in the file

Hi James,

You are right. I just found out why I cannot use "uniq". Thanks.
Sandman!
Honored Contributor
Solution

Re: how to take out duplicate ones and keep the sequences in the file

For uniq(1) to work the repeated lines need to be adjacent, and sorting the file first to make them adjacent would not preserve the original order of the items. See the man page of uniq(1) for details. The awk construct below might work so give it a try:

# awk '{x[$1]++;if(x[$1]==1) print $1}' inputfile

~cheers
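A quick sanity check of the awk construct with made-up data (one word per line, as in the original question):

```shell
# Sample input with duplicates out of order
printf 'red\nblue\nred\ngreen\nblue\n' > /tmp/items.txt

# x[$1] counts occurrences of the first field; a line is printed
# only on its first occurrence, so the original order is preserved
awk '{x[$1]++;if(x[$1]==1) print $1}' /tmp/items.txt
# red
# blue
# green
```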
drb_1
Occasional Advisor

Re: how to take out duplicate ones and keep the sequences in the file

Though I personally prefer a 1-line perl for such, I was intrigued to discover how easily this could be done in shell.

cat test.words |
grep -n '.*' |
sort -u -t: -k2 |
sort -t: -1n |
cut -d: -f2- > test.words.sansdupes

1. Prefix a line number and : to each line
2. Sort by remainder of line and remove dupes.
3. Sort by line number
4. Remove line number
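The four steps can be traced with a tiny sample. This sketch uses the standard sort -k syntax for step 3, and relies on the sort being stable (as GNU sort's is) so that the first occurrence of each duplicate is the one kept; the file name is an assumption:

```shell
printf 'one\ntwo\none\nthree\n' > /tmp/test.words

grep -n '.*' /tmp/test.words |  # 1. prefix each line with "NUMBER:"
  sort -u -t: -k2 |             # 2. sort on the text part, dropping duplicates
  sort -t: -n -k1,1 |           # 3. re-sort numerically on the line number
  cut -d: -f2-                  # 4. strip the "NUMBER:" prefix
# one
# two
# three
```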

Interesting,
Dennis Handly
Acclaimed Contributor

Re: how to take out duplicate ones and keep the sequences in the file

>drb: 1. Prefix a line number and : to each line

Yes, that's how I would do it. Except you can refine your steps:
$ nl -ba -s: -nrz test.words | sort -t: -u -k2,2 | sort -t: -n -k1,1 |
cut -d: -f2- > test.words.sansdupes

I'm not sure why you had sort -1n? It worked but you would be hard pressed to prove it was legal from sort(1).

The problem with Ivan's and Clay's solutions is that they will be really slow if there are lots of lines, because each line is searched against all the others.

>Clay: # Copy stdin to a temp file

This can be done with cat - > file

>echo "\c" > ${DUPS} # null file

This can be done with just: > ${DUPS}

> grep -q "${X}" ${UNIQUES}

The only advantage over Ivan's is that the uniques file is smaller.

Sandman's solution trades off memory for speed, so would be good for small files.