topic Re: how to get count of repeated words in a flat file in Operating System - HP-UX

how to get count of repeated words in a flat file

Gopi Kishore m — Fri, 04 Mar 2011 12:55:45 GMT

I want to know how many times a particular word is repeated in a particular flat file.

I am using the following command

grep word textfile |wc -l

word is the desired word

textfile is the file I am searching.

But the above command does not give an exact count in some scenarios: if the word is repeated within a line, that line is counted only once. Please suggest.

Re: how to get count of repeated words in a flat file

James R. Ferguson — Fri, 04 Mar 2011 14:10:39 GMT

Hi:

$ perl -nle '$n++ while m{\bword\b}g;END{print $n}' file

...will look for the string "word" and count every instance in the file argument. Matches that begin at the start of a line or end at the end of a line, as well as matches in the middle of a line, are counted. If you substituted the string "words", only matches of "words", and not "word", would be found.

Regards!

...JRF...

Re: how to get count of repeated words in a flat file

Dennis Handly — Sat, 05 Mar 2011 02:46:12 GMT

You could first start with grep to find the lines then use tr(1) or sed(1) to split up the words into separate lines, then just count that:
grep word textfile | tr '[:space:]' '\012' | grep -c word
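A sketch of a stricter variant of that pipeline, assuming that only tokens exactly equal to the word should count. `grep -x` matches whole lines and `-F` treats the word as a fixed string rather than a pattern. The function name count_exact is made up here, and the `[\n*]` repeat form is the POSIX way to pad the second set for tr; check tr(1) on your release.

```shell
# count_exact WORD FILE
# Put every whitespace-separated token on its own line (-s squeezes runs of
# newlines), then count the lines that are exactly WORD.
count_exact() {
    tr -s '[:space:]' '[\n*]' < "$2" | grep -cxF -- "$1"
}
```

Called as `count_exact word textfile`. Tokens with punctuation attached, such as "word," or "'word'", are not counted by this version.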

Re: how to get count of repeated words in a flat file

Hein van den Heuvel — Sat, 05 Mar 2011 15:28:47 GMT

I like JRF's solution.

Pay close attention to the usage of the '\b' regular expression component, which takes no space itself but specifies a word boundary. Just what is needed here, it seems.

Applied to the topic text, it reports '5' as the count for the word 'word'; the word itself obviously needs to be changed, or become a variable, for real work.

Depending on exactly what problem you are trying to solve, it may be beneficial to just count all words and then address the selected words for further processing.

Here is a 'one-liner' to demonstrate that:

$ perl -nle '$w{$_}++ for (split) }{ for (sort {$w{$b}<=>$w{$a}} keys %w) { print qq($w{$_}\t$_)}' tmp.txt
5 the
5 word
4 a
4 is
3 in
2 file
2 repeated
2 textfile
:

As you see, it also reports 5 for the word 'word'

Enjoy,
Hein

Re: how to get count of repeated words in a flat file

Raj D. — Sat, 05 Mar 2011 20:44:14 GMT

Gopi,

$ awk '{for(i=1;i<=NF;++i) if($i~ "^word$") print $i}' textfile| wc -l

Enjoy, Have fun!,
Raj.

Re: how to get count of repeated words in a flat file

Hein van den Heuvel — Sat, 05 Mar 2011 21:10:32 GMT

Raj,
That is also a fine solution, but I don't understand why you opted for a pipe. I guess I will never understand the typical Unix thinking involved. I come from VMS land, where for the longest time we did not have pipes. When we got them, we understood the costs involved.

Not that it matters for occasional use like here, but why print to a pipe segment and re-count what comes out when you can just count while there and print when done?!

Might I suggest:

$ awk '{for(i=1;i<=NF;++i) if($i~ "^word$") count++} END { print count }' textfile

Of course due to the simple split by whitespace, that suffers from the same problem as my perl --> array example.

It will not recognize 'word' in *this* example line, due to the quotes.
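A sketch of one way around the quoting problem while staying in awk, assuming a "word" means a run of letters, digits and underscores: replace everything else with spaces before looking at the fields. The function name count_word_awk and the awk variable w are made up for this example.

```shell
# count_word_awk WORD FILE
# Turn punctuation into spaces, let awk re-split the line into fields,
# then count the fields that are exactly WORD.
count_word_awk() {
    awk -v w="$1" '{
        gsub(/[^A-Za-z0-9_]+/, " ")             # modifies $0, so fields are re-split
        for (i = 1; i <= NF; i++) if ($i == w) n++
    }
    END { print n + 0 }' "$2"                   # n + 0 prints 0 when nothing matched
}
```

With this, a line such as `this 'word' and "word", words` counts 2 for "word". Note that awk interprets backslash escapes in values passed with -v.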

Using perl you can fix that using \b to split.

$ perl -nle '$w{$_}++ for (split /\b/) }{ for (sort {$w{$b}<=>$w{$a}} keys %w) { print qq($w{$_}\t$_)}' tmp.txt

(but now it counts whitespace as words also)

Hein.

Re: how to get count of repeated words in a flat file

Raj D. — Sat, 05 Mar 2011 21:34:25 GMT

Hein,

That's great, thanks for adding the count. The pipe is not required, as the count can be done inside awk. Thanks! And the perl code is nice, especially the whitespace trick.

Rgds,
Raj.

Gopi,
pls post points once you are done.