Operating System - HP-UX

how to get count of repeated words in a flat file

 
Gopi Kishore m
Occasional Advisor

I want to know how many times a particular word is repeated in a particular flat file.

I am using the following command:

grep word textfile |wc -l

where word is the desired word and textfile is the file I am searching.

But the above command does not give the exact count in some scenarios: if the word is repeated within a line, that line is counted only once. Please suggest a solution.
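
To make the problem concrete, here is a minimal sketch. The sample text is invented for illustration and is not from the original post; it contains "word" three times, spread over two lines.

```shell
# Hypothetical sample text: "word" occurs 3 times, on 2 of the 3 lines.
sample='word and word again
nothing to see here
one more word'

# grep selects whole lines, so wc -l counts matching lines, not occurrences.
lines=$(printf '%s\n' "$sample" | grep word | wc -l)
echo "matching lines: $lines"
```

The pipeline reports the number of matching lines (2 here), which is why the count comes out too low whenever a line contains the word more than once.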
6 REPLIES
James R. Ferguson
Acclaimed Contributor

Re: how to get count of repeated words in a flat file

Hi:

$ perl -nle '$n++ while m{\bword\b}g;END{print $n}' file

...will look for the word "word" and count every instance in the file argument. Matches that begin at the start of a line or end at the end of a line, as well as matches within a line, are counted. If you substituted the string "words", only matches to "words" and not "word" would be found.

Regards!

...JRF...
Dennis Handly
Acclaimed Contributor

Re: how to get count of repeated words in a flat file

You could start with grep to find the matching lines, then use tr(1) or sed(1) to split the words onto separate lines, and then just count those:
grep word textfile | tr '[:space:]' '\012' | grep -c word
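
A variant of this pipeline, sketched under the assumption that only exact whole-word matches should count: squeeze each run of whitespace into a single newline, then count the lines that equal the word exactly. The function name count_by_lines is hypothetical, and the leading grep is dropped because the final grep does the filtering anyway.

```shell
# count_by_lines WORD FILE  (hypothetical helper name)
count_by_lines() {
    # Put one whitespace-separated token on each line, then count the
    # lines that are exactly WORD (-x whole line, -F fixed string).
    tr -s '[:space:]' '[\n*]' < "$2" | grep -cxF -- "$1"
}
```

Usage would be count_by_lines word textfile. Note that a token with attached punctuation, such as "word," or a quoted 'word', is not counted by this sketch.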
Hein van den Heuvel
Honored Contributor

Re: how to get count of repeated words in a flat file

I like JRF's solution.

Pay close attention to the usage of the '\b' regular expression component, which takes no space itself but specifies a word boundary. Just what is needed here, it seems.

Applied to the text of the original question, it reports a count of 5 for the word 'word'. For real work, the search word obviously needs to be changed or become a variable.

Depending on exactly what problem you are trying to solve, it may be beneficial to just count all words and then address the selected words for further processing.

Here is a 'one-liner' to demonstrate that:

$ perl -nle '$w{$_}++ for (split) }{ for (sort {$w{$b}<=>$w{$a}} keys %w) { print qq($w{$_}\t$_)}' tmp.txt
5 the
5 word
4 a
4 is
3 in
2 file
2 repeated
2 textfile
:

As you see, it also reports 5 for the word 'word'

Enjoy,
Hein
Raj D.
Honored Contributor

Re: how to get count of repeated words in a flat file

Gopi,

$ awk '{for(i=1;i<=NF;++i) if($i~ "^word$") print $i}' textfile| wc -l

Enjoy, Have fun!,
Raj.
" If u think u can , If u think u cannot , - You are always Right . "
Hein van den Heuvel
Honored Contributor

Re: how to get count of repeated words in a flat file

Raj,
That is also a fine solution, but I don't understand why you opted for a pipe. I guess I will never understand the typical Unix thinking involved. I come from VMS land, where for the longest time we did not have pipes. When we got them, we understood the costs involved.

Not that it matters for occasional use like here, but why print to a pipe segment and re-count what comes out when you can just count while there and print when done?!

Might I suggest:

$ awk '{for(i=1;i<=NF;++i) if($i~ "^word$") count++} END { print count }' textfile

Of course, because it simply splits on whitespace, it suffers from the same problem as my perl word-count example above.

It will not recognize 'word' in *this* example line, due to the quotes.
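
Staying with awk, one possible workaround is to trim leading and trailing punctuation from each field before comparing. This is only a sketch: the function name count_word_awk is hypothetical, and "punctuation" is assumed to mean anything other than ASCII letters, digits and underscore.

```shell
# count_word_awk WORD FILE  (hypothetical helper name)
count_word_awk() {
    awk -v w="$1" '
    {
        for (i = 1; i <= NF; i++) {
            f = $i
            sub(/^[^A-Za-z0-9_]+/, "", f)   # strip leading punctuation
            sub(/[^A-Za-z0-9_]+$/, "", f)   # strip trailing punctuation
            if (f == w) count++
        }
    }
    END { print count + 0 }' "$2"
}
```

This counts 'word' in quotes or followed by a comma, but it still treats a token such as word-word as a single field that does not match.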

Using perl you can fix that using \b to split.

$ perl -nle '$w{$_}++ for (split /\b/) }{ for (sort {$w{$b}<=>$w{$a}} keys %w) { print qq($w{$_}\t$_)}' tmp.txt

(but now it counts whitespace as words also)

Hein.

Raj D.
Honored Contributor

Re: how to get count of repeated words in a flat file

Hein,

That's great, thanks for adding the count. The pipe is not required, as the count can be done inside awk. And the perl code is nice, especially the whitespace trick.

Rgds,
Raj.

Gopi,
pls post points once you are done.