Friday, May 11, 2012

How Many Times Did that Word Appear?


Below is a simple Perl script for counting how many times each word of interest appears in a text file (Input.txt).  Words of interest are specified in the regular expression stored in the $bucket variable, and the number of occurrences of each word is written to a file called WordFreqs.txt.  To illustrate its use, let’s consider the following as the contents of the Input.txt file:

red green yellow blue red
red red yellow
yellow green blue blue green
green red

Now we can run that file through the following Perl script:

#!/usr/bin/perl

# Copyright 2012- Christopher M. Frenz
# This script is free software - it may be used, copied, redistributed, and/or modified
# under the terms laid forth in the Perl Artistic License

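use strict;
use warnings;

my %count;

# Sort comparator: highest count first
sub by_count {
   $count{$b} <=> $count{$a};
}

open(my $in, '<', 'Input.txt') or die "Cannot open Input.txt: $!";
open(my $out, '>', 'WordFreqs.txt') or die "Cannot open WordFreqs.txt: $!";
my $bucket = 'red|blue|green';

while (my $line = <$in>) {
   # split ' ' breaks on whitespace and skips any leading blank field
   my @words = split(' ', $line);
   foreach my $word (@words) {
      # Case-insensitive match; $1 holds the text that matched
      if ($word =~ /($bucket)/i) {
         $count{$1}++;
      }
   }
}
foreach my $word (sort by_count keys %count) {
   print $out "$word occurs $count{$word} times\n";
}

close $in;
close $out;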

which will yield a WordFreqs.txt file with the following contents:

red occurs 5 times
green occurs 4 times
blue occurs 3 times
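
One thing to keep in mind is that the pattern is matched anywhere inside each whitespace-separated token, so a word such as "bored" would also count toward red, and "Red" and "red" would be tallied under separate keys. If that is not what you want, a variation along the following lines might help. This is only a sketch: the count_words helper and the sample text are made up for illustration and are not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): counts whole-word,
# case-insensitive occurrences of the words in @$wanted within $text.
sub count_words {
    my ($text, $wanted) = @_;

    # quotemeta guards against regex metacharacters in the word list
    my $alternation = join '|', map { quotemeta($_) } @$wanted;

    my %count;
    # The \b anchors keep "red" from matching inside "bored" or "reddish"
    while ($text =~ /\b($alternation)\b/gi) {
        $count{ lc $1 }++;    # fold case so "Red" and "red" share one tally
    }
    return %count;
}

# Made-up sample text for illustration
my $sample = "red green yellow blue red\nRed bored reddish yellow\n";
my %count  = count_words($sample, [qw(red blue green)]);

# Highest count first; ties are broken alphabetically for stable output
for my $word (sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count) {
    print "$word occurs $count{$word} times\n";
}
```

Building the pattern from a list rather than a hand-written string also makes it easier to add words later without worrying about characters that have a special meaning in regular expressions.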

6 comments:

Bruno Pinto said...

Thank you so much, this was really useful. Is there any way I can use wildcards with this? I need to count XML tags in a file and need to know how many times each one appears, so the $bucket variable would be something like "<* *>", but it's only counting how many times the "<" appears. Is there any way to make it count the expressions between <>?

cfrenz said...

You could theoretically use a modification such as the following to perform this:

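#!/usr/bin/perl
use strict;
use warnings;

my $XML='<tag1><tag2></tag2><tag2></tag2></tag1>';
my %count;

# Non-greedy match captures the text between each '<' and the next '>'.
# Closing tags are captured too, under keys such as "/tag2".
while($XML=~/<(.*?)>/g){
   $count{$1}++;
}

# Hash order is not guaranteed, so the lines may print in any order
while( my ($key,$value)=each(%count)){
   print "$key => $value\n";
}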

However, this is the type of situation in which you would probably be better served by an XML parsing module than by regular expressions. One example of such a parser is XML::LibXML.
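
A minimal sketch of what the parser-based approach might look like follows. It assumes XML::LibXML has been installed from CPAN (it does not ship with core Perl), and the count_elements helper is a hypothetical name chosen for illustration. Unlike the regular expression version, it counts each element once by name and does not record closing tags or attributes.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;    # CPAN module, assumed to be installed

# Hypothetical helper: tallies element names in a well-formed XML string.
sub count_elements {
    my ($xml) = @_;

    # load_xml dies if the input is not well-formed XML
    my $doc = XML::LibXML->load_xml(string => $xml);

    my %count;
    # The XPath expression //* selects every element in the document
    $count{ $_->nodeName }++ for $doc->findnodes('//*');
    return %count;
}

my $XML   = '<tag1><tag2></tag2><tag2></tag2></tag1>';
my %count = count_elements($XML);

# Sort the names so the output order is predictable
for my $name (sort keys %count) {
    print "$name => $count{$name}\n";
}
```

To read from a file instead of a string, load_xml also accepts a location argument in place of string.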

Bruno Pinto said...

I'm new to Perl and programming in general, so I have no idea how to use that. But anyway, I'll try your code. In case it doesn't work, I'll try an XML parser. Thank you very much again.

Bruno Pinto said...

No luck

Bruno Pinto said...

Could you give me an example of how to do this with an XML parser?

Bruno Pinto said...

Solved it, thanks for your help. I got it parsing everything between '<' and '>'.

Any idea on how to make it recursive?
http://stackoverflow.com/questions/16689082/recursive-open-files-in-perl