Saturday, May 12, 2012

Take Full Advantage of Your Multicore CPU with Parallel::Loops


One of the Perl modules that I have found very useful lately is Parallel::Loops (http://search.cpan.org/~pmorch/Parallel-Loops/lib/Parallel/Loops.pm), since it makes it easy to run the iterations of a loop in parallel and take advantage of all of the cores in your CPU.  There are limitations on when the module should be used: it is not appropriate for any loop in which one iteration depends on a previous iteration, or in which the execution order of the iterations must be maintained.  In many situations, however, each iteration can be treated as a distinct unit of work, and it is in these cases that the Parallel::Loops module provides an easy way to parallelize a part of your application.
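For instance, a loop like the following (a purely hypothetical illustration, not taken from the post) would not be a good candidate for Parallel::Loops, since each iteration depends on the value computed by the one before it:

# a running total - each iteration needs the result of the previous one,
# so the iterations cannot safely be run as independent processes
my @values = (1, 2, 3, 4, 5);
my $total = 0;
for my $value (@values) {
   $total += $value;   # depends on the total accumulated so far
}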

Let’s consider an example application: a simple link spider that takes the URLs of 4 Web pages and extracts the links present on each page using the WWW::Mechanize module’s (http://search.cpan.org/~jesse/WWW-Mechanize-1.72/lib/WWW/Mechanize.pm) links() method.  Instead of processing each Web page one by one, the Parallel::Loops module will be used to split the loop across 4 child processes and extract the links from each page simultaneously.  The Perl script is as follows:

#!/usr/bin/perl

# Copyright 2012- Christopher M. Frenz
# This script is free software - it may be used, copied, redistributed, and/or modified
# under the terms laid forth in the Perl Artistic License

use strict;
use warnings;
use Parallel::Loops;
use WWW::Mechanize;

my @links=('http://www.apress.com','http://www.oreilly.com','http://www.osborne.com','http://samspublishing.ca');

my $maxProcs = 4;
my $pl = Parallel::Loops->new($maxProcs);

my @newlinks;
$pl->share(\@newlinks);   # elements pushed in the children are returned to the parent

$pl->foreach (\@links, sub{
   my $link=$_;
   my $ua=WWW::Mechanize->new();
   $ua->get($link);
   my @urls=$ua->links();
   for my $url(@urls){
      $url=$url->url;
      push (@newlinks, $url);
   }
 
});

for my $newlink(@newlinks){
   print "$newlink \n";
}

 
In the script it is important to note the line “$pl->share(\@newlinks);”, as any elements pushed onto the child copies of this shared array are automatically transferred back to the parent process when the child process completes.  The print statement at the end of the script allows this to be verified, since the resulting set of links should contain links from each of the 4 Web pages.  The parallelization itself can be verified with the Linux “top” command.  In the screen shot below, note the multiple Perl processes (the “defunct” entries appeared because the screen shot was captured just as the child processes were coming to an end).
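It is also worth noting that the share() method is not limited to arrays; references to hashes can be shared as well, with any keys set in the child processes transferred back to the parent.  As a variation on the script above, the following sketch (hypothetical code with illustrative names such as %linkCounts, not part of the original script) records the number of links found on each page, keyed by the page’s URL:

#!/usr/bin/perl

use strict;
use warnings;
use Parallel::Loops;
use WWW::Mechanize;

my @links=('http://www.apress.com','http://www.oreilly.com');

my $pl = Parallel::Loops->new(2);

# share a hash instead of an array - keys set in the children
# are copied back to the parent as each child completes
my %linkCounts;
$pl->share(\%linkCounts);

$pl->foreach (\@links, sub{
   my $link=$_;
   my $ua=WWW::Mechanize->new();
   $ua->get($link);
   my @urls=$ua->links();
   $linkCounts{$link}=scalar @urls;   # number of links found on this page
});

for my $link(keys %linkCounts){
   print "$link: $linkCounts{$link} links\n";
}

Running this variant should print each URL along with the number of links extracted from it, which again confirms that values set in the child processes are passed back to the parent.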



