One of the Perl modules that I have found very useful lately
is the Parallel::Loops module (http://search.cpan.org/~pmorch/Parallel-Loops/lib/Parallel/Loops.pm),
since it makes it easy to run the iterations of a loop in parallel and take
advantage of all of the cores in your CPU.
There are some limitations on when the module should be used - it is not suited
to any loop where one iteration depends on the result of a previous iteration or
where the execution order of iterations needs to be maintained - but there are many
situations where each iteration can be treated as a distinct entity. It is in these cases that the
Parallel::Loops module provides an easy way to parallelize a part of your
application.
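To make the distinction concrete, here is a minimal sketch (my own illustration, separate from the spider example that follows) contrasting a loop that should not be parallelized with one that can be. The running total depends on the previous iteration, while each square depends only on its own input, so only the second loop is a good fit for Parallel::Loops:

#!/usr/bin/perl
use strict;
use warnings;
use Parallel::Loops;

# Order-dependent loop: each iteration needs the previous total,
# so it is NOT a good candidate for parallelization.
my $total = 0;
$total += $_ for (1 .. 10);
print "Running total: $total\n";

# Independent iterations: each square depends only on its own input,
# so the work can be split across child processes. The shared hash
# carries the results from the children back to the parent.
my $pl = Parallel::Loops->new(4);
my %square;
$pl->share(\%square);
$pl->foreach([1 .. 10], sub {
    $square{$_} = $_ * $_;
});
print "$_ squared is $square{$_}\n" for sort { $a <=> $b } keys %square;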
Let’s consider an example application of a simple link
spider that will take the URLs of 4 Web pages and extract the links present on
each page using the WWW::Mechanize module’s (http://search.cpan.org/~jesse/WWW-Mechanize-1.72/lib/WWW/Mechanize.pm)
links() method. Instead of processing
each Web page one by one, the Parallel::Loops module will be used to
parallelize our loop into 4 processes and simultaneously extract the links from
each page. The Perl script is as
follows:
#!/usr/bin/perl
# Copyright 2012- Christopher M. Frenz
# This script is free software - it may be used, copied, redistributed, and/or modified
# under the terms laid forth in the Perl Artistic License

use strict;
use warnings;
use Parallel::Loops;
use WWW::Mechanize;

my @links=('http://www.apress.com','http://www.oreilly.com','http://www.osborne.com','http://samspublishing.ca');

# Run up to 4 child processes, one per URL
my $maxProcs = 4;
my $pl = Parallel::Loops->new($maxProcs);

# Links pushed onto this array in the children are returned to the parent
my @newlinks;
$pl->share(\@newlinks);

$pl->foreach(\@links, sub {
    my $link = $_;
    my $ua   = WWW::Mechanize->new();
    $ua->get($link);

    # links() returns WWW::Mechanize::Link objects; extract the URL strings
    my @urls = $ua->links();
    for my $url (@urls) {
        $url = $url->url;
        push(@newlinks, $url);
    }
});

# Print the combined set of links gathered from all 4 pages
for my $newlink (@newlinks) {
    print "$newlink\n";
}
In the script it is important to note the line “$pl->share(\@newlinks);”, since any elements pushed onto the child copies
of this shared array are automatically transferred back to the parent process
when the child process completes. The
print statement at the end of the script provides a way to verify this, as
the resultant set of links should contain links from each of the 4 Web
pages. The parallelization itself can be verified
with the Linux “top” command. In the
image below, note the multiple Perl processes (the “defunct” entries appear because I
captured the screen shot as the child processes were coming to an end).
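Parallel::Loops can share hashes in the same way as arrays, which is handy when you want each child's results keyed by its input. The sketch below is a hypothetical variation on the spider above (the %linkcount hash and the reduced URL list are my own additions, not part of the original script) that records how many links each page contains:

#!/usr/bin/perl
use strict;
use warnings;
use Parallel::Loops;
use WWW::Mechanize;

my @links = ('http://www.apress.com', 'http://www.oreilly.com');

my $pl = Parallel::Loops->new(2);

# A shared hash works like the shared array: keys and values written in
# the children are copied back to the parent when each child finishes.
my %linkcount;
$pl->share(\%linkcount);

$pl->foreach(\@links, sub {
    my $link = $_;
    my $ua   = WWW::Mechanize->new();
    $ua->get($link);
    my @urls = $ua->links();
    $linkcount{$link} = scalar @urls;   # number of links on this page
});

print "$_ : $linkcount{$_} links\n" for keys %linkcount;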