A common challenge for anyone who extracts information from Web pages is
that more and more Web content is generated by JavaScript, which makes that
content much less accessible than content that resides solely in HTML. This is
one reason JavaScript-based obfuscation is used to protect against email address
harvesting, as in the HTML shown below:
<title>Contact XYZ inc</title>
<H1>Contact XYZ inc</H1><br>
<p>For more information about XYZ inc, please contact us at the following Email address</p>
<script type="text/javascript" language="javascript">
<!--
// Email obfuscator script 2.1 by Tim Williams, University of Arizona
// Random encryption key feature by Andrew Moulden, Site Engineering Ltd
// This code is freeware provided these four comment lines remain intact
// A wizard to generate this code is at http://www.jottings.com/obfuscator/
{ coded = "OKUxkq@KwtoO2K.0ko"
key = "l7rE9B41VmIKiFwOLq2uUGYCQaWoMfzNASycJj3Ds8dtRkPv6XTHg0beh5xpZn"
shift=coded.length
link=""
for (i=0; i<coded.length; i++) {
if (key.indexOf(coded.charAt(i))==-1) {
ltr = coded.charAt(i)
link += (ltr)
}
else {
ltr = (key.indexOf(coded.charAt(i))-shift+key.length) % key.length
link += (key.charAt(ltr))
}
}
document.write("<a href='mailto:"+link+"'>"+link+"</a>")
}
//-->
</script><noscript>Sorry, you need Javascript on to email me.</noscript>
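To see what the embedded script computes, the decoding loop can be pulled out into a standalone function. The sketch below is a minimal restatement in modern JavaScript; the name decodeAddress is hypothetical, and the coded and key values are copied from the page above.

```javascript
// Standalone restatement of the obfuscator's decoding loop.
// decodeAddress is a hypothetical name, not part of the original script.
function decodeAddress(coded, key) {
  // The shift is simply the length of the encoded string.
  const shift = coded.length;
  let link = "";
  for (const ch of coded) {
    const pos = key.indexOf(ch);
    if (pos === -1) {
      // Characters missing from the key (such as "@" and ".") pass through.
      link += ch;
    } else {
      // Rotate backwards through the key by `shift` positions, wrapping around.
      link += key.charAt((pos - shift + key.length) % key.length);
    }
  }
  return link;
}

const coded = "OKUxkq@KwtoO2K.0ko";
const key = "l7rE9B41VmIKiFwOLq2uUGYCQaWoMfzNASycJj3Ds8dtRkPv6XTHg0beh5xpZn";
const decoded = decodeAddress(coded, key);
console.log(decoded); // person@example.com
```

Each character is looked up in the key and replaced by the character 18 positions earlier (18 being the length of the coded string), so the scheme is a substitution cipher whose key ships with the page. Evaluating the script, or reimplementing this loop, is enough to recover the address.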
When I have to extract information from sites that use JavaScript to serve
up content, I find the JavaScript::V8 module very helpful. Here is a segment
of Perl code that uses the V8 JavaScript engine to extract the email address
from the HTML page shown above.
#!/usr/bin/perl
use JavaScript::V8;
use LWP;
use Text::Balanced qw(extract_codeblock);
use strict;
use warnings;
#delimiter used to distinguish code blocks for use with Text::Balanced
my $delim='{}';
#downloads Web page
my $ua=LWP::UserAgent->new;
my $response=$ua->get('http://localhost/email.html');
my $result=$response->content;
#print "$result\n\n";
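#extracts JavaScript: drops everything up to the opening brace that follows
#the obfuscator URL comment, then pulls out the balanced {...} block
my $js;
if($result=~s/.*?http:\/\/www\.jottings\.com\/obfuscator\/\s*\{/{/s){
$js=extract_codeblock($result,$delim);
}
die "Could not find the obfuscator script\n" unless defined $js;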
#modifies JS to make it processable by V8 module (V8 has no document object)
$js=~s/document\.write/write/;
$js=~s/'/\\'/g;
#print "$js\n\n";
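#processes JS
my $context = JavaScript::V8::Context->new();
#collects whatever the script passes to write() instead of printing it
my $mail='';
$context->bind_function(write => sub { $mail.=join('',@_) });
$context->eval($js);
die "JavaScript error: $@" if $@;
print "$mail\n\n";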
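The same idea, evaluating the extracted script in an embedded engine with a stand-in for document.write, can also be sketched in JavaScript itself. The example below assumes Node.js and its built-in vm module rather than JavaScript::V8; captureDocumentWrite is a hypothetical helper, and the script text is copied from the page above instead of being downloaded.

```javascript
const vm = require("vm");

// Hypothetical helper: run a script body in a fresh context and collect
// everything it passes to document.write().
function captureDocumentWrite(scriptSource) {
  const written = [];
  const sandbox = {
    // Minimal stand-in for the browser's document object.
    document: { write: (...parts) => written.push(parts.join("")) },
  };
  vm.runInNewContext(scriptSource, sandbox, { timeout: 1000 });
  return written.join("");
}

// Script block copied from the page above (assumed to be already extracted).
const script = `{ coded = "OKUxkq@KwtoO2K.0ko"
key = "l7rE9B41VmIKiFwOLq2uUGYCQaWoMfzNASycJj3Ds8dtRkPv6XTHg0beh5xpZn"
shift=coded.length
link=""
for (i=0; i<coded.length; i++) {
if (key.indexOf(coded.charAt(i))==-1) {
ltr = coded.charAt(i)
link += (ltr)
}
else {
ltr = (key.indexOf(coded.charAt(i))-shift+key.length) % key.length
link += (key.charAt(ltr))
}
}
document.write("<a href='mailto:"+link+"'>"+link+"</a>")
}`;

const html = captureDocumentWrite(script);
// Pull the address out of the generated mailto: link.
const match = html.match(/mailto:([^'"]+)/);
const address = match ? match[1] : null;
console.log(address); // person@example.com
```

Stubbing only document.write keeps the sandbox small, which is sufficient for this particular script. Node's documentation states that vm is not a security mechanism, so scripts from untrusted pages would call for stronger isolation.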