Wednesday, August 29, 2012

Extract Information from JavaScript Enabled Content with Perl and V8

A common challenge for anyone who extracts information from Web pages is that more and more Web content is generated by JavaScript, which makes it much less accessible than content that resides solely in HTML. This is one reason JavaScript-based obfuscation is used to protect against email address harvesting, as in the HTML shown below:

<title>Contact XYZ inc</title>
<H1>Contact XYZ inc</H1><br>
<p>For more information about XYZ inc, please contact us at the following Email address</p>
<script type="text/javascript" language="javascript">
// Email obfuscator script 2.1 by Tim Williams, University of Arizona
// Random encryption key feature by Andrew Moulden, Site Engineering Ltd
// This code is freeware provided these four comment lines remain intact
// A wizard to generate this code is at
{ coded = "OKUxkq@KwtoO2K.0ko"
  key = "l7rE9B41VmIKiFwOLq2uUGYCQaWoMfzNASycJj3Ds8dtRkPv6XTHg0beh5xpZn"
  shift = coded.length
  link = ""
  for (i=0; i<coded.length; i++) {
    if (key.indexOf(coded.charAt(i))==-1) {
      ltr = coded.charAt(i)
      link += (ltr)
    }
    else {
      ltr = (key.indexOf(coded.charAt(i))-shift+key.length) % key.length
      link += (key.charAt(ltr))
    }
  }
document.write("<a href='mailto:"+link+"'>"+link+"</a>")
}
</script><noscript>Sorry, you need Javascript on to email me.</noscript>
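
The script above is a simple substitution cipher: every character of `coded` that appears in `key` is moved back `shift` positions within `key` (wrapping around), and any other character, such as "@" or ".", passes through unchanged. The sketch below restates that loop as a standalone JavaScript function that returns the address instead of writing a link. The function name `decodeAddress` is made up for illustration, and the default `shift = coded.length` is an assumption about how the generator sets the shift; with it, the sample values decode to a well-formed address.

```javascript
// Decode an address obfuscated with the key-based substitution shown above.
// Assumption: the shift defaults to the length of the coded string.
function decodeAddress(coded, key, shift = coded.length) {
  let link = "";
  for (const ch of coded) {
    const pos = key.indexOf(ch);
    if (pos === -1) {
      // Characters that are not in the key (e.g. "@" and ".") are kept as-is.
      link += ch;
    } else {
      // Step back `shift` positions in the key, wrapping around at the start.
      link += key.charAt((pos - shift + key.length) % key.length);
    }
  }
  return link;
}

const coded = "OKUxkq@KwtoO2K.0ko";
const key = "l7rE9B41VmIKiFwOLq2uUGYCQaWoMfzNASycJj3Ds8dtRkPv6XTHg0beh5xpZn";
console.log(decodeAddress(coded, key)); // prints "person@example.com"
```

Writing a decoder like this by hand only works for one obfuscation scheme at a time, which is why running the page's own script in a JavaScript engine is the more general approach.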

When I need to extract information from sites that use JavaScript to serve up content, I find the JavaScript::V8 module very helpful. The Perl code below uses the V8 JavaScript engine to extract the email address from the HTML page shown above.


use JavaScript::V8;
use LWP;
use Text::Balanced qw(extract_codeblock);
use strict;
use warnings;

#delimiter used to distinguish code blocks for use with Text::Balanced
my $delim='{}';

#downloads Web page
my $ua=LWP::UserAgent->new;
my $response=$ua->get('http://localhost/email.html');
die "Download failed: ".$response->status_line."\n" unless $response->is_success;
my $result=$response->decoded_content;

#print "$result\n\n";

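#extracts JavaScript (one possible approach): strips everything before the
#first "{" in the <script> element, then pulls out the balanced {...} block
my $js;
if ($result=~s/.*?<script[^>]*>[^{]*\{/{/s) {
    $js=extract_codeblock($result,$delim);
}
die "No JavaScript code block found\n" unless $js;

#modified JS to make it processable by V8 module (V8 has no document object)
$js=~s/document\.write/write/g;

#print "$js\n\n";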

#processes JS
my $context = JavaScript::V8::Context->new();

#collects whatever the script passes to write() instead of printing it directly
my $mail='';
$context->bind_function(write => sub { $mail.=join('',@_) });

$context->eval($js);

print "$mail\n\n";
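
The Perl code binds a `write` function because a bare V8 context has no browser `document` object, so the page script cannot call `document.write` as written. For comparison, here is a minimal JavaScript sketch of the other way around that gap: leave the script untouched and hand it a stub `document` that records what is written. The names `runWithStubbedDocument` and `extractMailto` are hypothetical helpers for illustration. Note that `new Function` executes arbitrary code, so this should only be pointed at scripts you trust.

```javascript
// Runs a page script with a stand-in for the browser's document object and
// returns everything the script passed to document.write().
function runWithStubbedDocument(scriptSource) {
  const written = [];
  const documentStub = {
    write: (...parts) => { written.push(parts.join("")); },
  };
  // The script sees `document` as a function parameter, so no DOM is needed.
  const run = new Function("document", scriptSource);
  run(documentStub);
  return written.join("");
}

// Pulls the address out of the first mailto: link in the written HTML.
function extractMailto(html) {
  const match = /mailto:([^'"?>\s]+)/.exec(html);
  return match ? match[1] : null;
}

// A tiny stand-in for an obfuscation script (not the one from the page above).
const sample = `document.write("<a href='mailto:" + "user" + "@" + "example.org" + "'>mail</a>")`;
console.log(extractMailto(runWithStubbedDocument(sample))); // prints "user@example.org"
```

Stubbing `document` avoids editing the script text, at the cost of having to supply every DOM call the script happens to use.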
