This is a script that I wrote a number of years ago that
provides a way of searching Pubmed abstracts using regular expression based
pattern matching and, hence, it provides a way to pull out mutation data and
other types of biomedical data that can be fit to a text pattern but not easily
represented with keywords. Publications
describing some sample bioinformatics usages of the script for mining mutation
data, gathering links to biomedical resources, and profiling developmental
factors that play a role in the inner ear can be found in the links below. I am reposting this code here because it
seems the original host (bioinformatics.org) no longer makes a copy of the
script available and I have recently received a number of requests for copies
of the script.
Sample Publications:
Code:
#!/usr/bin/perl # PREP (Perl RegExps for Pubmed) is a script that allows the use of # Perl regexs in the searching of Pubmed records, providing the ability to search # records for textual patterns as well as keywords # Copyright 2005- Christopher M. Frenz # This script is free sofware it may be used, copied, redistributed, and/or modified # under the terms laid forth in the Perl Artisic License # Please cite this script in any publication in which literature cited within the # publication was located using the PREP.pl script. # Usage: perl PREPv1-0.pl PubmedQueryTerms # Usage of this script requires the LWP and XML::LibXML modules are installed use LWP; use XML::LibXML; #Version 1.58 used for development and testing # Change the variable below to set the text pattern that Perl # will seek to match in the returned results my $regex='[ARNDCEQGHILKMFPSTWYV]\d+[ARNDCEQGHILKMFPSTWYV]'; my $request; my $response; my $query; # Concatenates arguments passed to script to form Pubmed query $query=join(" ", @ARGV); # Creates the URL to search Pubmed my $baseurl="http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?"; my $url=$baseurl . "db=Pubmed&retmax=1&usehistory=y&term=" . $query; # Searches Pubmed and Returns the number of results # as well as the session information needed for results retrieval $request=LWP::UserAgent->new(); $response=$request->get($url); my $results= $response->content; die unless $response->is_success; print "PubMed Search Results \n"; $results=~/<Count>(\d+)<\/Count>/; my $NumAbstracts=$1; $results=~/<QueryKey>(\d+)<\/QueryKey>/; my $QueryKey=$1; $results=~/<WebEnv>(.*?)<\/WebEnv>/; my $WebEnv=$1; print "$NumAbstracts are Available \n"; print "Query Key= $QueryKey \n"; print "WebEnv= $WebEnv \n"; # Opens a file for output open(OFile, ">PREPout.html"); my $parser=XML::LibXML->new; my $retmax=500; #Number of records to be retrieved per request-Max 500 my $retstart=0; #Record number to start retreival from # Creates the URL needed to retrieve results $baseurl="http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?"; my $url2="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids="; my $Count=0; # Retreives results in XML format for($retstart=0;$retstart<=$NumAbstracts;$retstart+=$retmax){ print "Processing record # $retstart \n"; $url=$baseurl . "rettype=abstract&retmode=xml&retstart=$retstart&retmax=$retmax&db=Pubmed&query_key=$QueryKey&WebEnv=$WebEnv"; $response=$request->get($url); $results=$response->content; die unless $response->is_success; # Uses a DOM based XML parser to process returned results my $domtree=$parser->parse_string($results); @Records=$domtree->getElementsByTagName("PubmedArticle"); my $i=0; foreach(@Records){ # Extracts element data for regex processing and output formatting $titles=$Records[$i]->getElementsByTagName("ArticleTitle"); $journals=$Records[$i]->getElementsByTagName("MedlineTA"); $volumes=$Records[$i]->getElementsByTagName("Volume"); $pgnums=$Records[$i]->getElementsByTagName("MedlinePgn"); $abstracts=$Records[$i]->getElementsByTagName("AbstractText"); $IDS=$Records[$i]->getElementsByTagName("PMID"); # Processes title and abstract for pattern match and if a match occurs # data is written to output if($titles=~/($regex)/ or $abstracts=~/($regex)/){ print OFile "<h1>Pattern Match: $1 </h1>\n"; print OFile "<h3><a href=\"$url2$IDS\">$titles </a></h3> \n"; print OFile "<p>$journals $volumes, $pgnums </p>\n"; print OFile "<p>$abstracts </p>\n\n"; $Count=$Count+1; } $i=$i+1; } } close OFile; print "$Count records matched the pattern";
No comments:
Post a Comment