Hey, everybody loves BLAST right? I mean, geez, how can get it get any easier to do comparisons between one of your sequences and every other sequence in the known world? But, of course, this section isn’t about how cool BLAST is, since we already know that. It is about the problem with BLAST – it can be really difficult to deal with the volume of data generated by large runs, and to automate BLAST runs in general.
Fortunately, the Biopython folks know this only too well, so they’ve developed lots of tools for dealing with BLAST and making things much easier. This section details how to use these tools and do useful things with them.
Dealing with BLAST can be split up into two steps, both of which can be done from within Biopython. Firstly, running BLAST for your query sequence(s), and getting some output. Secondly, parsing the BLAST output in python for further analysis. We’ll start by talking about running the BLAST command line tools locally, and then discuss running BLAST via the web.
Running BLAST locally (as opposed to over the internet, see Section 6.2) has two advantages:
Dealing with proprietary or unpublished sequence data can be another reason to run BLAST locally. You may not be allowed to redistribute the sequences, so submitting them to the NCBI as a BLAST query would not be an option.
Biopython provides lots of nice code to enable you to call local BLAST executables from your scripts, and have full access to the many command line options that these executables provide. You can obtain local BLAST precompiled for a number of platforms at ftp://ftp.ncbi.nlm.nih.gov/blast/executables/, or can compile it yourself in the NCBI toolbox (ftp://ftp.ncbi.nlm.nih.gov/toolbox/).
The code for dealing with local BLAST is found in Bio.Blast.NCBIStandalone
, specifically in the functions blastall
, blastpgp
and rpsblast
, which correspond with the BLAST executables that their names imply.
Let’s use these functions to run a blastall
against a local database and return the results. First, we want to set up the paths to everything that we’ll need to do the BLAST. What we need to know is the path to the database (which should have been prepared using formatdb
, see ftp://ftp.ncbi.nlm.nih.gov/blast/documents/formatdb.html) to search against, the path to the file we want to search, and the path to the blastall
executable.
On Linux or Mac OS X your paths might look like this:
>>> my_blast_db = "/home/mdehoon/Data/Genomes/Databases/bsubtilis" # I used formatdb to create a BLAST database named bsubtilis # (for Bacillus subtilis) consisting of the following three files: # /home/mdehoon/Data/Genomes/Databases/bsubtilis.nhr # /home/mdehoon/Data/Genomes/Databases/bsubtilis.nin # /home/mdehoon/Data/Genomes/Databases/bsubtilis.nsq >>> my_blast_file = "m_cold.fasta" # A FASTA file with the sequence I want to BLAST >>> my_blast_exe = "/usr/local/blast/bin/blastall" # The name of my BLAST executable
while on Windows you might have something like this:
>>> my_blast_db = r"C:\Blast\Data\bsubtilis" # Assuming you used formatdb to create a BLAST database named bsubtilis # (for Bacillus subtilis) consisting of the following three files: # C:\Blast\Data\bsubtilis\bsubtilis.nhr # C:\Blast\Data\bsubtilis\bsubtilis.nin # C:\Blast\Data\bsubtilis\bsubtilis.nsq >>> my_blast_file = "m_cold.fasta" >>> my_blast_exe =r"C:\Blast\bin\blastall.exe"
The FASTA file used in this example is available here as well as online.
Now that we’ve got that all set, we are ready to run the BLAST and collect the results. We can do this with two lines:
>>> from Bio.Blast import NCBIStandalone >>> result_handle, error_handle = NCBIStandalone.blastall(my_blast_exe, "blastn", my_blast_db, my_blast_file)
Note that the Biopython interfaces to local blast programs returns two values. The first is a handle to the blast output, which is ready to either be saved or passed to a parser. The second is the possible error output generated by the blast command. See Section 12.1 for more about handles.
The error info can be hard to deal with, because if you try to do a error_handle.read()
and there was no error info returned, then the read()
call will block and not return, locking your script. In my opinion, the best way to deal with the error is only to print it out if you are not getting result_handle
results to be parsed, but otherwise to leave it alone.
This command will generate BLAST output in XML format, as that is the format expected by the XML parser, described in Section 6.4. For plain text output, use the align_view='0'
keyword. To parse text output instead of XML output, see the Section 6.6 below. However, parsing text output is not recommended, as the BLAST plain text output changes frequently, breaking our parsers.
If you are interested in saving your results to a file before parsing them, see Section 6.3. To find out how to parse the BLAST results, go to Section 6.4
The first step in automating BLASTing is to make everything accessible from Python scripts. So, Biopython contains code that allows you to run the WWW version of BLAST (http://www.ncbi.nlm.nih.gov/BLAST/) directly from your Python scripts. This is very nice, especially since otherwise BLAST can be a real pain to deal with from scripts, especially with the whole BLAST queue thing and the separate results page.
The code to deal with the WWW version of BLAST is found in the
Bio.Blast.NCBIWWW
module, and the qblast
function. Let’s
say we want to BLAST info we have in a FASTA formatted file against
the database. First, we need to get the info in the FASTA file.
The easiest way to do this is to use the Bio.SeqIO
module (see
Chapter 4). In this example we’ll use the
Bio.SeqIO.read
function to turn a FASTA file containing a single
entry into a SeqRecord object:
>>> from Bio import SeqIO >>> record = SeqIO.read(open("m_cold.fasta"), format="fasta")
Now we can take the sequence as a plain string from the SeqRecord and run BLAST on it. The code to do the simplest possible BLAST (a simple blastn of the FASTA file against all of the non-redundant databases) is:
>>> from Bio.Blast import NCBIWWW >>> result_handle = NCBIWWW.qblast("blastn", "nr", record.seq.tostring())
The first three arguments to our NCBIWWW.qblast
function are non-optional:
qblast
only works with blastn, blastp, blastx, tblast
and tblastx.
The qblast
function also take a number of other option arguments
which are basically analogous to the different parameters you can set
on the BLAST web page. We’ll just highlight a few of them here:
qblast
function can return the BLAST results in various
formats, which you can choose with the optional format_type
keyword:
"HTML"
, "Text"
, "ASN.1"
, or "XML"
.
The default is "XML"
, as that is the format expected by the parser,
described in section 6.4 below.
expect
sets the expectation or e-value threshold.
For more about the optional BLAST arguments, we refer you to the NCBI’s own documentation, or that built into Biopython:
>>> from Bio.Blast import NCBIWWW >>> help(NCBIWWW.qblast)
After you have set the search options, you are all ready to BLAST. Biopython takes care of worrying about when the results are available, and will pause until it can get the results and return them.
Before parsing the results, it is often useful to save them into a file so that you can use them later without having to go back and re-blast everything. I find this especially useful when debugging my code that extracts info from the BLAST files, but it could also be useful just for making backups of things you’ve done.
If you don’t want to save the BLAST output, you can skip to section 6.4. If you do, read on.
We need to be a bit careful since we can use result_handle.read()
to
read the BLAST output only once – calling result_handle.read()
again
returns an empty string. First, we use read()
and store all of
the information from the handle into a string:
>>> blast_results = result_handle.read()
Next, we save this string in a file:
>>> save_file = open("my_blast.xml", "w") >>> save_file.write(blast_results) >>> save_file.close()
After doing this, the results are in the file my_blast.xml
and the
variable blast_results
contains the BLAST results in a string
form. However, the parse
function of the BLAST parser (described
in 6.4) takes a file-handle-like object, not a
plain string. To get a handle, there are two things you can do:
cStringIO
. The
following code will turn the plain string into a handle, which we can
feed directly into the BLAST parser:
>>> import cStringIO >>> result_handle = cStringIO.StringIO(blast_results)
>>> result_handle = open("my_blast.xml")
Now that we’ve got the BLAST results, we are ready to do something with them, so this leads us right into the parsing section.
As mentioned above, BLAST can generate output in various formats, such as XML, HTML, and plain text. Originally, Biopython had a parser for BLAST plain text and HTML output, as these were the only output formats supported by BLAST. Unfortunately, the BLAST output in these formats kept changing, each time breaking the Biopython parsers. As keeping up with changes in BLAST became a hopeless endeavor, especially with users running different BLAST versions, we now recommend to parse the output in XML format, which can be generated by recent versions of BLAST. Not only is the XML output more stable than the plain text and HTML output, it is also much easier to parse automatically, making Biopython a whole lot more stable.
Though deprecated, the parsers for BLAST output in plain text or HTML output are still available in Biopython (see Section 6.6). Use them at your own risk: they may or may not work, depending on which BLAST version you’re using.
You can get BLAST output in XML format in various ways. For the parser, it doesn’t matter how the output was generated, as long as it is in the XML format.
The important point is that you do not have to use Biopython scripts to fetch the data in order to be able to parse it.
Doing things in one of these ways, you then need to get a handle
to the results. In Python, a handle is just a nice general way of
describing input to any info source so that the info can be retrieved
using read()
and readline()
functions. This is the type
of input the BLAST parser (and most other Biopython parsers) take.
If you followed the code above for interacting with BLAST through a
script, then you already have result_handle
, the handle to the
BLAST results. For example:
>>> from Bio import SeqIO >>> record = SeqIO.read(open("m_cold.fasta"), format="fasta") >>> from Bio.Blast import NCBIWWW >>> result_handle = NCBIWWW.qblast("blastn", "nr", record.seq.tostring())
If instead you ran BLAST some other way, and have the
BLAST output (in XML format) in the file my_blast.xml
, all you
need to do is to open the file for reading:
>>> result_handle = open("my_blast.xml")
Now that we’ve got a handle, we are ready to parse the output. The code to parse it is really quite small:
>>> from Bio.Blast import NCBIXML >>> blast_records = NCBIXML.parse(result_handle)
To understand what NCBIXML.parse
returns, there are two things
that you need to keep in mind:
To be able to handle these situations, NCBIXML.parse
returns an
iterator (just like Bio.SeqIO.parse
). In plain English, an iterator
allows you to step through the BLAST output, retrieving BLAST records one
by one for each BLAST search:
>>> blast_record = blast_records.next() # ... do something with blast_record >>> blast_record = blast_records.next() # ... do something with blast_record >>> blast_record = blast_records.next() # ... do something with blast_record >>> blast_record = blast_records.next() Traceback (most recent call last): File "<stdin>", line 1, in <module> StopIteration # No further records
Or, you can use a for
-loop:
>>> for blast_record in blast_records: ... # Do something with blast_record
Note though that you can step through the BLAST records only once. Usually, from each BLAST record you would save the information that you are interested in. If you want to save all returned BLAST records, you can convert the iterator into a list:
>>> blast_records = list(blast_records)
Now you can access each BLAST record in the list with an index as usual. If your BLAST file is huge though, you may run into problems trying to save them all in a list.
Usually, you’ll be running one BLAST search at a time. Then, all you need
to do is to pick up the first (and only) BLAST record in blast_records
:
>>> blast_record = blast_records.next()
I guess by now you’re wondering what is in a BLAST record.
A BLAST Record contains everything you might ever want to extract from the BLAST output. Right now we’ll just show an example of how to get some info out of the BLAST report, but if you want something in particular that is not described here, look at the info on the record class in detail, and take a gander into the code or automatically generated documentation – the docstrings have lots of good info about what is stored in each piece of information.
To continue with our example, let’s just print out some summary info about all hits in our blast report greater than a particular threshold. The following code does this:
>>> E_VALUE_THRESH = 0.04 >>> for alignment in blast_record.alignments: ... for hsp in alignment.hsps: ... if hsp.expect < E_VALUE_THRESH: ... print '****Alignment****' ... print 'sequence:', alignment.title ... print 'length:', alignment.length ... print 'e value:', hsp.expect ... print hsp.query[0:75] + '...' ... print hsp.match[0:75] + '...' ... print hsp.sbjct[0:75] + '...'
This will print out summary reports like the following:
****Alignment**** sequence: >gb|AF283004.1|AF283004 Arabidopsis thaliana cold acclimation protein WCOR413-like protein alpha form mRNA, complete cds length: 783 e value: 0.034 tacttgttgatattggatcgaacaaactggagaaccaacatgctcacgtcacttttagtcccttacatattcctc... ||||||||| | ||||||||||| || |||| || || |||||||| |||||| | | |||||||| ||| ||... tacttgttggtgttggatcgaaccaattggaagacgaatatgctcacatcacttctcattccttacatcttcttc...
Basically, you can do anything you want to with the info in the BLAST report once you have parsed it. This will, of course, depend on what you want to use it for, but hopefully this helps you get started on doing what you need to do!
An important consideration for extracting information from a BLAST report is the type of objects that the information is stored in. In Biopython, the parsers return Record
objects, either Blast
or PSIBlast
depending on what you are parsing. These objects are defined in Bio.Blast.Record
and are quite complete.
Here are my attempts at UML class diagrams for the Blast
and PSIBlast
record classes. If you are good at UML and see mistakes/improvements that can be made, please let me know. The Blast class diagram is shown in Figure 6.5.
The PSIBlast record object is similar, but has support for the rounds that are used in the iteration steps of PSIBlast. The class diagram for PSIBlast is shown in Figure 6.5.
Older versions of Biopython had parsers for BLAST output in plain text or HTML format. Over the years, we discovered that it is very hard to maintain these parsers in working order. Basically, any small change to the BLAST output in newly released BLAST versions tends to cause the plain text and HTML parsers to break. We therefore recommend parsing BLAST output in XML format, as described in section 6.4. Whereas the plain text and HTML parsers are still available in Biopython, use them at your own risk. They may or may not work, depending on which BLAST versions you’re using.
The plain text BLAST parser is located in Bio.Blast.NCBIStandalone
.
As with the XML parser, we need to have a handle object that we can pass to the parser. The handle must implement the readline()
method and do this properly. The common ways to get such a handle are to either use the provided blastall
or blastpgp
functions to run the local blast, or to run a local blast via the command line, and then do something like the following:
>>> result_handle = open("my_file_of_blast_output.txt")
Well, now that we’ve got a handle (which we’ll call result_handle
),
we are ready to parse it. This can be done with the following code:
>>> from Bio.Blast import NCBIStandalone >>> blast_parser = NCBIStandalone.BlastParser() >>> blast_record = blast_parser.parse(result_handle)
This will parse the BLAST report into a Blast Record class (either a Blast or a PSIBlast record, depending on what you are parsing) so that you can extract the information from it. In our case, let’s just use print out a quick summary of all of the alignments greater than some threshold value.
>>> E_VALUE_THRESH = 0.04 >>> for alignment in b_record.alignments: ... for hsp in alignment.hsps: ... if hsp.expect < E_VALUE_THRESH: ... print '****Alignment****' ... print 'sequence:', alignment.title ... print 'length:', alignment.length ... print 'e value:', hsp.expect ... print hsp.query[0:75] + '...' ... print hsp.match[0:75] + '...' ... print hsp.sbjct[0:75] + '...'
If you also read the section 6.4 on parsing BLAST XML output, you’ll notice that the above code is identical to what is found in that section. Once you parse something into a record class you can deal with it independent of the format of the original BLAST info you were parsing. Pretty snazzy!
Sure, parsing one record is great, but I’ve got a BLAST file with tons of records – how can I parse them all? Well, fear not, the answer lies in the very next section.
Of course, local blast is cool because you can run a whole bunch of sequences against a database and get back a nice report on all of it. So, Biopython definitely has facilities to make it easy to parse humongous files without memory problems.
We can do this using the blast iterator. To set up an iterator, we first set up a parser, to parse our blast reports in Blast Record objects:
>>> from Bio.Blast import NCBIStandalone >>> blast_parser = NCBIStandalone.BlastParser()
Then we will assume we have a handle to a bunch of blast records, which we’ll call result_handle
. Getting a handle is described in full detail above in the blast parsing sections.
Now that we’ve got a parser and a handle, we are ready to set up the iterator with the following command:
>>> blast_iterator = NCBIStandalone.Iterator(blast_handle, blast_parser)
The second option, the parser, is optional. If we don’t supply a parser, then the iterator will just return the raw BLAST reports one at a time.
Now that we’ve got an iterator, we start retrieving blast records (generated by our parser) using next()
:
>>> blast_record = blast_iterator.next()
Each call to next will return a new record that we can deal with. Now we can iterate through this records and generate our old favorite, a nice little blast report:
>>> for b_record in b_iterator : ... E_VALUE_THRESH = 0.04 ... for alignment in b_record.alignments: ... for hsp in alignment.hsps: ... if hsp.expect < E_VALUE_THRESH: ... print '****Alignment****' ... print 'sequence:', alignment.title ... print 'length:', alignment.length ... print 'e value:', hsp.expect ... if len(hsp.query) > 75: ... dots = '...' ... else: ... dots = '' ... print hsp.query[0:75] + dots ... print hsp.match[0:75] + dots ... print hsp.sbjct[0:75] + dots
The iterator allows you to deal with huge blast records without any memory problems, since things are read in one at a time. I have parsed tremendously huge files without any problems using this.
One really ugly problem that happens to me is that I’ll be parsing a huge blast file for a while, and the parser will bomb out with a ValueError. This is a serious problem, since you can’t tell if the ValueError is due to a parser problem, or a problem with the BLAST. To make it even worse, you have no idea where the parse failed, so you can’t just ignore the error, since this could be ignoring an important data point.
We used to have to make a little script to get around this problem, but the Bio.Blast
module now includes a BlastErrorParser
which really helps make this easier. The BlastErrorParser
works very similar to the regular BlastParser
, but it adds an extra layer of work by catching ValueErrors that are generated by the parser, and attempting to diagnose the errors.
Let’s take a look at using this parser – first we define the file we are going to parse and the file to write the problem reports to:
>>> import os >>> blast_file = os.path.join(os.getcwd(), "blast_out", "big_blast.out") >>> error_file = os.path.join(os.getcwd(), "blast_out", "big_blast.problems")
Now we want to get a BlastErrorParser
:
>>> from Bio.Blast import NCBIStandalone >>> error_handle = open(error_file, "w") >>> blast_error_parser = NCBIStandalone.BlastErrorParser(error_handle)
Notice that the parser take an optional argument of a handle. If a handle is passed, then the parser will write any blast records which generate a ValueError to this handle. Otherwise, these records will not be recorded.
Now we can use the BlastErrorParser
just like a regular blast parser. Specifically, we might want to make an iterator that goes through our blast records one at a time and parses them with the error parser:
>>> result_handle = open(blast_file) >>> iterator = NCBIStandalone.Iterator(result_handle, blast_error_parser)
We can read these records one a time, but now we can catch and deal with errors that are due to problems with Blast (and not with the parser itself):
>>> try: ... next_record = iterator.next() ... except NCBIStandalone.LowQualityBlastError, info: ... print "LowQualityBlastError detected in id %s" % info[1]
The .next()
method is normally called indirectly via a for
-loop.
Right now the BlastErrorParser
can generate the following errors:
ValueError
– This is the same error generated by the regular BlastParser, and is due to the parser not being able to parse a specific file. This is normally either due to a bug in the parser, or some kind of discrepancy between the version of BLAST you are using and the versions the parser is able to handle.LowQualityBlastError
– When BLASTing a sequence that is of really bad quality (for example, a short sequence that is basically a stretch of one nucleotide), it seems that Blast ends up masking out the entire sequence and ending up with nothing to parse. In this case it will produce a truncated report that causes the parser to generate a ValueError. LowQualityBlastError
is reported in these cases. This error returns an info item with the following information:
item[0]
– The error message
item[1]
– The id of the input record that caused the error. This is really useful if you want to record all of the records that are causing problems.
As mentioned, with each error generated, the BlastErrorParser will write the offending record to the specified error_handle
. You can then go ahead and look and these and deal with them as you see fit. Either you will be able to debug the parser with a single blast report, or will find out problems in your blast runs. Either way, it will definitely be a useful experience!
Hopefully the BlastErrorParser
will make it much easier to debug and deal with large Blast files.
We should write some stuff to make it easier to deal directly with PSIBlast from scripts (i. e. output the align file in the proper format from an alignment). I need to look at PSIBlast more and come up with some good ways of going this...