BLAST, the Basic Local Alignment Search Tool (http://blast.ncbi.nlm.nih.gov/Blast.cgi), is a data mining tool provided by the National Center for Biotechnology Information (NCBI). It allows users to input a nucleic acid or protein sequence and search it against millions of other sequences in the database, identifying matches based on similarity rather than exact identity. Ideally, the search results identify known sequences and provide insight into the possible identity and function of the sequence of interest.
A BLAST search can be used for many purposes. During the 2016 in-person NCBI workshop “Fundamentals of Bioinformatics and Searching,” Diane Rein, the Bioinformatics and Molecular Biology Liaison at the Health Science Library at the University at Buffalo, shared several common ways that BLAST could be used:
- “Infer the function of a gene/protein by finding statistically significant matches, based on sequence similarity to:
  - protein or nucleotide sequence of interest
  - a genomic region
  - Regulatory sequences
  - RNA genes
- Map sequence to a chromosome
- Determine if sequence is expressed as transcript
- Find conserved domains with conserved functions in your sequence of interest
- Search for sequence motifs or patterns that are similar to a sequence of interest in a particular region
- Compare known sequences from different taxonomic groups
- Identify sequences to support PCR cloning
- Compare two or more sequences looking for:
  - Cloning/sequencing artifacts” (Rein, 2016)
BLAST runs a two-phase search. The first phase searches for exact matches to short stretches of the sequence of interest, each called a word. For a nucleotide search, the default word length is 28 nucleotides; for a protein search, it is 6 amino acids. For either search you can change the word length; the longer the word, the more stringent the search. The program “chops” the query into overlapping words of that length, forming each new word by shifting one nucleotide or amino acid along the query. Once the query has been chopped into words, BLAST looks for matches to them within the database.
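The word-chopping step can be illustrated with a minimal Python sketch. The function names `chop_into_words` and `find_word_matches` are made up for this example and are not part of BLAST; a short word size of 4 is used only to keep the output readable.

```python
def chop_into_words(sequence, word_size):
    """Return every overlapping word of length word_size,
    sliding one position at a time along the sequence."""
    if word_size <= 0:
        raise ValueError("word_size must be positive")
    return [sequence[i:i + word_size]
            for i in range(len(sequence) - word_size + 1)]


def find_word_matches(query, subject, word_size):
    """Return (query_pos, subject_pos) pairs where a query word
    occurs exactly in the subject sequence."""
    # Index every subject word by its start position(s).
    index = {}
    for pos, word in enumerate(chop_into_words(subject, word_size)):
        index.setdefault(word, []).append(pos)
    # Look up each query word in that index.
    return [(q_pos, s_pos)
            for q_pos, word in enumerate(chop_into_words(query, word_size))
            for s_pos in index.get(word, [])]


print(chop_into_words("ATGCGTAC", 4))
# ['ATGC', 'TGCG', 'GCGT', 'CGTA', 'GTAC']
print(find_word_matches("ATGC", "GGATGC", 4))
# [(0, 2)]
```

A longer word size produces fewer words per query and fewer chance matches, which is why it makes the search more stringent.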
In the second phase, BLAST searches upstream and downstream of each word match in the subject sequence, attempting to find another match to form a “seed.” A “hit” is assigned when two or more words are matched within 40 bases or amino acids upstream or downstream of one another. A score is assigned to each hit based on its degree of similarity to the query. These scores are used to rank the hits from the most to the least similar to the query sequence.
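The second phase can be sketched roughly as follows. This is a simplified illustration, not BLAST's actual implementation: the function name `two_hit_seeds` is invented here, and the requirements that the two word matches lie on the same diagonal (same offset between query and subject positions) and do not overlap are assumptions borrowed from the published two-hit method rather than from the description above.

```python
from collections import defaultdict


def two_hit_seeds(word_hits, word_size, window=40):
    """Pair up word matches that could seed an alignment.

    word_hits: iterable of (query_pos, subject_pos) exact word matches.
    Returns a list of (first_hit, second_hit) pairs that sit on the same
    diagonal, do not overlap, and start within `window` positions.
    """
    # Assumption: hits are grouped by diagonal (subject_pos - query_pos).
    by_diagonal = defaultdict(list)
    for q_pos, s_pos in word_hits:
        by_diagonal[s_pos - q_pos].append((q_pos, s_pos))

    seeds = []
    for hits in by_diagonal.values():
        hits.sort()
        last = hits[0]
        for hit in hits[1:]:
            distance = hit[0] - last[0]
            if distance < word_size:
                continue  # overlaps the previous word; keep waiting
            if distance <= window:
                seeds.append((last, hit))
            last = hit
    return seeds


hits = [(0, 10), (1, 11), (2, 12), (3, 13), (60, 70)]
print(two_hit_seeds(hits, word_size=3))
# [((0, 10), (3, 13))]
```

In the example, the match at query position 60 is ignored because it lies more than 40 positions from the previous match.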
The default scoring rules for a nucleotide search are +2 for each match and -3 for each mismatch. A zero is given for each gap, but this is highly variable with each kind of BLAST. The numbers are added up and normalized as a total score. BLAST also estimates how many hits with a score at least this good would be expected by chance in a database of the same size and reports this as the E-value. The smaller the E-value, the less likely it is that the hit arose by chance. The match/mismatch scoring and gap scoring can be adjusted by the user in the algorithm parameters area of the search page; however, it is best not to change the BLAST default values until you are a BLAST power user.
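As a worked illustration of the nucleotide scoring rules, the sketch below adds up a raw score for two already-aligned sequences using the +2/-3/0 values given above. The function names are invented for this example. The `expect_value` helper uses the Karlin-Altschul form E = K·m·n·e^(−λS), which is an assumption not stated in the text; `lam` and `k` are statistical parameters that BLAST derives from the scoring scheme, so they are left for the caller to supply and no real values are implied here.

```python
import math


def raw_score(query_aln, subject_aln, match=2, mismatch=-3, gap=0):
    """Sum match/mismatch/gap scores over two aligned sequences of equal
    length, where '-' marks a gap position."""
    if len(query_aln) != len(subject_aln):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for q, s in zip(query_aln, subject_aln):
        if q == "-" or s == "-":
            score += gap
        elif q == s:
            score += match
        else:
            score += mismatch
    return score


def expect_value(score, query_len, db_len, lam, k):
    """Assumed Karlin-Altschul form: E = K * m * n * exp(-lambda * S).
    lam and k must come from the scoring system; none are built in."""
    return k * query_len * db_len * math.exp(-lam * score)


print(raw_score("ACGTACGT", "ACGTTCGT"))  # 7 matches, 1 mismatch -> 11
print(raw_score("ACG-ACGT", "ACGTACGT"))  # 7 matches, 1 gap -> 14
```

Whatever values `lam` and `k` take, a higher score always yields a smaller E-value for the same query and database sizes, which matches the ranking behavior described above.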
The scoring rules for a protein BLAST search come from a substitution matrix. The default scoring matrix is BLOSUM62. BLOSUM stands for BLOcks SUbstitution Matrix. “The 62 represents scoring values taken from a reference set of sequences that amongst themselves are 62% identical. BLOSUM80 would be 80% identical; BLOSUM45, 45% identical, and so on. BLOSUM matrices measure what stays the same over time” (Rein, 2016). A BLOSUM45 matrix helps the user find more divergent sequences, and a BLOSUM80 matrix helps find less divergent ones. Another set of matrices, named PAM (Point Accepted Mutation), measures what changes or mutates over time. Users can adjust the matrix according to the questions that they are asking.
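To show how matrix-based scoring differs from flat match/mismatch scoring, here is a small sketch that scores an ungapped protein alignment with a hand-copied excerpt of BLOSUM62 covering only five amino acids (D, I, K, L, R). The names `BLOSUM62_EXCERPT`, `pair_score`, and `matrix_score` are invented for this example; a real search uses the full 20-letter matrix.

```python
# Excerpt of BLOSUM62 for D, I, K, L, R only; keys are alphabetically
# sorted pairs so each symmetric entry is stored once.
BLOSUM62_EXCERPT = {
    ("D", "D"): 6, ("D", "I"): -3, ("D", "K"): -1, ("D", "L"): -4,
    ("D", "R"): -2, ("I", "I"): 4, ("I", "K"): -3, ("I", "L"): 2,
    ("I", "R"): -3, ("K", "K"): 5, ("K", "L"): -2, ("K", "R"): 2,
    ("L", "L"): 4, ("L", "R"): -2, ("R", "R"): 5,
}


def pair_score(a, b):
    """Look up the substitution score for two residues (order-independent)."""
    return BLOSUM62_EXCERPT[tuple(sorted((a, b)))]


def matrix_score(query_aln, subject_aln):
    """Sum substitution scores over an ungapped alignment."""
    if len(query_aln) != len(subject_aln):
        raise ValueError("aligned sequences must have the same length")
    return sum(pair_score(q, s) for q, s in zip(query_aln, subject_aln))


print(matrix_score("LIKRD", "LIKRD"))  # identical: 4+4+5+5+6 = 24
print(matrix_score("LIKRD", "IIRRD"))  # conservative swaps: 2+4+2+5+6 = 19
print(matrix_score("LIKRD", "DDDDD"))  # mostly unlike residues: -4-3-1-2+6 = -4
```

Chemically similar substitutions such as leucine for isoleucine still earn positive scores, while unlike substitutions are penalized, which is what lets a protein search detect similarity that is not identity.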
I have found that, when performing either a nucleotide or protein BLAST search, excluding the “relatively low value” (Guide, 2016) model sequences (XM/XP) or the uncultured/environmental sample sequences improves the search results and increases the number of unique hits that come from verified source sequences.
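The same exclusions can also be applied as a post-filter to results that have already been downloaded. The sketch below is one possible approach under stated assumptions: each hit is represented as an (accession, title) pair, model sequences are recognized by the XM_/XP_ accession prefixes, and environmental samples are recognized by words in the title. The function name and the example records, including their accession numbers, are invented placeholders and not real database entries.

```python
MODEL_PREFIXES = ("XM_", "XP_")  # model (predicted) mRNA and protein records
EXCLUDED_TITLE_TERMS = ("uncultured", "environmental sample")


def keep_hit(accession, title):
    """Return True unless the hit is a model sequence or an
    uncultured/environmental sample record."""
    if accession.upper().startswith(MODEL_PREFIXES):
        return False
    lowered = title.lower()
    return not any(term in lowered for term in EXCLUDED_TITLE_TERMS)


# Hypothetical hit records, made up for illustration only.
hits = [
    ("NM_000001.1", "Example species gene A, mRNA"),
    ("XM_000002.1", "PREDICTED: Example species gene A-like, mRNA"),
    ("AB000003.1", "Uncultured bacterium clone 7 partial sequence"),
]
filtered = [hit for hit in hits if keep_hit(*hit)]
print([accession for accession, _ in filtered])
# ['NM_000001.1']
```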
Adjusting the algorithm parameters and excluding the models and other sample sequences can provide an enhanced BLAST search, but whether to make these changes depends on the research question a user is asking. BLAST searching is a fast and efficient way of aligning nucleotide or protein sequences to help discover the identity and function of input sequences. As always, when using BLAST, authors must attribute the tool by citing the following paper: Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. “Basic local alignment search tool”. J. Mol. Biol. 215(3):403-410 (1990).
Rein, Diane. “Winter Session 2016 Bioinformatics Workshops: Introduction to BLAST Sequence Similarity Searching.” PowerPoint, January 5, 2016.
Guide to BLAST home and search pages. 2016. (ftp://ftp.ncbi.nlm.nih.gov/pub/factsheets/HowTo_BLASTGuide.pdf). Accessed April 14, 2016.
- Pearson WR, (2014). “BLAST and FASTA similarity searching for multiple sequence alignment”. Methods Mol Biol. 1079:75-101. PubMed ID (PMID): 24170396.
- Rein, Diane C., (2013). “A Practical Primer to BLAST Sequence Similarity Searching”. In Chemical Information for Chemists: A Primer, edited by Judith Currano and Dana Roth. Cambridge, United Kingdom: The Royal Society of Chemistry, pp. 253-297.
- Pearson WR, (2013). “An introduction to sequence similarity (“homology”) searching”. Curr Protoc Bioinformatics, Chapter 3: Unit3.1. PMID: 23749753.
Greg Nelson, Chemical & Life Science Librarian, Brigham Young University, firstname.lastname@example.org