Mathematics and computation behind BLAST and FASTA

Xuhua Xia

Why string matching? Fast computational methods in string matching
Early applications:
- Sequence similarity between an oncogene, v-sis, and the platelet-derived growth factor (PDGF). Oncogenes are genes in viruses that cause a cancer-like transformation of the infected cells. M. D. Waterfield et al Nature 304:35-39; R. F. Doolittle et al Science 221:

Implications:
- Cancer can be caused by a constitutively expressed growth factor.
- Alteration of gene expression can contribute to cancer.
- Growth factors and the like can be drug targets against cancer.

Fast computational methods in string matching:
- FASTA
- BLAST
- Local pair-wise alignment by dynamic programming

Biotech applications: transgenes used in the genetic modification of crops should first be checked to avoid inadvertent production of GM food that may lead to an allergic response. The very first check is a BLAST/FASTA search against allergen databases (FAO/WHO). Transgenic soy that had been genetically engineered to express groundnut 2S albumin was found to elicit hypersensitivity reactions in groundnut-allergic people. Details in Nordlee et al N Engl J Med 334: 688

A good entry point for bioinformatics: it helps refresh students' memory of statistical concepts.

References:
- Pearson, WR, DJ Lipman Improved tools for biological sequence comparison. PNAS 85:
- Altschul, SF, W Gish, W Miller, EW Myers, DJ Lipman Basic local alignment search tool. Journal of Molecular Biology 215:
- Altschul, SF, DJ Lipman Protein database searches for multiple alignments. PNAS 87:

FASTA A commonly used family of alignment and search tools
Generally considered to be more sensitive than BLAST. Illustration with two fictitious sequences used in the Contig Assembly lecture:
Seq1: ACCGCGATGACGAATA
Seq2: GAATACGACTGACGATGGA

String Match in FASTA
Seq1: ACCGCGATGACGAATA
Seq2: GAATACGACTGACGATGGA
[Figure: (a) site numbers 1 to 19 along the two sequences; (b) index table listing, for each of A, C, G and T, the sites where it occurs in Seq1 and Seq2; (c) table of moves of Seq1 to the left (-1 to -15) and to the right, with the number of matching sites N for each move; (d), (e) the two sequences aligned at selected moves.]
The indexing in (b) also helps with exact string matching.

Word length of 2
Seq1: ACCGCGATGACGAATA
Seq2: GAATACGACTGACGATGGA
[Figure: (a) site numbers 1 to 19 along the two sequences; (b) index table of the 16 two-letter words (AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT) in Seq1 and Seq2; (c) table of moves to the left and right with the number of matching words N for each move; (d), (e) the two sequences aligned at the move labelled "Best" and at "One of the three 2nd best".]
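The word-indexing step can be sketched in a few lines of Python. This is only an illustration of the idea on the slide, not the FASTA program itself: the function names are made up for this example, and the offset convention (position in Seq2 minus position in Seq1, so a positive offset moves Seq1 to the right) is an assumption.

```python
from collections import defaultdict

def word_index(seq, k):
    """Map every word of length k to the list of sites where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index

def offset_counts(seq1, seq2, k):
    """Count shared words for every offset (site in seq2 minus site in seq1)."""
    index2 = word_index(seq2, k)
    counts = defaultdict(int)
    for i in range(len(seq1) - k + 1):
        for j in index2.get(seq1[i:i + k], []):
            counts[j - i] += 1
    return dict(counts)

seq1 = "ACCGCGATGACGAATA"
seq2 = "GAATACGACTGACGATGGA"
counts = offset_counts(seq1, seq2, 2)          # word length of 2
best_offset = max(counts, key=counts.get)      # offset with the most shared words
second_best = sorted(d for d, c in counts.items() if c == counts[best_offset] - 1)
print(best_offset, counts[best_offset], second_best)
```

With a word length of 2, the best offset is a move of Seq1 by 2 sites to the right (5 shared words), and three other offsets tie for second place with 4 shared words each.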

Human COX1
RWLFSTNHKDIGTLYLLFGAWAGVLGTALSLLIRAELGQPGNLLGNDHIYNVIVTAHAFVMI
FFMVMPIMIGGFGNWLVPLMIGAPDMAFPRMNNMSFWLLPPSLLLLLASAMVEAGAGTGWTV
YPPLAGNYSHPGASVDLTIFSLHLAGVSSILGAINFITTIINMKPPAMTQYQTPLFVWSVLI
TAVLLLLSLPVLAAGITMLLTDRNLNTTFFDPAGGGDPILYQHLFWFFGHPEVYILILPGFG
MISHIVTYYSGKKEPFGYMGMVWAMMSIGFLGFIVWAHHMFTVGMDVDTRAYFTSATMIIAI
PTGVKVFSWLATLHGSNMKWSAAVLWALGFIFLFTVGGLTGIVLANSSLDIVLHDTYYVVAH
FHYVLSMGAVFAIMGGFIHWFPLFSGYTLDQTYAKIHFTIMFIGVNLTFFPQHFLGLSGMPR
RYSDYPDAYTTWNILSSVGSFISLTAVMLMIFMIWEAFASKRKVLMVEEPSMNLEWLYGCPP
PYHTFEEP
Leu is the most frequent amino acid in this protein; Cys is the rarest.
Exact match: all sequences in the database are pre-indexed. If a query sequence contains a C, go directly to the C at site 494 and check whether the neighbouring sites match the rest of the query; if they do not, report 'No match'.
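The pre-indexing idea can be sketched as follows. This is a minimal illustration under stated assumptions, not how any particular database is implemented: the function names are hypothetical, the target is only the last 20 residues of the COX1 sequence above, and the choice to anchor on whichever query letter is rarest in the target (first one on ties) is an assumption that generalizes the "go directly to C" step.

```python
from collections import defaultdict

def build_index(target):
    """Pre-index the target: residue -> list of 0-based sites where it occurs."""
    index = defaultdict(list)
    for pos, residue in enumerate(target):
        index[residue].append(pos)
    return index

def exact_match_positions(query, target, index):
    """Find exact matches by checking only the sites of the rarest query letter."""
    # Pick the query position whose letter has the fewest sites in the target.
    anchor = min(range(len(query)), key=lambda i: len(index.get(query[i], [])))
    hits = []
    for pos in index.get(query[anchor], []):
        start = pos - anchor
        if start >= 0 and target[start:start + len(query)] == query:
            hits.append(start)
    return hits

target = "SMNLEWLYGCPPPYHTFEEP"   # C-terminal end of the COX1 sequence
index = build_index(target)
print(exact_match_positions("YGCPP", target, index))   # one candidate site checked
print(exact_match_positions("YGCPA", target, index))   # candidate fails: no match
```

Because only the sites of the anchor letter are examined, a query containing a rare residue needs very few comparisons.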

BLAST Adapted from Crane & Raymer 2003
Motivation: matching short sequences is faster than matching longer ones.
Input sequence: AILVPTVIGCTVPT
Algorithm:
1. Break the query sequence into words: AILV, ILVP, LVPT, VPTV, PTVI, TVIG, VIGC, IGCT, GCTV, CTVP, TVPT
2. Discard common words (i.e., words made entirely of common amino acids).
3. Search for matches against database sequences, assess significance, and decide whether to discard the match or to continue with extension using dynamic programming:
Query:                 AILVPTVIGCTVPT
Subject: MVQGWALYDFLKCRAILVPTVIACTCVAMLALYDFLKC
Critical decision: discard or continue? The E-value provides the answer.
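The first and third steps can be sketched in Python. This is a simplified illustration only: the function name is hypothetical, the word length of 4 is taken from the slide, and the filtering of common words and the scoring of near-matches are left out.

```python
def query_words(query, w=4):
    """Break the query into all overlapping words of length w."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]

query = "AILVPTVIGCTVPT"
subject = "MVQGWALYDFLKCRAILVPTVIACTCVAMLALYDFLKC"

words = query_words(query)
# Exact seed hits: query words that occur somewhere in the subject sequence.
seed_hits = [(word, subject.find(word)) for word in words if word in subject]
print(words)
print(seed_hits)
```

A query of 14 letters yields 14 - 4 + 1 = 11 words; the words that hit the subject all fall in the same region, which is where extension would start.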

Basic stats in string matching
Given $P_A$, $P_C$, $P_G$, $P_T$ in a target (database) sequence, the probability of a query sequence, say, ATTGCC, having a perfect match at a given position of the target sequence is:
$prob = P_A P_T P_T P_G P_C P_C = P_A (P_C)^2 P_G (P_T)^2$
Let M be the target sequence length and N be the query sequence length. The "matching operation" can be performed (M - N + 1) times, e.g.,
Query: ATG
Target: CGATTGCCCG
Here M = 10 and N = 3, so there are 8 matching operations.
The number of matches follows (approximately) a binomial distribution with p = prob and n = (M - N + 1).
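The two quantities on this slide are easy to compute directly. The sketch below is only an illustration: the function names are made up, and equal nucleotide frequencies of 0.25 are assumed for the example.

```python
def match_probability(query, freqs):
    """Probability that the query matches perfectly at one target position."""
    prob = 1.0
    for base in query:
        prob *= freqs[base]
    return prob

def n_matching_operations(target_len, query_len):
    """Number of positions at which the query can be compared: M - N + 1."""
    return target_len - query_len + 1

freqs = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}   # assumed frequencies
p = match_probability("ATTGCC", freqs)                  # PA * PC^2 * PG * PT^2
n = n_matching_operations(len("CGATTGCCCG"), len("ATG"))
print(p, n, n * p)   # n * p is the binomial mean for these (illustrative) values
```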

Basic stats in string matching
Probability of having a sequence match: p
Probability of having no match: q = 1 - p
Binomial distribution: $P(X = k) = \binom{n}{k} p^k q^{n-k}$
When np > 50, the binomial distribution can be approximated by the normal distribution with mean = np and variance = npq.
When np < 1 and n is very large, the binomial distribution can be approximated by the Poisson distribution with mean and variance equal to np (i.e., $\mu = \sigma^2 = np$).

From Binomial to Poisson
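A quick numerical check of the approximation, as a sketch only: the values n = 10000 and p = 0.00005 (so np = 0.5) are arbitrary choices for illustration, and the function names are hypothetical.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for a binomial distribution with n trials and success probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, mu):
    """P(X = k) for a Poisson distribution with mean mu."""
    return math.exp(-mu) * mu ** k / math.factorial(k)

n, p = 10000, 0.00005          # large n, small p, np = 0.5 < 1
mu = n * p
max_diff = 0.0
for k in range(4):
    b, q = binomial_pmf(k, n, p), poisson_pmf(k, mu)
    max_diff = max(max_diff, abs(b - q))
    print(k, b, q)
```

For these values the two distributions agree closely at every k shown, which is why the Poisson form is used when matches are rare.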

Matching two sequences without gap
Assuming equal nucleotide frequencies, the probability of a nucleotide site in the query sequence matching a site in the target sequence is p = 0.25.
The probability of finding an exact match of L letters is $a = p^L = 0.25^L = 2^{-2L} = 2^{-S}$, where S is called the bit score in BLAST.
M: query length; N: target length, e.g., M = 8, N = 5, L = 3:
AACGGTTC
CGGTT
A sequence of length L can be placed at (M - L + 1) distinct sites along the query and (N - L + 1) distinct sites along the target. m = (M - L + 1) and n = (N - L + 1) are called the effective lengths of the two sequences; in the example, m = 6 and n = 3.
The expected number of matches of length L is $mn2^{-S}$, which is called the E-value in ungapped BLAST. S is calculated differently in gapped BLAST.
[Figure: CGGTT shown at successive positions along AACGGTTC.]
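The example on this slide can be worked through numerically. The sketch below assumes equal nucleotide frequencies (p = 0.25), as on the slide; the function name is hypothetical.

```python
import math

def ungapped_e_value(M, N, L, p=0.25):
    """Expected number of exact matches of length L between two sequences."""
    m = M - L + 1                 # effective length of the query
    n = N - L + 1                 # effective length of the target
    S = -L * math.log2(p)         # bit score: p**L == 2**(-S)
    return m * n * 2 ** (-S)

# M = 8 (AACGGTTC), N = 5 (CGGTT), L = 3: m = 6, n = 3, S = 6
e_value = ungapped_e_value(8, 5, 3)
print(e_value)
```

For these short sequences the expected number of chance matches of length 3 is 6 × 3 × 2^-6, i.e. well above the values seen for significant database hits.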

Blast Output (Nuc. Seq.)
    BLASTN [Aug ] ...
    Query= Seq1 38
    Database: MgCDS 480 sequences; 526,317 total letters
                                                  Score     E
    Sequences producing significant alignments:   (bits)  Value
    MG bases e-004

    Score = 34.2 bits (17), Expect = 7e-004
    Identities = 35/40 (87%), Gaps = 2/40 (5%)

    Query: 1  atgaataacg--attatttccaacgacaaaacaaaaccac 38
              ||||||||||  |||||||||||  |||||| ||||||||
    Sbjct: 1  atgaataacgttattatttccaataacaaaataaaaccac 40

    Lambda K H
    Matrix: blastn matrix:1 -3
    Gap Penalties: Existence: 5, Extension: 2
    …
    effective length of query: 26
    effective length of database: 520,557

Constant gap penalty vs affine function penalty: typically one would count only 1 GE (gap extension) here.
Matches: 35 × 1 = 35
Mismatches: 3 × (-3) = -9
Gap open: 1 × 5 = 5
Gap extension: 2 × 2 = 4
R = 35 - 9 - 5 - 4 = 17
$S = [\lambda R - \ln(K)]/\ln(2) = [1.37 \times 17 - \ln(0.711)]/\ln(2) = 34$
$E = mn2^{-S} = 26 \times 520{,}557 \times 2^{-34} = 7.878 \times 10^{-4}$
Alternatively, $E = Kmn\,e^{-\lambda R}$.
Keep in mind that E-values computed with gapped matches are generally overestimates, i.e., if a reported E-value is 0.1, the real E-value tends to be smaller.
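The arithmetic behind the reported score and E-value can be reproduced with a short script. This is a sketch using the numbers shown in the output above (λ = 1.37, K = 0.711, effective lengths 26 and 520,557); the variable names are chosen for this illustration.

```python
import math

# Counts read from the alignment; scores and penalties from the BLAST output.
matches, mismatches, gap_opens, gap_extensions = 35, 3, 1, 2
R = matches * 1 + mismatches * (-3) - gap_opens * 5 - gap_extensions * 2  # raw score

lam, K = 1.37, 0.711
S = (lam * R - math.log(K)) / math.log(2)    # bit score

m, n = 26, 520557                            # effective lengths of query and database
E_from_rounded_S = m * n * 2 ** (-round(S))  # uses S rounded to 34, as on the slide
E_from_raw_score = K * m * n * math.exp(-lam * R)

print(R, S, E_from_rounded_S, E_from_raw_score)
```

The two E-values differ slightly only because the first uses the bit score rounded to a whole number; with the unrounded S, mn2^-S and Kmn·exp(-λR) are the same quantity.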

E-Value in BLAST
The E-value is the expected number of random matches that are as good as or better than the reported match. It can be a number near zero or much larger than 1. It is NOT the probability of finding the reported match. Only when the E-value is extremely small can it be interpreted as the probability of finding 1 match that is as good as the reported one.

E-value and P(1)
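The relationship between the E-value and P(1) can be illustrated numerically. The sketch below assumes that the number of random matches follows a Poisson distribution with mean E, as in the Poisson approximation discussed earlier; the function names are hypothetical.

```python
import math

def p_exactly_one(e_value):
    """Poisson probability of exactly one random match when the mean is E."""
    return e_value * math.exp(-e_value)

def p_at_least_one(e_value):
    """Poisson probability of one or more random matches when the mean is E."""
    return 1.0 - math.exp(-e_value)

for e_value in (1e-4, 0.01, 0.1, 1.0, 5.0):
    print(e_value, p_exactly_one(e_value), p_at_least_one(e_value))
```

For very small E the probabilities are nearly equal to E itself, while for E near or above 1 they diverge from it, which is why a large E-value cannot be read as a probability.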

BLAST Programs
Program (database, query): typical uses
- BLASTN/MEGABLAST (nucleotide database, nucleotide query): MEGABLAST has a longer word size than BLASTN.
- BLASTP (protein database, protein query): query a protein/peptide against a protein database.
- BLASTX (protein database, nucleotide query): translate a nucleotide sequence into a "protein" in six frames and search against a protein database.
- TBLASTN (nucleotide database, protein query): unannotated nucleotide sequences (e.g., ESTs) are translated in six frames, against which the query protein is searched.
- TBLASTX (nucleotide database, nucleotide query): 6-frame translation of both query and database.
- PHI-BLAST: Pattern-hit initiated BLAST
- PSI-BLAST: Position-specific iterated BLAST
- RPS-BLAST: Reverse PSI-BLAST
The last three are related to the position weight matrix (PWM).

BLASTN is very different from the other programs because it compares nucleotides to one another; all the other programs compare proteins or translations. One difference is in the seeding procedure. BLASTN is more complicated than BLASTP because the complementary strand needs to be searched, too.
The two most similar programs are BLASTX and TBLASTN. In these, one sequence is a protein and the other is a translation.
TBLASTX is the most computationally intensive program because it translates each sequence in 6 frames. Its output is difficult to interpret because much of the time the sequences are anonymous, so you have no idea what a similarity might mean. It is also difficult to run properly because the standard scoring matrices are flawed.
PHI-BLAST: 1. Does the query contain a particular pattern? 2. What sequences containing the pattern are also similar to the query? 3. PWM.
PSI-BLAST: 1. BLASTP. 2. PWM. 3. Significant? 4. Scan for more. 5. New PWM, and so on.
RPS-BLAST: query protein against a database of PWMs.

Comparison: BLAST and FASTA
BLAST starts with exact string matching, while FASTA starts with inexact string matching (or exact string matching with shorter words).
BLAST is faster than FASTA.
For the examples given, both BLAST and FASTA will find the same best match, i.e., shifting the query sequence by 2 sites to the right.
Both perform dynamic programming to extend the match after the initial match.

Optional: BLAST Parameters
Lambda  and Karlin-Altschul (K) parameters are important because they directly affect the computation of E value. Both  and K depend on nucleotide (or aminon acid) frequencies match-mismatch matrix All BLAST implementations generally assume that nucleotide (or amino acid) sequences have roughly equal frequencies. For nucleotide (or amino acid) sequences with strongly biased frequencies, BLAST E value obtained with the assumption can be quite misleading, i.e., one should use appropriate  and K.

Lambda () and K BLAST output includes lambda () and K. Mathematically,  is defined as follows: where pi, pj are nucleotide frequencies (i,j = A, C, G, or T), and sij is the match (when i = j) or mismatch (when i  j) score. In nucleotide BLAST by default, we have sii = 1 and sij = -3. In the simplest case with equal nucleotide frequencies, i.e., when pi = 0.25, the equation above is reduced to For nuc sequence: Expand the summation and you will get 16 terms, of which 4 terms involve the S_ii (same nucleotides) and 12 involves S_ij (different nucleotides) Now insert different  values to the equation above to find which  balances the equation (not the trivial solution of  = 0) (for amino acid sequences) See the updated Chapter 1 and BLASTParameter.xlsm on how to compute K.

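λ implies nucleotide frequencies
[Figure: panels (a) to (c) and (a') to (c') compare equal nucleotide frequencies (0.25 for each of A, G, C and T, pairwise products 0.0625) with strongly biased frequencies (0.49 and 0.01, pairwise products 0.2401, 0.0049 and 0.0001), both with the match-mismatch matrix (1, -3).]
BLAST parameters λ, K and H are computed for each BLAST database created.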

Finding  III: Different , s/v

Finding K: equal , (1, -3) Double-click it, copy to EXCEL and find  by using solver. Slide 21
