We continue where we stopped last week: FASTA – BLAST
1 We continue where we stopped last week: FASTA – BLAST
Alignment Class III
2 FASTA
FastA uses the method of Pearson and Lipman (1985).
It is based on the idea of identifying short words (k-tuples) common to both sequences.
K-tuples of 1-2 aa are used in protein searches and up to 6 bases in DNA searches.
It uses a heuristic approach to join k-tuples that lie close together on the same diagonal.
If a significant number of matches is found, FastA uses a dynamic programming algorithm to compute gapped alignments that incorporate the ungapped regions.
An E-value approaching zero indicates that no match with a similar score is expected by chance.
3 FASTA - Stages
Find k-tups in the two sequences (k = 1-2 for proteins, 4-6 for DNA sequences).
Score and select the top 10 scoring “local diagonals”:
For proteins, each k-tup found is scored using the PAM250 matrix.
For DNA, the score is the number of k-tups found.
Penalize intervening gaps.
4 Finding k-tups
position    1  2  3  4  5  6  7  8  9  10  11
protein 1   n  c  s  p  t  a
protein 2                  a  c  s  p  r   k
amino acid   position in protein A   position in protein B   offset (pos A - pos B)
a            6                       6                       0
c            2                       7                       -5
k            -                       11
n            1                       -
p            4                       9                       -5
r            -                       10
s            3                       8                       -5
t            5                       -
Note the common offset for the 3 amino acids c, s and p. A possible alignment is thus quickly found:
protein 1   n c s p t a
              | | |
protein 2   a c s p r k
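The offset bookkeeping on this slide can be sketched in a few lines of Python. This is only an illustration, not FASTA's actual code: the function name `ktup_offsets` and its parameters are made up for this example, and it assumes protein 2 starts at position 6 of the 11-position ruler.

```python
from collections import Counter, defaultdict

def ktup_offsets(seq_a, seq_b, k=1, start_a=1, start_b=1):
    """Count shared k-tuples per diagonal offset (pos A - pos B).

    start_a / start_b are the ruler positions of the first residue of each
    sequence (hypothetical parameters, used only to mimic the slide layout).
    """
    # Lookup table: every k-tuple of B -> list of its positions.
    positions_b = defaultdict(list)
    for j in range(len(seq_b) - k + 1):
        positions_b[seq_b[j:j + k]].append(j + start_b)

    # For each k-tuple of A that also occurs in B, record the offset.
    offsets = Counter()
    for i in range(len(seq_a) - k + 1):
        for pos_b in positions_b.get(seq_a[i:i + k], []):
            offsets[(i + start_a) - pos_b] += 1
    return offsets

# Assumption: protein 1 occupies positions 1-6, protein 2 positions 6-11.
offsets = ktup_offsets("ncspta", "acsprk", k=1, start_a=1, start_b=6)
best_offset, hits = offsets.most_common(1)[0]
print(best_offset, hits)  # -5 3  (c, s and p share the same diagonal)
```

The diagonal with the most hits is the candidate region that the later stages rescan and score.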
6 FASTA - Stages
Rescan the top 10 regions and score them with PAM250 (proteins) or a DNA scoring matrix. Trim off the ends of the regions to achieve the highest scores.
Try to join regions with gapped alignments. Join if the similarity score is one standard deviation above the average expected score.
After finding the best initial region, FASTA performs a global alignment of a 32-residue-wide region centered on the best initial region, and uses the score as the optimized score.
7 FASTA - E Scores
In evaluating the E scores, the following rules of thumb can be used:
For searches of database, sequences with E less than are almost always found to be homologous.
Sequences with E between 1 and 10 frequently turn out to be related as well.
8 BLAST
Basic Local Alignment Search Tool
Altschul et al. 1990, 1994, 1997
Heuristic method for local alignment
Designed specifically for database searches
Idea: Good alignments contain short lengths of exact matches
9 Blast Application
Blast is a family of programs: BlastN, BlastP, BlastX, tBlastN, tBlastX
BlastN - nt versus nt database
BlastP - protein versus protein database
BlastX - translated nt versus protein database
tBlastN - protein versus translated nt database
tBlastX - translated nt versus translated nt database
Query: DNA or Protein
Database: DNA or Protein
10 Ungapped Blast Algorithm
Given two sequences, a segment pair is defined as a pair of sub-sequences of the same length that form an ungapped alignment.
Blast calculates all segment pairs between the query and the database sequences that score above a threshold.
The algorithm searches for fixed-length hits, which are extended until certain threshold parameters are achieved.
The resulting high-scoring pairs (HSPs) form the basis of the ungapped alignments.
11 Mathematical Basis of BLAST
Model matches as a sequence of coin tosses.
Let p be the probability of a “head”. For a “fair” coin, p = 0.5.
(Erdős-Rényi) If there are n throws, then the expected length R of the longest run of heads is
$R = \log_{1/p}(n)$
Example: Suppose n = 20 for a “fair” coin: $R = \log_2(20) = 4.32$
The trick is how to model DNA (or amino acid) sequence alignments as coin tosses.
12 Mathematical Basis of BLAST
To model random sequence alignments, replace a match with a “head” and a mismatch with a “tail”.
For DNA, the probability of a “head” is 1/4.
The same logic applies to amino acids.
AATCAT
ATTCAG
HTHHHT
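As a small illustration of the coin-toss encoding, the sketch below turns an ungapped alignment into a string of heads and tails and reports the longest run of heads. The function names are invented for this example.

```python
from itertools import groupby

def coin_tosses(seq_a, seq_b):
    """Encode an ungapped alignment: match -> 'H', mismatch -> 'T'."""
    return "".join("H" if a == b else "T" for a, b in zip(seq_a, seq_b))

def longest_run_of_heads(tosses):
    """Length of the longest block of consecutive 'H' characters."""
    return max((len(list(group)) for key, group in groupby(tosses) if key == "H"),
               default=0)

tosses = coin_tosses("AATCAT", "ATTCAG")
print(tosses)                        # HTHHHT
print(longest_run_of_heads(tosses))  # 3
```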
13 Mathematical Basis of BLAST
So, for one particular alignment, the Erdős-Rényi property can be applied.
What about all possible alignments?
Consider that the sequences are being shifted back and forth, as in a dot matrix plot.
The expected length of the longest match is
$R = \log_{1/p}(mn)$
where m and n are the lengths of the two sequences.
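Both run-length formulas are easy to evaluate numerically. The sketch below is just a calculator for them; the function names and the 100 x 100 DNA example are assumptions chosen for illustration.

```python
import math

def expected_longest_run(n, p):
    """R = log_{1/p}(n): expected longest run of heads in n tosses."""
    return math.log(n, 1 / p)

def expected_longest_match(m, n, p):
    """R = log_{1/p}(m * n): expected longest match over all shifts
    of two sequences of lengths m and n."""
    return math.log(m * n, 1 / p)

# Fair coin, 20 throws.
print(round(expected_longest_run(20, 0.5), 2))  # 4.32

# Hypothetical example: two random DNA sequences of length 100, p = 1/4.
r_dna = expected_longest_match(100, 100, 0.25)
print(round(r_dna, 2))
```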
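14 Steps of BLAST
Filter out low-complexity regions using the complexity measure
$K = \frac{1}{L} \log_N \left( \frac{L!}{\prod_i n_i!} \right)$
where L is the length, N is the alphabet size, and $n_i$ is the number of times letter i appears in the sequence.
Example: AAAT
$K = \frac{1}{4} \log_4 \left( \frac{4!}{3! \cdot 1! \cdot 0! \cdot 0!} \right) = \frac{1}{4} \log_4 4 = 0.25$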
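The AAAT example can be checked with a short script. This is a sketch of the complexity measure only, assuming the form K = (1/L) log_N(L! / prod n_i!) implied by the worked example; it is not the filter program BLAST actually ships with, and the function name is made up.

```python
import math
from collections import Counter

def complexity(window, alphabet_size=4):
    """K = (1/L) * log_N( L! / (n_1! * n_2! * ...) ) for one window."""
    length = len(window)
    arrangements = math.factorial(length)
    # Letters that do not occur contribute 0! = 1, so they can be skipped.
    for count in Counter(window).values():
        arrangements //= math.factorial(count)
    return math.log(arrangements, alphabet_size) / length

print(complexity("AAAT"))  # 0.25
print(complexity("AAAA"))  # 0.0  (lowest possible complexity)
```

A window whose K falls below some chosen cutoff would be masked before the search; the cutoff itself is not given on the slide.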
15 Steps of BLAST
Query words of length 3 (for proteins) or 11 (for DNA) are created from the query sequence using a sliding window.
Query: MEFPGLGSLGTSEPLPQFVDPALVSS
Words: MEF EFP FPG PGL GLG
The values 90 and 64 can easily be obtained from the expected run length formula shown earlier.
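The sliding window is a one-liner in Python. The helper name `query_words` is an assumption for this sketch; the query is the one from the slide.

```python
def query_words(query, w=3):
    """All overlapping words of length w, taken with a sliding window."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]

words = query_words("MEFPGLGSLGTSEPLPQFVDPALVSS", w=3)
print(words[:5])   # ['MEF', 'EFP', 'FPG', 'PGL', 'GLG']
print(len(words))  # 24 words for a 26-residue query
```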
16 Steps of BLAST
Using BLOSUM62 (for proteins) or scores of +5/-4 (DNA, PAM40), score all possible words of length 3 or 11 respectively against a query word.
Select a neighborhood word score threshold (T) so that only the most significant words are kept. Approximately 50 hits per query word.
Repeat 3 and 4 for each query word in step 2. The total number of high-scoring words is approximately 50 * sequence length.
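A toy version of the neighborhood step is sketched below. To keep the enumeration small it uses DNA words of length 3 with the +5/-4 scores from the slide, rather than length 11 or BLOSUM62; the word, the threshold T = 6 and the function name are assumptions made for illustration.

```python
from itertools import product

def neighborhood(word, threshold, alphabet="ACGT", match=5, mismatch=-4):
    """Enumerate every word of the same length and keep those scoring
    at least `threshold` against `word` (toy match/mismatch scoring)."""
    hits = []
    for candidate in product(alphabet, repeat=len(word)):
        score = sum(match if a == b else mismatch
                    for a, b in zip(word, candidate))
        if score >= threshold:
            hits.append(("".join(candidate), score))
    # Highest-scoring neighbors first.
    return sorted(hits, key=lambda hit: -hit[1])

neighbors = neighborhood("ACG", threshold=6)
print(neighbors[0])    # ('ACG', 15): the exact word scores highest
print(len(neighbors))  # 10: the word itself plus 9 single-mismatch words
```

Raising T shrinks the neighborhood (faster, less sensitive); lowering it does the opposite.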
17 Steps of BLAST
Organize the high-scoring words into a search tree.
Scan each database sequence for matches to the high-scoring words. Each match is a seed for an ungapped alignment.
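The lookup-and-scan step might be sketched as follows. Two simplifications are assumed here: a Python dictionary stands in for the search tree, and only the exact query words are indexed, not their full neighborhoods. The sequences and function names are made up for the example.

```python
from collections import defaultdict

def build_index(query, w=3):
    """Map each query word to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    return index

def find_seeds(index, subject, w=3):
    """Scan a database sequence; each word hit yields a seed (query_pos, subject_pos)."""
    seeds = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):
            seeds.append((i, j))
    return seeds

index = build_index("MEFPGLGSLG")
seeds = find_seeds(index, "AAPGLGSAA")
print(seeds)  # [(3, 2), (4, 3), (5, 4)]: three hits on the same diagonal
```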
18 Steps of BLAST
(Original BLAST) Extend matching words to the left and right using ungapped alignments. Extension continues as long as the score increases or stays the same. The result is an HSP (high-scoring pair).
(BLAST2) Matches along the same diagonal within a distance A of each other are joined, and the longer sequence is then extended as before.
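The original-BLAST extension rule can be sketched as below. It follows the simplified stopping rule stated on the slide (stop as soon as the next pair would lower the score) and uses toy +5/-4 scoring; the sequences, the seed position and the function name are assumptions for illustration.

```python
def extend_seed(query, subject, i, j, w=3, match=5, mismatch=-4):
    """Extend a word hit at query[i:i+w] / subject[j:j+w] in both directions
    without gaps, stopping when the next pair would decrease the score."""
    def pair_score(a, b):
        return match if a == b else mismatch

    score = sum(pair_score(query[i + k], subject[j + k]) for k in range(w))
    left_q, left_s = i, j
    right_q, right_s = i + w, j + w

    # Extend to the right.
    while (right_q < len(query) and right_s < len(subject)
           and pair_score(query[right_q], subject[right_s]) >= 0):
        score += pair_score(query[right_q], subject[right_s])
        right_q += 1
        right_s += 1

    # Extend to the left.
    while (left_q > 0 and left_s > 0
           and pair_score(query[left_q - 1], subject[left_s - 1]) >= 0):
        score += pair_score(query[left_q - 1], subject[left_s - 1])
        left_q -= 1
        left_s -= 1

    return query[left_q:right_q], subject[left_s:right_s], score

hsp = extend_seed("MEFPGLGSLG", "AAPGLGSAA", i=4, j=3)
print(hsp)  # ('PGLGS', 'PGLGS', 25)
```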
19 Steps of BLAST
Using a cutoff score S, keep only the extended matches that have a score of at least S.
Determine the statistical significance of each remaining match (from last time).
Try to extend the HSPs if possible.
Show Smith-Waterman local alignments.
20 Shannon Entropy and Information
Information theory
21 Entropy
X: discrete Random Variable (RV), p(X)
Entropy (or self-information): $H(X) = -\sum_x p(x) \log_2 p(x)$
Entropy measures the amount of information in an RV: it is the minimum average number of bits per outcome achievable with an optimal code.
22 Entropy (cont)
H(X) = 0 only when the value of X is determinate, hence providing no new information.
H is a weighted average of $-\log p(x)$, where the weighting depends on the probability of each x.
H increases with message length.
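A minimal sketch of the entropy calculation, assuming the standard definition H(X) = sum of p(x) log2(1/p(x)) with logarithms in base 2 (so the unit is bits). The function name is chosen for this example.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; outcomes with p = 0 contribute nothing."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

print(entropy([1.0]))                     # 0.0  determinate value, no information
print(entropy([0.5, 0.5]))                # 1.0  one fair coin toss
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0  e.g. one uniformly random DNA base
```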
23 Joint Entropy
The joint entropy of 2 RVs X, Y is the amount of information needed on average to specify both their values.
24 Conditional Entropy
The conditional entropy of an RV Y given another RV X expresses how much extra information one still needs to supply on average to communicate Y, given that the other party knows X.
26 Mutual Information
I(X,Y) is the mutual information between X and Y. It is the reduction of uncertainty of one RV due to knowing about the other, or the amount of information one RV contains about the other.
27 Mutual Information (cont)
I is 0 only when X, Y are independent: H(X|Y) = H(X).
H(X) = H(X) - H(X|X) = I(X,X): entropy is the self-information.
For 2 dependent variables, I grows not only with the degree of their dependence but also with their entropy.
This also explains why the mutual information between 2 totally dependent variables is not constant but depends on their entropy.
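These quantities can be computed from a joint probability table. The sketch below assumes the standard identities H(Y|X) = H(X,Y) - H(X) and I(X;Y) = H(X) + H(Y) - H(X,Y); the function names and the three example tables are made up for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy in bits."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

def information_summary(joint):
    """joint[i][j] = P(X = i, Y = j). Returns the entropies and I(X;Y)."""
    p_x = [sum(row) for row in joint]        # marginal of X (row sums)
    p_y = [sum(col) for col in zip(*joint)]  # marginal of Y (column sums)
    h_x, h_y = entropy(p_x), entropy(p_y)
    h_xy = entropy([p for row in joint for p in row])
    return {
        "H(X)": h_x,
        "H(Y)": h_y,
        "H(X,Y)": h_xy,
        "H(Y|X)": h_xy - h_x,
        "I(X;Y)": h_x + h_y - h_xy,
    }

independent = information_summary([[0.25, 0.25], [0.25, 0.25]])
dependent_fair = information_summary([[0.5, 0.0], [0.0, 0.5]])
dependent_skewed = information_summary([[0.9, 0.0], [0.0, 0.1]])

print(independent["I(X;Y)"])       # 0.0
print(dependent_fair["I(X;Y)"])    # 1.0, equal to H(X)
print(dependent_skewed["I(X;Y)"])  # smaller, because H(X) is smaller
```

The last two cases are both totally dependent, yet their mutual information differs because their entropies differ.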
28 Kullback-Leibler Divergence
Relative entropy or KL (Kullback-Leibler) divergence
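The slide's formula did not survive in the transcript, so the sketch below assumes the standard definition D(p || q) = sum of p(x) log2(p(x) / q(x)), with q(x) > 0 wherever p(x) > 0. The function name and the example distributions are made up.

```python
import math

def kl_divergence(p, q):
    """Relative entropy D(p || q) in bits.
    Assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.25, 0.75]
print(kl_divergence(p, p))  # 0.0: a distribution does not diverge from itself
print(kl_divergence(p, q))
print(kl_divergence(q, p))  # a different value: the divergence is not symmetric
```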
30 Scoring Matrices Types
Identity matrix – exact matches receive one score and non-exact matches a different score (say 1 and 0, or 6 and –1 for local alignment).
Mutation data matrix – a scoring matrix compiled from observed protein point mutations (PAM, BLOSUM).
Physical properties matrix – amino acids with similar properties (e.g. hydrophobicity) receive a high score.
Genetic code matrix – amino acids are scored based on similarities in their coding triplets (codons).
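The simplest of these, the identity matrix, can be sketched as a nested dictionary. The helper names and the two DNA sequences are assumptions for this example; the score pairs 1/0 and 6/-1 are the ones mentioned on the slide.

```python
def identity_matrix(alphabet, match=1, mismatch=0):
    """Scoring matrix with one score for matches and another for mismatches."""
    return {a: {b: (match if a == b else mismatch) for b in alphabet}
            for a in alphabet}

def score_ungapped(seq_a, seq_b, matrix):
    """Sum the matrix scores over the aligned pairs of two equal-length sequences."""
    return sum(matrix[a][b] for a, b in zip(seq_a, seq_b))

dna = "ACGT"
print(score_ungapped("AATCAT", "ATTCAG", identity_matrix(dna)))         # 4
print(score_ungapped("AATCAT", "ATTCAG", identity_matrix(dna, 6, -1)))  # 22
```

The other matrix types plug into the same `score_ungapped` function; only the table of values changes.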
31 Substitution Matrix
Amino acids substitute easily for one another when they have similar physicochemical properties:
Isoleucine for Valine (both small, hydrophobic)
Serine for Threonine (both polar)
Such changes are called “conservative”.
Thus, we need a way to increase the sensitivity of the alignment algorithm.
Solution: a substitution matrix.
Therefore, we need a range of values that depend on the nature of the sequences being compared:
Identical amino acids > Conservative substitutions > Nonconservative substitutions
32 Choice of scoring matrix is dictated by the alignment goals
Two proteins are homologous if (and only if) they are evolutionarily related (have a common ancestor).
Homologous proteins are likely to have related functions (and have the same fold).
Scoring matrices must in some way model our understanding of protein evolution.
Based on the result of the search, we have to be able to decide whether the discovered sequence similarity could happen by chance or is a signature of likely homology.