2 Pairwise alignment: protein sequences can be more informative than DNA • protein is more informative (20 vs 4 characters);many amino acids share related biophysical properties• codons are degenerate: changes in the third positionoften do not alter the amino acid that is specified• protein sequences offer a longer “look-back” timeDNA sequences can be translated into protein,and then used in pairwise alignments
3 Pairwise alignment: protein sequences can be more informative than DNA Many times, DNA alignments are appropriate--to confirm the identity of a cDNA--to study noncoding regions of DNA--to study DNA polymorphisms--example: Neanderthal vs modern human DNAQuery: 181 catcaactacaactccaaagacacccttacacccactaggatatcaacaaacctacccac 240|||||||| |||| |||||| ||||| | |||||||||||||||||||||||||||||||Sbjct: 189 catcaactgcaaccccaaagccacccct-cacccactaggatatcaacaaacctacccac 247
4 Definitions Similarity Identity Conservation The extent to which nucleotide or protein sequences are related. It is based upon identity plus conservation.IdentityThe extent to which two sequences are invariant.ConservationChanges at a specific position of an amino acid or (less commonly, DNA) sequence that preserve the physico-chemical properties of the original residue.
6 Gaps • Positions at which a letter is paired with a null are called gaps.• Gap scores are typically negative.• Since a single mutational event may cause the insertionor deletion of more than one residue, the presence ofa gap is ascribed more significance than the lengthof the gap. Thus there are separate penalties for gap creation and gap extension.• In BLAST, it is rarely necessary to change gap valuesfrom the default.
7 Outline: pairwise alignment Overview and examplesDefinitions: homologs, paralogs, orthologsAssigning scores to aligned amino acids:Dayhoff’s PAM matricesAlignment algorithms: Needleman-Wunsch,Smith-WatermanStatistical significance of pairwise alignments
8 General approach to pairwise alignment Choose two sequencesSelect an algorithm that generates a scoreAllow gaps (insertions, deletions)Score reflects degree of similarityAlignments can be global or localEstimate probability that the alignmentoccurred by chance
13 Dayhoff’s numbers of “accepted point mutations”: what amino acid substitutions occur in proteins?Page 52
14 Dayhoff’s mutation probability matrix for the evolutionary distance of 1 PAMWe have considered three kinds of information:a table of number of accepted point mutations (PAMs)relative mutabilities of the amino acidsnormalized frequencies of the amino acids in PAM dataThis information can be combined into a “mutation probability matrix” in which each element Mij gives the probability that the amino acid in column j will be replaced by the amino acid in row i after a given evolutionary interval (e.g. 1 PAM).Page 50
15 Dayhoff’s PAM1 mutation probability matrix Original amino acidPage 55
16 Dayhoff’s PAM1 mutation probability matrix Each element of the matrix shows the probability that an originalamino acid (top) will be replaced by another amino acid (side)
17 Substitution Matrix A substitution matrix contains values proportional to the probability that amino acid i mutates intoamino acid j for all pairs of amino acids.Substitution matrices are constructed by assemblinga large and diverse sample of verified pairwise alignments(or multiple sequence alignments) of amino acids.Substitution matrices should reflect the true probabilitiesof mutations occurring through a period of evolution.The two major types of substitution matrices arePAM and BLOSUM.
18 Point-accepted mutations PAM matrices:Point-accepted mutationsPAM matrices are based on global alignmentsof closely related proteins.The PAM1 is the matrix calculated from comparisonsof sequences with no more than 1% divergence. At an evolutionary interval of PAM1, one change has occurred over a length of 100 amino acids.Other PAM matrices are extrapolated from PAM1. For PAM250, 250 changes have occurred for two proteins over a length of 100 amino acids.All the PAM data come from closely related proteins(>85% amino acid identity).
19 Why do we go from a mutation probability matrix to a log odds matrix? We want a scoring matrix so that when we do a pairwisealignment (or a BLAST search) we know what score toassign to two aligned amino acid residues.Logarithms are easier to use for a scoring system. Theyallow us to sum the scores of aligned residues (ratherthan having to multiply them).Page 57
20 How do we go from a mutation probability matrix to a log odds matrix? The cells in a log odds matrix consist of an “odds ratio”:the probability that an alignment is authenticthe probability that the alignment was randomThe score S for an alignment of residues a,b is given by:S(a,b) = 10 log10 (Mab/pb)As an example, for tryptophan,S(a,tryptophan) = 10 log10 (0.55/0.010) = 17.4Page 57
22 What do the numbers mean in a log odds matrix?S(a,tryptophan) = 10 log10 (0.55/0.010) = 17.4A score of +17 for tryptophan means that this alignmentis 50 times more likely than a chance alignment of twoTrp residues.S(a,b) = 17Probability of replacement (Mab/pb) = xThen17 = 10 log10 x1.7 = log10 x101.7 = x = 50Page 58
23 What do the numbers mean in a log odds matrix?A score of +2 indicates that the amino acid replacementoccurs 1.6 times as frequently as expected by chance.A score of 0 is neutral.A score of –10 indicates that the correspondence of twoamino acids in an alignment that accurately representshomology (evolutionary descent) is one tenth as frequentas the chance alignment of these amino acids.Page 58
26 More conservedLess conservedRat versusmouse RBPRat versusbacteriallipocalin
27 Comparing two proteins with a PAM1 matrix gives completely different results than PAM250!Consider two distantly related proteins. A PAM40 matrixis not forgiving of mismatches, and penalizes themseverely. Using this matrix you can find almost no match.hsrbp, 136 CRLLNLDGTCbtlact, 3 CLLLALALTC* ** * **A PAM250 matrix is very tolerant of mismatches.24.7% identity in 81 residues overlap; Score: 77.0; Gap frequency: 3.7%rbp4 26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVbtlact 21 QTMKGLDIQKVAGTWYSLAMAASD-ISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWEN* **** * * * * ** *rbp CADMVGTFTDTEDPAKFKMbtlact 80 GECAQKKIIAEKTKIPAVFKI** * ** **Page 60
28 two nearly identical proteins two distantlyrelated proteins
29 BLOSUM Matrices BLOSUM matrices are based on local alignments. BLOSUM stands for blocks substitution matrix.BLOSUM62 is a matrix calculated from comparisons ofsequences with no less than 62% divergence.Page 60
32 BLOSUM Matrices All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons ofclosely related proteins.The BLOCKS database contains thousands of groups ofmultiple sequence alignments.BLOSUM62 is the default matrix in BLAST 2.0.Though it is tailored for comparisons of moderately distantproteins, it performs well in detecting closer relationships.A search for distant relatives may be more sensitivewith a different matrix.Page 60
35 Rat versusmouse globinRat versusbacterialglobinPage 61
36 Two randomly diverging protein sequences change in a negatively exponential fashionPercent identity“twilight zone”Evolutionary distance in PAMsPage 62
37 At PAM1, two proteins are 99% identical At PAM10.7, there are 10 differences per 100 residuesAt PAM80, there are 50 differences per 100 residuesAt PAM250, there are 80 differences per 100 residuesPercent identity“twilight zone”Differences per 100 residuesPage 62
38 PAM: “Accepted point mutation” Two proteins with 50% identity may have 80 changesper 100 residues. (Why? Because any residue can besubject to back mutations.)Proteins with 20% to 25% identity are in the “twilight zone”and may be statistically significantly related.PAM or “accepted point mutation” refers to the “hits” ormatches between two sequences (Dayhoff & Eck, 1968)Page 62
39 Ancestral sequence Sequence 1 ACCGATC Sequence 2 AATAATC ACCCTAC A no change AC single substitution C --> AC multiple substitutions C --> A --> TC --> G coincidental substitutions C --> AT --> A parallel substitutions T --> AA --> C --> T convergent substitutions A --> TC back substitution C --> T --> CSequence 1ACCGATCSequence 2AATAATCLi (1997) p.70
40 Percent identity between two proteins: What percent is significant? 100%80%65%30%23%19%We will see in the BLAST lecture that it is appropriate to describe significance in terms of probability (or expect) values.As a rule of thumb, two proteins sharing > 30% over a substantial region are usually homologous.
41 Outline: pairwise alignment Overview and examplesDefinitions: homologs, paralogs, orthologsAssigning scores to aligned amino acids:Dayhoff’s PAM matricesAlignment algorithms: Needleman-Wunsch,Smith-WatermanStatistical significance of pairwise alignments
42 An alignment scoring system is required to evaluate how good an alignment is• positive and negative values assigned• gap creation and extension penalties• positive score for identities• some partial positive score forconservative substitutions• global versus local alignment• use of a substitution matrix
44 Two kinds of sequence alignment: global and localWe will first consider the global alignment algorithmof Needleman and Wunsch (1970).We will then explore the local alignment algorithmof Smith and Waterman (1981).Finally, we will consider BLAST, a heuristic versionof Smith-Waterman. We will cover BLAST in detailon Monday.Page 63
45 Global alignment with the algorithm of Needleman and Wunsch (1970) • Two sequences can be compared in a matrixalong x- and y-axes.• If they are identical, a path along a diagonalcan be drawn• Find the optimal subpaths, and add them up to achievethe best score. This involves--adding gaps when needed--allowing for conservative substitutions--choosing a scoring system (simple or complicated)N-W is guaranteed to find optimal alignment(s)Page 63
46 Three steps to global alignment with the Needleman-Wunsch algorithm  set up a matrix score the matrix identify the optimal alignment(s)Page 63
47 Four possible outcomes in aligning two sequences 12 identity (stay along a diagonal) mismatch (stay along a diagonal) gap in one sequence (move vertically!) gap in the other sequence (move horizontally!)Page 64
52 Needleman-Wunsch: dynamic programming N-W is guaranteed to find optimal alignments, althoughthe algorithm does not search all possible alignments.It is an example of a dynamic programming algorithm:an optimal path (alignment) is identified byincrementally extending optimal subpaths.Thus, a series of decisions is made at each step of thealignment to find the pair of residues with the best score.Page 67
53 Try using needle to implement a Needleman-Wunsch global alignment algorithm to find the optimum alignment (including gaps):
56 Global alignment versus local alignment Global alignment (Needleman-Wunsch) extendsfrom one end of each sequence to the other.Local alignment finds optimally matchingregions within two sequences (“subsequences”).Local alignment is almost always used for databasesearches such as BLAST. It is useful to find domains(or limited regions of homology) within sequences.Smith and Waterman (1981) solved the problem ofperforming optimal local sequence alignment. Othermethods (BLAST, FASTA) are faster but less thorough.Page 69
57 Rapid, heuristic versions of Smith-Waterman: FASTA and BLASTSmith-Waterman is very rigorous and it is guaranteedto find an optimal alignment.But Smith-Waterman is slow. It requires computerspace and time proportional to the product of the twosequences being aligned (or the product of a queryagainst an entire database).Gotoh (1982) and Myers and Miller (1988) improved thealgorithms so both global and local alignment requireless time and space.FASTA and BLAST provide rapid alternatives to S-W.Page 71
58 How FASTA works  A “lookup table” is created. It consists of short stretches of amino acids (e.g. k=3 for a protein search).The length of a stretch is called a k-tuple. The FASTAalgorithm finds the ten highest scoring segmentsthat align to the query. These ten aligned regions are re-scored with aPAM or BLOSUM matrix. High-scoring segments are joined. Needleman-Wunsch or Smith-Waterman is then performed.Page 72
59 Outline: pairwise alignment Overview and examplesDefinitions: homologs, paralogs, orthologsAssigning scores to aligned amino acids:Dayhoff’s PAM matricesAlignment algorithms: Needleman-Wunsch,Smith-WatermanStatistical significance of pairwise alignments
60 Statistical significance of pairwise alignment We will discuss the statistical significance of alignment scores in the next lecture (BLAST). A basic question is how to determine whether a particular alignment score is likely to have occurred by chance. According to the null hypothesis, two aligned sequences are not homologous (evolutionarily related). Can we reject the null hypothesis at a particular significance level alpha?
64 Randomization test: scramble a sequence • First compare two proteins and obtain a scoreRBP: RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWD- 84+ K GTW++MA L + A V T L+ W+glycodelin: QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKAPLRVHITSLLPTPEDNLEIVLHRWEN 81• Next scramble the bottom sequence 100 times,and obtain 100 “randomized” scores (+/- standard deviation)• Composition and length are maintained• If the comparison is “real” we expect the authentic scoreto be several standard deviations above the mean of the“randomized” scoresPage 76
65 A randomization test shows that RBP is significantly related to b-lactoglobulin100 random shufflesMean score = 8.4Std. dev. = 4.5Number of instancesReal comparisonScore = 37Quality scoreBut this test assumes a normal distribution of scores!In the BLAST lecture we will consider the distributionof scores and more appropriate significance tests.Page 77
66 Z = (Sreal – Xrandomized score) standard deviation You can perform this randomization testusing the UVA FASTA programs(http://fasta.bioch.virginia.edu/).Z = (Sreal – Xrandomized score)standard deviation
67 The PRSS program performs a scramble test for you(http://fasta.bioch.virginia.edu/fasta/prss.htm)BadscoresGoodscoresBut these scores are notnormally distributed!