3Refresher: pairwise alignments Alignment score is calculated from substitution matrixIdentities on diagonal have high scoresSimilar amino acids have high scoresDissimilar amino acids have low (negative) scoresGaps penalized by gap-opening + gap elongationA 5RNDCQEG.A R N D C Q E G ...K L A A S V I L S D A LK L A A S D A Lx (-1)=-13
4Refresher: pairwise alignments The number of possible pairwise alignments increases explosively with the length of the sequences:Two protein sequences of length 100 amino acids can be aligned in approximately 1060 different ways1060 bottles of beer would fill up our entire galaxy
5Refresher: pairwise alignments Solution:dynamic programmingEssentially:the best path through any grid point in the alignment matrix must originate from one of three previous pointsFar fewer computationsBest alignment guaranteed to be foundT C G C ATCAx
6Refresher: pairwise alignments Most used substitution matrices are themselves derived empirically from simple multiple alignmentsA/A %A/C %A/D %...Multiple alignmentCalculatesubstitutionfrequenciesConvertto scoresFreq(A/C),observedFreq(A/C),expectedScore(A/C) = log
8Multiple alignments: what use are they? Starting point for studies of molecular evolution
9Multiple alignments: what use are they? Characterization of protein families:Identification of conserved (functionally important) sequence regionsPrediction of structural features (disulfide bonds, amphipathic alpha-helices, surface loops, etc.)
10Scoring a multiple alignment: the “sum of pairs” score AA: 4, AS: 1, AT:0AS: 1, AT: 0ST: 1Score: = 7...A......S......T...One columnfrom alignmentIn theory, it is possible to define an alignment scorefor multiple alignments (there are many alternative scoring systems)
11Multiple alignment: dynamic programming is only feasible for very small data sets In theory, optimal multiple alignment can be found by dynamic programming using a matrix with more dimensions (one dimension per sequence)BUT even with dynamic programming finding the optimal alignment very quickly becomes impossible due to the astronomical number of computationsFull dynamic programming only possible for up to about 4-5 protein sequences of average lengthEven with heuristics, not feasible for more than 7-8 protein sequencesNever used in practiceDynamic programming matrix for 3 sequencesFor 3 sequences, optimal path must comefrom one of 7 previous points
12Multiple alignment: an approximate solution Progressive alignment (ClustalX and other programs):Perform all pairwise alignments; keep track of sequence similarities between all pairs of sequences (construct “distance matrix”)Align the most similar pair of sequencesProgressively add sequences to the (constantly growing) multiple alignment in order of decreasing similarity.
13Progressive alignment: details Perform all pairwise alignments, note pairwisedistances (construct “distance matrix”)2) Construct pseudo-phylogenetic tree from pairwise distancesS1 S2 S3 S4S1S2 3SSS1S2S36 pairwisealignmentsS4S1 S2 S3 S4S1S2 3SSS1S3S4S2“Guide tree”
14Progressive alignment: details Use tree as guide for multiple alignment:Align most similar pair of sequences using dynamic programmingAlign next most similar pairAlign alignments using dynamic programming - preserve gapsS1S3S1S3S4S2S2S4S1S3S2S4New gap to optimize alignmentof (S2,S4) with (S1,S3)
15Aligning profiles + Full dynamic programming on two profiles Aligning alignments: each alignment treated as a single sequence (a profile)+S2S4Full dynamic programmingon two profilesS1S3S2S4New gap to optimize alignmentof (S2,S4) with (S1,S3)
16Scoring profile alignments Compare each residue in one profile to all residues in second profile. Score is average of all comparisons....A......S...AS: 1, AT:0SS: 4, ST:1Score: = 1.5+...S......T...4One columnfrom alignment
17Additional ClustalX heuristics Sequence weighting:scores from similar groups of sequences are down-weightedVariable substitution matrices:during alignment ClustalX uses different substitution matrices depending on how similar the sequences/profiles areVariable gap penalties:gap penalties depend on substitution matrixgap penalties depend on similarity of sequencesreduced gap penalties at existing gapsincreased gap penalties CLOSE to existing gapsreduced gap penalties in hydrophilic stretches (presumed surface loop)residue-specific gap penaltiesand more...
20Global methods (e.g., ClustalX) get into trouble when data is not globally related!!!
21Global methods (e.g., ClustalX) get into trouble when data is not globally related!!!
22Global methods (e.g., ClustalX) get into trouble when data is not globally related!!! Possible solutions:Cut out conserved regions of interest and THEN align themUse method that deals with local similarity (e.g. DIALIGN)