1
Multiple sequence alignment (msa)
A) define characters for phylogenetic analysis
B) search for additional family members

SeqA  N ∙ F L S
SeqB  N ∙ F – S
SeqC  N K Y L S
SeqD  N ∙ Y L S

(Figure: a tree relating the four sequences, with inferred ancestral motifs NFS, NFLS, NYLS and NKYLS and changes such as +K and a Y/F substitution marked on its branches.)
2
Complexity of performing msa
Sum-of-pairs (SP) score
Extension of pairwise dynamic programming to N sequences: O(L^N), where L is the sequence length and N is the number of sequences
The MSA algorithm searches only a subspace bounded by:
- a heuristic multiple alignment (suboptimal)
- the pairwise optimal alignments
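As a concrete illustration of the sum-of-pairs objective (not part of the original slides), here is a minimal Python sketch that scores an existing gapped alignment column by column; the match/mismatch/gap values are arbitrary assumptions, and the example alignment is the slide-1 alignment rewritten with '-' for gaps.

```python
# Minimal sum-of-pairs (SP) scorer for an existing multiple alignment.
# Scoring values below are illustrative assumptions, not from the slides.
from itertools import combinations

MATCH, MISMATCH, GAP = 1, -1, -2

def sp_column_score(column):
    """Sum of pairwise scores over all sequence pairs in one alignment column."""
    score = 0
    for a, b in combinations(column, 2):
        if a == '-' and b == '-':
            score += 0          # gap paired with gap is usually ignored
        elif a == '-' or b == '-':
            score += GAP
        elif a == b:
            score += MATCH
        else:
            score += MISMATCH
    return score

def sp_score(alignment):
    """Total SP score of an alignment given as equal-length strings."""
    return sum(sp_column_score(col) for col in zip(*alignment))

# Example: the four-sequence alignment from the first slide
alignment = ["N-FLS", "N-F-S", "NKYLS", "N-YLS"]
print(sp_score(alignment))
```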
3
(Figure: pairwise projection of the alignment search space, with axes Sequence A and Sequence B, showing the pairwise optimal alignment path and the greedy multiple alignment path; the global MSA optimum for sequences A, B, C, … is probably found in the area in between.)
4
Heuristic alignment algorithms
5
Progressive alignment
1. Derive a guide tree from pairwise sequence distances
2. Align the closest sequences (or groups of sequences) first
Pluses: works with many sequences
Minuses: errors made early on are frozen into the final alignment
Programs: Clustal, Pileup, T-Coffee
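To make step 1 concrete, here is a minimal Python sketch (not from the original slides) that derives a guide tree by repeatedly joining the closest clusters under average linkage. The p-distance on equal-length, made-up sequences is an assumption; the sketch only shows the join order and skips the actual gapped alignment step.

```python
# Step 1 of progressive alignment: build a guide tree by repeatedly joining
# the closest clusters (UPGMA-like average linkage).
from itertools import combinations

def p_distance(a, b):
    """Fraction of differing positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def guide_tree_joins(seqs):
    """Return the order in which clusters of sequence indices are joined."""
    clusters = [[i] for i in range(len(seqs))]
    joins = []
    while len(clusters) > 1:
        # average-linkage distance between every pair of current clusters
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: sum(p_distance(seqs[a], seqs[b])
                               for a in clusters[ij[0]]
                               for b in clusters[ij[1]])
                           / (len(clusters[ij[0]]) * len(clusters[ij[1]])),
        )
        joins.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return joins

# made-up equal-length toy sequences (assumption; real input would be unaligned)
seqs = ["TACAGTA", "TACGTCA", "TATAGTA", "CACCTAA"]
print(guide_tree_joins(seqs))   # closest pairs are joined first
```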
6
Progressive pairwise alignment
(Figure: an evolutionary tree relating TACAGTA, TACGTC, TATA and CACCTA, shown next to the order in which progressive alignment joins them; evolution runs forward in time, while the alignment progresses from the closest sequences outward.)
7
Progressive pairwise alignment
(Figure: dynamic-programming matrices for the successive pairwise alignment steps, with match, vertical-gap and horizontal-gap moves marked; the sequences TACAGTA, TACGTC, TATA and CACCTA are aligned progressively, closest pair first.)
8
Progressive pairwise alignment
(Figure: the pairwise alignments from the previous slide are merged step by step, giving the final multiple alignment of TACAGTA, TACGTC, TATA and CACCTA.)
9
Profile HMMs
(Figure: a profile HMM matched against the example sequence CAPNTAGNHPCNGFGTCVPGHGDGSAANNVF.)
10
Iterative methods
Gotoh:
- inner loop uses progressive alignment
- outer loop recomputes distances and the guide tree based on the MSA obtained
DIALIGN:
- identify high-scoring ungapped segment pairs
- build up the MSA from consistent segment pairs
- greedy algorithm (DIALIGN2); exact algorithm (DIALIGN3)
11
Motifs
- several proteins are grouped together by similarity searches
- they share a conserved motif
- the motif is stringent enough to retrieve the family members from the complete protein database
12
Local MSA
Profile analysis: uses a highly conserved portion of the alignment to produce a type of scoring matrix called a profile. New sequences can be aligned to the profile by dynamic programming.
Block analysis: scans a global MSA for ungapped regions, called blocks, which are then used in sequence alignments.
Pattern-searching or statistical methods: find recurrent regions of sequence similarity in a set of initially unaligned sequences.
13
TEIRESIAS
- Finds all patterns with minimum support in an input set of unaligned sequences
- Maximum spacer length, for example L…G…………..A….L…L
- Enumerative algorithm: builds up longer patterns by merging overlapping short patterns, for example A….L + L…L → A….L…L
- Fewer instances occur by chance when more positions are specified
- Pluses: exact
- Minuses: biological significance of the patterns?
14
Expectation maximization (EM)
Initial guess of the motif location in each sequence
E-step: estimate the frequencies of amino acids in each motif column; columns not in the motif provide the background frequencies. For each possible site location, calculate the probability that the site starts there.
M-step: use the site probabilities as weights to produce a new table of expected base counts for each of the site positions.
Repeat the E-step and M-step until convergence.
15
EM procedure
Let the observed variables be known as y and the latent variables as z. Together, y and z form the complete data. Assume that p is a joint model of the complete data with parameters θ: p(y, z | θ). An EM algorithm iteratively improves an initial estimate θ_0, constructing new estimates θ_1 through θ_N. An individual re-estimation step takes the following form (shown for the discrete case; the continuous case is similar):

θ_{n+1} = argmax_θ Σ_z p(z | y, θ_n) log p(y, z | θ)      (Eqn 1)

In other words, θ_{n+1} is the value that maximizes (M) the expectation (E) of the complete-data log-likelihood with respect to the conditional distribution of the latent data under the previous parameter value. This expectation is usually denoted Q(θ):

Q(θ) = Σ_z p(z | y, θ_n) log p(y, z | θ)      (Eqn 2)

Speaking of an expectation (E) step is a bit of a misnomer: what is calculated in the first step are the fixed, data-dependent parameters of the function Q. Once the parameters of Q are known, it is fully determined and is maximized in the second (M) step of an EM algorithm.
It can be shown that an EM iteration does not decrease the observed-data likelihood function, and that the only stationary points of the iteration are the stationary points of the observed-data likelihood function. In practice, this means that an EM algorithm will converge to a local maximum of the observed-data likelihood function.
EM is particularly useful when maximum likelihood estimation of the complete-data model is easy. If closed-form estimators exist, the M-step is often trivial. A classic example is maximum likelihood estimation of a finite mixture of Gaussians, where each component of the mixture can be estimated trivially if the mixing distribution is known.
"Expectation-maximization" describes a class of related algorithms, not a specific algorithm; EM is a recipe or meta-algorithm used to devise particular algorithms. The Baum-Welch algorithm is an example of an EM algorithm applied to hidden Markov models; another example is the EM algorithm for fitting a mixture density model.
An EM algorithm can also find maximum a posteriori (MAP) estimates, by performing MAP estimation in the M-step rather than maximum likelihood. There are other methods for finding maximum likelihood estimates, such as gradient descent, conjugate gradient or variants of the Gauss-Newton method; unlike EM, such methods typically require the evaluation of first and/or second derivatives of the likelihood function.
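The mixture-of-Gaussians example mentioned above is easy to see in code. The following is a minimal sketch (not part of the original notes), assuming a one-dimensional two-component mixture with made-up data and starting values: the E-step computes responsibilities, and the M-step applies the closed-form weighted updates.

```python
# Minimal EM sketch for a two-component 1-D Gaussian mixture.
# Data and starting values are made up for illustration.
import math, random

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_gmm(data, iters=50):
    mu = [min(data), max(data)]      # initial guesses (theta_0)
    sigma = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each data point
        resp = []
        for x in data:
            w = [pi[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(2)]
            s = sum(w)
            resp.append([wk / s for wk in w])
        # M-step: closed-form updates weighted by the responsibilities
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
            sigma[k] = math.sqrt(max(var, 1e-6))
    return pi, mu, sigma

random.seed(0)
data = [random.gauss(0, 1) for _ in range(100)] + [random.gauss(5, 1) for _ in range(100)]
print(em_gmm(data))
```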
16
Exercise
Analyze the following ten DNA sequences by the expectation maximization algorithm. Assume that the background base frequencies are each 0.25 and that the middle three positions are a motif. The size of the motif is a guess based on a molecular model; the alignment of the sequences is also a guess.
Seq1  C CAG A
Seq2  G TTA A
Seq3  G TAC C
Seq4  T TAT T
Seq5  C AGA T
Seq6  T TTT G
Seq7  A TAC T
Seq8  C TAT G
Seq9  A GCT C
Seq10 G TAG A
17
(a) Calculate the observed frequency of each base at each of the three middle positions
Seq1  C CAG A
Seq2  G TTA A
Seq3  G TAC C
Seq4  T TAT T
Seq5  C AGA T
Seq6  T TTT G
Seq7  A TAC T
Seq8  C TAT G
Seq9  A GCT C
Seq10 G TAG A

Base   1st   2nd   3rd
A      0.1   0.6   0.2
C      0.1   0.1   0.2
G      0.1   0.1   0.2
T      0.7   0.2   0.4
18
(b) Calculate the odds likelihood of finding the motif at each of the possible locations in sequence 5 (Seq5 CAGAT), using the frequency table from part (a).
CAGat (motif CAG): 0.1 * 0.6 * 0.2 / (0.25 * 0.25 * 0.25) = 0.768
cAGAt (motif AGA): 0.1 * 0.1 * 0.2 / (0.25 * 0.25 * 0.25) = 0.128
caGAT (motif GAT): 0.1 * 0.6 * 0.4 / (0.25 * 0.25 * 0.25) = 1.536
Non-motif sites: 0.25 / 0.25 = 1
19
(c) Calculate the probability of finding the motif at each position of sequence 5
Seq5 CAGAT (using the frequency table from part (a))
CAGat (motif CAG): 0.1 * 0.6 * 0.2 * 0.25 * 0.25 = 0.00075
cAGAt (motif AGA): 0.25 * 0.1 * 0.1 * 0.2 * 0.25 = 0.000125
caGAT (motif GAT): 0.25 * 0.25 * 0.1 * 0.6 * 0.4 = 0.0015
Non-motif sites: p = 0.25
20
(d) Calculate what change will be made to the base count in each column of the motif table as a result of matching the motif to sequence 5.
CAGat: 0.1 * 0.6 * 0.2 * 0.25 * 0.25 = 0.00075,  relative weight 0.32
cAGAt: 0.25 * 0.1 * 0.1 * 0.2 * 0.25 = 0.000125, relative weight 0.05
caGAT: 0.25 * 0.25 * 0.1 * 0.6 * 0.4 = 0.0015,   relative weight 0.63
Updated counts contributed by sequence 5 (the initial location guess was shaded on the original slide):
Base   1st    2nd                 3rd
A      0.05   0.32 + 0.63 = 0.95  0.05
C      0.32   0                   0
G      0.63   0.05                0.32
T      0      0                   0.63
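A small Python sketch (not part of the original exercise) that reproduces the numbers in parts (b)-(d) for sequence 5; the motif column frequencies are the ones derived in part (a).

```python
# Reproduces parts (b)-(d) of the EM exercise for Seq5 = CAGAT.
# Motif column frequencies are those derived in part (a); background = 0.25.
freqs = {  # base -> (col1, col2, col3)
    'A': (0.1, 0.6, 0.2), 'C': (0.1, 0.1, 0.2),
    'G': (0.1, 0.1, 0.2), 'T': (0.7, 0.2, 0.4),
}
bg, seq, W = 0.25, "CAGAT", 3

probs = []
for start in range(len(seq) - W + 1):
    p = 1.0
    for i, base in enumerate(seq):
        if start <= i < start + W:
            p *= freqs[base][i - start]      # motif column probability
        else:
            p *= bg                          # background probability
    odds = p / bg ** len(seq)                # part (b): odds vs all-background
    probs.append(p)
    print(f"window {start}: P = {p:.6f}, odds = {odds:.3f}")

total = sum(probs)
weights = [p / total for p in probs]         # parts (c)/(d): relative weights
print("relative weights:", [round(w, 2) for w in weights])

# part (d): weighted base-count contributions of Seq5 to each motif column
counts = [{b: 0.0 for b in "ACGT"} for _ in range(W)]
for start, w in enumerate(weights):
    for col in range(W):
        counts[col][seq[start + col]] += w
print(counts)
```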
21
(e) What other steps are taken to update or maximize the table values?
The weighted sequence data from the remaining sequences are also added to the counts table.
The base frequencies in the new table are used as an updated estimate of the site residue composition.
The expectation and maximization steps are repeated until the base frequencies no longer change.
22
Gibbs sampler
Start from a random MSA; realign one sequence against a profile derived from the other sequences; iterate.
Finds the most probable pattern common to all of the sequences by sliding them back and forth until the ratio of the motif probability to the background probability is a maximum.
23
A. Estimate the amino acid frequencies in the motif columns of all but one sequence.
Also obtain the background frequencies.
(Figure: a random start, with an initial motif position, marked M, chosen in each sequence.)
B. Use the estimates from A to calculate the ratio of the motif probability to the background probability at each position in the left-out sequence. This ratio, for each possible location, is the weight of that position.
(Figure: the motif window is slid along the left-out sequence and weighted at each location.)
24
C. Choose a new location for the motif in the left-out sequence by a random selection
using the weights to bias the choice.
(Figure: the estimated location of the motif in the left-out sequence.)
D. Repeat steps A to C many (>100) times.
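A minimal Gibbs motif-sampler sketch in Python (not from the original slides) illustrating steps A-D; the motif width, pseudocounts and iteration count are illustrative assumptions, and the input is the ten exercise sequences.

```python
# Minimal Gibbs motif-sampler sketch (steps A-D above) for DNA sequences.
import random

def gibbs_motif(seqs, w, iters=200, seed=0):
    rng = random.Random(seed)
    # random start: one motif position per sequence
    pos = [rng.randrange(len(s) - w + 1) for s in seqs]
    for _ in range(iters):
        left_out = rng.randrange(len(seqs))
        others = [i for i in range(len(seqs)) if i != left_out]
        # A. column frequencies from all but the left-out sequence (+ pseudocounts)
        counts = [{b: 1.0 for b in "ACGT"} for _ in range(w)]
        for i in others:
            for c in range(w):
                counts[c][seqs[i][pos[i] + c]] += 1
        freqs = [{b: counts[c][b] / sum(counts[c].values()) for b in "ACGT"}
                 for c in range(w)]
        # B. motif/background weight for every window of the left-out sequence
        s = seqs[left_out]
        weights = []
        for start in range(len(s) - w + 1):
            ratio = 1.0
            for c in range(w):
                ratio *= freqs[c][s[start + c]] / 0.25   # background = 0.25
            weights.append(ratio)
        # C. sample a new motif location, biased by the weights
        pos[left_out] = rng.choices(range(len(weights)), weights=weights)[0]
    return pos

seqs = ["CCAGA", "GTTAA", "GTACC", "TTATT", "CAGAT",
        "TTTTG", "ATACT", "CTATG", "AGCTC", "GTAGA"]
print(gibbs_motif(seqs, w=3))   # estimated motif start in each sequence
```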
25
Exercise: Gibbs sampler
Seq1  C CAG A
Seq2  G TTA A
Seq3  G TAC C
Seq4  T TAT T
Seq5  C AGA T
Seq6  T TTT G
Seq7  A TAC T
Seq8  C TAT G
Seq9  A GCT C
Seq10 G TAG A
Analyze these DNA sequences (the same ten as in the EM exercise) by the Gibbs sampling algorithm.
26
(a) Assuming that the background base frequencies are 0.25, calculate a log-odds matrix for the central three positions.
Frequencies:
Base   1st   2nd   3rd
A      0.1   0.6   0.2
C      0.1   0.1   0.2
G      0.1   0.1   0.2
T      0.7   0.2   0.4
Log-odds scores, log2(frequency / 0.25):
Base   1st     2nd     3rd
A      -1.32    1.26   -0.32
C      -1.32   -1.32   -0.32
G      -1.32   -1.32   -0.32
T       1.49   -0.32    0.68
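A short Python check (not in the original slides) of the frequency-to-log-odds conversion used above.

```python
# Convert the motif frequency table to log-odds scores, log2(freq / background).
import math

freqs = {'A': (0.1, 0.6, 0.2), 'C': (0.1, 0.1, 0.2),
         'G': (0.1, 0.1, 0.2), 'T': (0.7, 0.2, 0.4)}
bg = 0.25

log_odds = {b: tuple(round(math.log2(f / bg), 2) for f in cols)
            for b, cols in freqs.items()}
for base, scores in log_odds.items():
    print(base, scores)   # e.g. A (-1.32, 1.26, -0.32), T (1.49, -0.32, 0.68)
```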
27
(b) Assuming that another sequence, GTTTG, is the left-out sequence, slide the log-odds matrix along it and find the log-odds score at each of the three possible positions.
GTTtg: -1.32 + -0.32 + 0.68 = -0.96
gTTTg:  1.49 + -0.32 + 0.68 =  1.85
gtTTG:  1.49 + -0.32 + -0.32 = 0.85
28
(c) Calculate the probability of a match at each position in the left-out sequence.
Since the score is s = log2(p / bg), the motif probability is p = 2^s * bg, where bg = 0.25^3 = 1/64 for a three-base window.
GTTtg: 2^-0.96 / 64 = 0.008   (directly: 0.1 * 0.2 * 0.4 = 0.008)
gTTTg: 2^1.85 / 64 = 0.056    (directly: 0.7 * 0.2 * 0.4 = 0.056)
gtTTG: 2^0.85 / 64 = 0.028    (directly: 0.7 * 0.2 * 0.2 = 0.028)
29
(d) How do we choose a possible location for the motif in the left-out sequence?
Sum of the probabilities: 0.008 + 0.056 + 0.028 = 0.092
Normalised weights:
GTTtg: 0.008 / 0.092 = 0.09
gTTTg: 0.056 / 0.092 = 0.61
gtTTG: 0.028 / 0.092 = 0.30
A new location is then chosen at random, with these weights biasing the choice.
30
Modelling protein families
31
PSSM
The PSSM is constructed by a logarithmic transformation of a matrix giving the frequency of each amino acid in the motif. If a good sampling of sequences is available, the number of sequences is sufficiently large, and the motif structure is not too complex, it should in principle be possible to produce a PSSM that is highly representative of the same motif in other sequences as well.
Related topics: pseudocounts; sequence logos and information content.
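As an illustration (not from the original slides), here is a minimal sketch of building a log-odds PSSM from aligned DNA motif instances; the uniform pseudocount and background value are assumptions, and real profile tools use more elaborate schemes such as the Dirichlet mixtures discussed later.

```python
# Build a log-odds PSSM from aligned DNA motif instances with a uniform
# pseudocount (the pseudocount value and background are illustrative).
import math

def build_pssm(instances, alphabet="ACGT", pseudocount=1.0, background=0.25):
    width = len(instances[0])
    pssm = []
    for col in range(width):
        counts = {a: pseudocount for a in alphabet}
        for seq in instances:
            counts[seq[col]] += 1
        total = sum(counts.values())
        pssm.append({a: math.log2((counts[a] / total) / background)
                     for a in alphabet})
    return pssm

# Motif windows from the earlier exercise sequences (middle three bases)
instances = ["CAG", "TTA", "TAC", "TAT", "AGA", "TTT", "TAC", "TAT", "GCT", "TAG"]
for col in build_pssm(instances):
    print({a: round(s, 2) for a, s in col.items()})
```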
32
Exercise: construct a PSSM
Base   Column 1   Column 2   Column 3   Column 4
A       0.6        0.1        0.2        0.1
C       0.1        0.7        0.1        0.1
G       0.1        0.1        0.6        0.1
T       0.2        0.1        0.1        0.7
33
Exercise cntd (a) Assuming the background frequency is 0.25 for each base, calculate a log odds score for each table position, i.e. log to the base 2 of the ratio of each observed value to the expected frequency.
34
Observed/expected frequencies
Base   Column 1          Column 2          Column 3          Column 4
A      0.6/0.25 = 2.4    0.1/0.25 = 0.4    0.2/0.25 = 0.8    0.1/0.25 = 0.4
C      0.1/0.25 = 0.4    0.7/0.25 = 2.8    0.1/0.25 = 0.4    0.1/0.25 = 0.4
G      0.1/0.25 = 0.4    0.1/0.25 = 0.4    0.6/0.25 = 2.4    0.1/0.25 = 0.4
T      0.2/0.25 = 0.8    0.1/0.25 = 0.4    0.1/0.25 = 0.4    0.7/0.25 = 2.8
35
Log-odds scores, log2(observed/expected):
Base   Column 1   Column 2   Column 3   Column 4
A       1.26      -1.32      -0.32      -1.32
C      -1.32       1.49      -1.32      -1.32
G      -1.32      -1.32       1.26      -1.32
T      -0.32      -1.32      -1.32       1.49
36
Exercise cntd (b) Align the matrix with each position in the sequence TCACGTAA, starting at position 1, 2, etc., and calculate the log-odds score for the matrix at that position.
37
Sliding the log-odds matrix along TCACGTAA:
TCACgtaa: -0.32 + 1.49 + -0.32 + -1.32 = -0.47
tCACGtaa: -1.32 + -1.32 + -1.32 + -1.32 = -5.28
tcACGTaa:  1.26 + 1.49 + 1.26 + 1.49 = 5.50
tcaCGTAa: -1.32 + -1.32 + -1.32 + -1.32 = -5.28
tcacGTAA: -1.32 + -1.32 + -0.32 + -1.32 = -4.28
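A short Python sketch (not from the original slides) that slides the reconstructed log-odds matrix along TCACGTAA and prints each window score.

```python
# Slide the 4-column log-odds matrix along TCACGTAA and score every window.
log_odds = {'A': ( 1.26, -1.32, -0.32, -1.32),
            'C': (-1.32,  1.49, -1.32, -1.32),
            'G': (-1.32, -1.32,  1.26, -1.32),
            'T': (-0.32, -1.32, -1.32,  1.49)}
seq, width = "TCACGTAA", 4

for start in range(len(seq) - width + 1):
    window = seq[start:start + width]
    score = sum(log_odds[base][col] for col, base in enumerate(window))
    print(f"{window} at position {start + 1}: {score:+.2f}")
```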
38
(c) Calculate the probability of the best matching position.
The best match (score 5.50) is at position 3, window ACGT: p = 0.6 * 0.7 * 0.6 * 0.7 = 0.1764
39
Alignment of sequences with a structure: hidden Markov models
HMMs are well suited to describing correlations among neighbouring sites
a probabilistic framework allows realistic modelling
well-developed mathematical methods provide:
- the best solution (the most probable path through the model)
- a confidence score (the posterior probability of any single solution)
- inference of structure (the posterior probability of using particular model states)
we can align multiple sequences using complex models and simultaneously predict their internal structure
40
Coin game
Fair coin: p(1) = p(0) = 0.5
Biased coin: p(1) = 0.8, p(0) = 0.2
For an observed series of 15 tosses with eight 1s and seven 0s:
P(fair coin) = 0.5^15 ≈ 3.1e-5
P(biased coin) = 0.8^8 * 0.2^7 ≈ 2.1e-6
P(biased coin) / P(fair coin) ≈ 0.07
41
Markov process: the state depends only on the previous state
Markov models for sequences: states have emission probabilities, and states are connected by transition probabilities
42
Hidden Markov model for two-state variable
t = transition probability, p = emission probability
(Figure: a two-state HMM with transitions t(1,1), t(1,2), t(2,end) between states 1, 2 and end, and emission probabilities p1(a), p1(b), p2(a), p2(b).)
For the observed sequence x = a b a emitted along the state path 1, 1, 2, end:
P(x, path | HMM) = p1(a) * t(1,1) * p1(b) * t(1,2) * p2(a) * t(2,end)
43
Profile HMM
(Figure: profile-HMM architecture with match, insert and delete states for positions 1-4, between begin and end states.)
A match state emits one of the 20 amino acids.
An insert state emits one of the 20 amino acids.
Delete, begin and end states are mute (they emit nothing).
44
A possible hidden Markov model for the protein ACCY.
45
HMM with multiple paths through the model for ACCY
HMM with multiple paths through the model for ACCY. The highlighted path is only one of several possibilities.
46
The Viterbi algorithm computes the sequence score over the most likely path rather than over the sum of all paths.
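A minimal log-space Viterbi sketch (not from the original slides); the toy two-state model with symbols a and b is an assumption for illustration, not the profile HMM described above.

```python
# Viterbi: the single most likely state path for an observation string.
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (best log-probability, most likely state path) for obs."""
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            best_prev = max(states, key=lambda r: V[t - 1][r] + math.log(trans_p[r][s]))
            score = V[t - 1][best_prev] + math.log(trans_p[best_prev][s])
            V[t][s] = score + math.log(emit_p[s][obs[t]])
            back[t][s] = best_prev
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.insert(0, back[t][path[0]])   # trace the best path backwards
    return V[-1][last], path

states = ("1", "2")
start_p = {"1": 0.9, "2": 0.1}
trans_p = {"1": {"1": 0.6, "2": 0.4}, "2": {"1": 0.3, "2": 0.7}}
emit_p = {"1": {"a": 0.8, "b": 0.2}, "2": {"a": 0.3, "b": 0.7}}
print(viterbi("aba", states, start_p, trans_p, emit_p))
```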
47
Forward algorithm Similar to Viterbi except that a sum rather than the maximum is computed
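For comparison, a forward-algorithm sketch under the same toy-model assumptions as the Viterbi example: the recursion is identical except that it sums over predecessor states instead of taking the maximum.

```python
# Forward algorithm: total probability of the observation over all state paths.
def forward(obs, states, start_p, trans_p, emit_p):
    """Return P(obs | model), summed over all state paths."""
    f = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        f = {s: emit_p[s][symbol] * sum(f[r] * trans_p[r][s] for r in states)
             for s in states}
    return sum(f.values())

states = ("1", "2")
start_p = {"1": 0.9, "2": 0.1}
trans_p = {"1": {"1": 0.6, "2": 0.4}, "2": {"1": 0.3, "2": 0.7}}
emit_p = {"1": {"a": 0.8, "b": 0.2}, "2": {"a": 0.3, "b": 0.7}}
print(forward("aba", states, start_p, trans_p, emit_p))   # total probability of "aba"
```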
48
What the Score Means Once the probability of a sequence has been determined, its score can be computed. Because the model is a generalization of how amino acids are distributed in a related group (or class) of sequences, a score measures the probability that a sequence belongs to the class. A high score implies that the sequence of interest is probably a member of the class, and a low score implies it is probably not a member.
49
Optimisation
The Baum-Welch algorithm is a variation of the forward algorithm described earlier. It begins with a reasonable guess for an initial model and then calculates a score for each sequence in the training set over all possible paths through this model. During the next iteration, a new set of expected emission and transition probabilities is calculated. The updated parameters replace those in the initial model, and the training sequences are scored against the new model. The process is repeated until the model converges, meaning there is very little change in the parameters between iterations. The Viterbi algorithm is less computationally expensive than Baum-Welch.
50
Heuristics There is no guarantee that a model built with either the Baum-Welch or Viterbi algorithm has parameters which maximize the probability of the training set. As in many iterative methods, convergence indicates only that a local maximum has been found. Several heuristic methods have been developed to deal with this problem.
51
Heuristics: parallel trials
Start with several initial models and build several models in parallel. When the models converge to several different local optima, the probability of each model given the training set is computed, and the model with the highest probability wins.
52
Heuristics: add noise
Add noise, or random data, into the mix at each iteration of the model-building process. Typically an annealing schedule is used: it controls the amount of noise added during each iteration, and less and less noise is added as iterations proceed (the decrease is either linear or exponential). The effect is to delay the convergence of the model. When the model finally does converge, it is more likely to have found a good approximation to the global maximum.
53
Sequence weighting
54
Overfitting and regularization
CGGSLLNAN--TVLTAAHC
CGGSLIDNK-GWILTAAHC
CGGSLIRQG--WVMTAAHC
CGGSLIREDSSFVLTAAHC
55
Dirichlet mixtures A sophisticated application of this method is known as Dirichlet mixtures. The mixtures are created by statistical analysis of the distribution of amino acids at particular positions in a large number of proteins. The mixtures are built from smaller components known as Dirichlet densities. A Dirichlet density is a probability density over all possible combinations of amino acids appearing in a given position. It gives high probability to certain distributions and low probability to others. For example, a particular Dirichlet density may give high probability to conserved distributions where a single amino acid predominates over all others. Another possibility is a density where high probability is given to amino acids with a common identifying feature, such as the subgroup of hydrophobic amino acids. When an HMM is built using a Dirichlet mixture, a wealth of information about protein structure is factored into the parameter estimation process. The pseudocounts for each amino acid are calculated from a weighted sum of Dirichlet densities and added to the observed amino acid counts from the training set. The parameters of the model are calculated as described above for simple pseudocounts.
56
Log-odds ratio This number is the log of the ratio between two probabilities — the probability that the sequence was generated by the HMM and the probability that the sequence was generated by a null model, whose parameters reflect the general amino acid distribution in the training sequences.
57
Limitations of profile-HMM
The HMM is a linear model and is unable to capture higher order correlations among amino acids in a protein molecule. In reality, amino acids which are far apart in the linear chain may be physically close to each other when a protein folds. Chemical and electrical interactions between them cannot be predicted with a linear model. Another flaw of HMMs lies at the very heart of the mathematical theory behind these models: the probability of a protein sequence can be found by multiplying the probabilities of the amino acids in the sequence. This claim is only valid if the probability of any amino acid in the sequence is independent of the probabilities of its neighbors. In biology, this is not the case. There are, in fact, strong dependencies between these probabilities. For example, hydrophobic amino acids are highly likely to appear in proximity to each other. Because such molecules fear water, they cluster at the inside of a protein, rather than at the surface where they would be forced to encounter water molecules.
58
A simplified hidden Markov model (HMM)
(Figure: a simplified HMM with BEGIN and END states, three match states with emission probabilities
  match 1: A 0.1, C 0.2, G 0.2, T 0.5
  match 2: A 0.4, C 0.3, G 0.1, T 0.2
  match 3: A 0.1, C 0.4, G 0.4, T 0.1
insert states that emit A, C, G and T with probability 0.25 each, and the transition probabilities (including several of probability 1 and a 0.9 transition into END) used in the exercise below.)
59
(a) Calculate the probability of the sequence TAG by following a path through the model's three match states.
0.7 * 0.5 * 0.7 * 0.4 * 0.7 * 0.4 * 0.9 = 0.0247
60
(b) Repeat (a) for a path that goes first to the insert state, then to a match state, then to a delete state, then to a match state, and then to End.
0.1 * 0.25 * 1 * 0.1 * 0.2 * 1 * 1 * 0.4 * 0.9 = 0.00018
61
(c) Which of the two paths is the more probable one, and what is the ratio of the higher probability to the lower one?
Path (a) is more probable: p(path a) / p(path b) = 0.0247 / 0.00018 = 137.2
The highest-scoring path is the best alignment of the sequence with the model. The Viterbi algorithm, a dynamic-programming algorithm, finds this highest-scoring path.
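A tiny Python check of parts (a)-(c), not part of the original exercise; the factors are simply the transition and emission values used in the two calculations above.

```python
# Probabilities of the two paths through the simplified HMM for sequence TAG.
import math

# (a) path through the three match states, emitting T, A, G
path_a = [0.7, 0.5, 0.7, 0.4, 0.7, 0.4, 0.9]
# (b) the insert/match/delete/match path from part (b), emitting T, A, G
path_b = [0.1, 0.25, 1, 0.1, 0.2, 1, 1, 0.4, 0.9]

p_a, p_b = math.prod(path_a), math.prod(path_b)
print(f"p(a) = {p_a:.4f}, p(b) = {p_b:.5f}, ratio = {p_a / p_b:.1f}")
```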
62
Improving the model Adjust the scores for the states and transition probabilities by aligning additional sequences with the model using an HMM adaptation of the expectation maximization algorithm. In the expectation step, calculate all of the possible paths through the model, sum the scores, and then calculate the probability of each path. Each state and transition probability is then updated by the maximization step of the algorithm to make the model better predict the new sequence.
63
Information theory primer
Information, uncertainty
Suppose we have a device that can produce three symbols: A, B, or C ("uncertainty of 3 symbols").
Uncertainty = log2(M), where M is the number of symbols; the base-2 logarithm gives units of bits.
Example: in reading mRNA, if the ribosome encounters any one of 4 equally likely bases, the uncertainty is 2 bits.
64
Surprisal u
Let symbol i have probability P_i. Surprisal: u_i = -log2(P_i)
For an infinite string of symbols, the average surprisal is
H = Σ_i P_i u_i = -Σ_i P_i log2(P_i)   (bits per symbol)
where the sum runs over all symbols i. H is Shannon's entropy.
65
H function in the case of two symbols
66
If all symbols are equally likely?
H_equiprobable = -Σ_{i=1..M} (1/M) log2(1/M) = log2(M)
67
Coding
Shorter codes use few bits for common symbols and many bits for rare symbols.
M = 4: A, C, G, T with probabilities P(A) = 1/2, P(C) = 1/4, P(G) = 1/8, P(T) = 1/8
Surprisals: u(A) = 1 bit, u(C) = 2 bits, u(G) = u(T) = 3 bits
Uncertainty: H = 1/2*1 + 1/4*2 + 1/8*3 + 1/8*3 = 1.75 bits per symbol
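A one-function Python sketch (not from the original slides) of the entropy calculation used above.

```python
# Shannon entropy of a symbol distribution, in bits per symbol.
import math

def entropy(probs):
    """H = -sum p * log2(p); zero-probability terms are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.25, 0.125, 0.125]))   # the coding example -> 1.75 bits
print(entropy([0.25] * 4))                  # equiprobable DNA bases -> 2.0 bits
```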
68
Recode symbols so that the number of binary digits equals the surprisal
The string ACATGAAC is then coded with 1 + 2 + 1 + 3 + 3 + 1 + 1 + 2 = 14 binary digits, i.e. 14 / 8 = 1.75 bits per symbol.
69
Noisy communication channels
Sender ----[noise]---- receiver
Two equally likely symbols are sent at a rate of 1 bit per second.
If x = 0 is sent, the probability of receiving y = 0 is 0.99 and the probability of receiving y = 1 is 0.01, and vice versa.
The remaining uncertainty after receiving a symbol is
H_y(x) = -0.99 log2(0.99) - 0.01 log2(0.01) = 0.081
so the actual rate of transmission is R = 1 - 0.081 = 0.919 bits per second.
70
Exercise: information content
Base   Column 1   Column 2   Column 3   Column 4
A       0.6        0.1        0.2        0.1
C       0.1        0.7        0.1        0.1
G       0.1        0.1        0.6        0.1
T       0.2        0.1        0.1        0.7
71
Information content of a scoring matrix by the relative entropy method (ignoring background frequencies)
(a) Calculate the entropy or uncertainty (Hc) for each column and for the entire matrix.
H_c = -Σ_i p_ic log2(p_ic), where p_ic is the frequency of residue type i in column c.
Column 1: H_c = -(0.6 log2(0.6) + 2*0.1 log2(0.1) + 0.2 log2(0.2)) = 1.571
Column 2: H_c = -(0.7 log2(0.7) + 3*0.1 log2(0.1)) = 1.358
Column 3: H_c = -(0.6 log2(0.6) + 2*0.1 log2(0.1) + 0.2 log2(0.2)) = 1.571
Column 4: H_c = -(0.7 log2(0.7) + 3*0.1 log2(0.1)) = 1.358
Entire matrix: 1.571 + 1.358 + 1.571 + 1.358 = 5.858
(log2(x) can be computed as log(x)/log(2).)
72
(b) Calculate the decrease in uncertainty, or amount of information (Rc), for column 1 due to these data (for DNA, Rc = 2 - Hc; for proteins, Rc = 4.32 - Hc).
If it's DNA: Rc = 2 - Hc = 2 - 1.57 = 0.43
If it's protein: Rc = 4.32 - Hc = 4.32 - 1.57 = 2.75
73
(c) Calculate the amount by which the uncertainty is reduced (the amount of information contributed) by each base in column 1.
f(A) = 0.6:  H_A,1 = -0.6 log2(0.6) = 0.442
f(C) = f(G) = 0.1:  H_C,1 = H_G,1 = -0.1 log2(0.1) = 0.332
f(T) = 0.2:  H_T,1 = -0.2 log2(0.2) = 0.464
74
Database searching
• The first and most common operation in protein informatics, and the only way to access the information in large databases
• Primary tool for inference of homologous structure and function
• Improved algorithms handle large databases quickly
• Provides an estimate of statistical significance
• Generates alignments
• Definitions of similarity can be tuned using different scoring matrices and algorithm-specific parameters
75
Types of alignment
Sequence-sequence: target distribution = a generic substitution matrix
Sequence-profile: position-specific target distributions
Profile-profile: observed frequencies from the multiple alignments, averaged both ways; or a pair HMM, scoring the probability that the two HMMs generate the same sequence
76
PSI-Blast Position-specific-iterated Blast
77
Steps in a PSI-Blast search
• Constructs a multiple alignment from a Gapped BLAST search and generates a profile from any significant local alignments found
• The profile is compared to the protein database, and PSI-BLAST estimates the statistical significance of the local alignments found, using "significant" hits to extend the profile for the next round
• PSI-BLAST iterates step 2 an arbitrary number of times or until convergence
79
PHI-Blast Pattern-hit-initiated Blast