–Delete states (circle): silent or null state. Do not match any residues, they are there so it is possible to jump over one or more columns: For modeling when just a few of the sequences have a “-” at a position. Example:
Pseudo-counts Dangerous to estimate a probability distribution from just a few observed amino acids. –If there are two sequences, with Leu at a position: P for Leu =1, but P = 0 for all other residues at this position But we know that often Val substitutes Leu. The probability of the whole sequence are easily become 0 if a single Leu is substituted by a Val. Or, the log-odds is minus infinity. How to avoid “over-fitting” (strong conclusions drawn from very little evidence)? Use pseudocounts: –Pretend to have more counts than those from the data. –A. Add 1 to all the counts: Leu: 3/23, other a.a.: 1/23
Adding 1 to all counts is as assuming a priori all a.a. are equally likely. Another approach: use background composition as pseudocounts.
Searching a database with HMM Know how to calculate the probability of a sequence in the alignment: –multiplying all the probabilities (or adding the log-odds scores) in the model along the path followed by that sequence. For sequences not in the alignment, we do not know the path. –Find a path through the model where the new sequence fits well: –we can then score it as before. Need to “align” the sequence to the model: –Assigning states to each residue in the sequence. –A given sequence can have many alignments.
Eg. A protein has a.a. as: A1, A2, A3, … HMM states as: M1, M2, M3, … for match states, I1, I2, I3, … for insertion states, An alignment: –A1 matches M1, A2 and A3 match I1, A4 matches M2, A5 matches M6 (after passing through three delete states). For each alignment, we can calculate the probability of the sequence, or the log-odds score. –Can find the best alignment. A dynamic programming method can be used: –Viterbi algorithm. –It also gives the probability of the sequence for that alignment, thus we have a score. The log-odds score found can be used to search databases for members of the same family.
Model estimation Profile HMMs can be viewed as a generalization of weight matrix to incorporate insertions and deletions. –We can also estimate the model: determining all the probability parameters from unaligned sequences. –A multiple alignment will also be produced. Iteratively done: –Start with a model: with random probabilities, or reasonable alignment of some of the sequences. –When all the sequences are aligned, we can use the alignment to improve the probabilities in the model, leading to a slightly different alignment. –Repeat this process and improve the probabilities again, until convergence: things no longer improve. –Final model is a multiple alignment.
Problems: –How do we choose the length of the model? This affects the number of inserts in the final alignment. –Iterative procedure can converge to local minimum: No guarantee that it will find the optimal multiple alignment.
An example of a HMM model: the self-loops on the diamond insertion states. The self-loops allow deletions of any length to fit the model, regardless of the length of other sequences in the family A possible hidden Markov model for the protein ACCY. The protein is represented as a sequence of probabilities. The numbers in the boxes show the probability that an amino acid occurs in a particular state, and the numbers next to the directed arcs show probabilities which connect the states. The probability of ACCY is shown as a highlighted path through the model.
Scoring: Any sequence can be represented by a path through the model. The probability of any sequence, given the model, is computed by multiplying the emission and transition probabilities along the path. For ACCY, only the probability of the emitted amino acid is given. –The probability of A being emitted in position 1 is 0.3, and the probability of C being emitted in position 2 is 0.6. The probability of ACCY along this path is : 4 *.3 *.46 *.6 *.97 *.5 *.015 *.73 *.01 * 1 = 1.76x10 -6. Simplification by transforming probabilities to logs so that addition can replace multiplication. –The resulting number is the raw score of a sequence, given the HMM. –the score of ACCY along the path shown is: log e (.4) + log e (.3) + log e (.46) + log e (.6) + log e (.97) + log e (.5) + log e (.015) + log e (.73) +log e (.01) + log e (1) = -13.25
The calculation is easy if the exact state path is known. In a real model, many different state paths through a model can generate the same sequence. –Therefore, the correct probability of a sequence is the sum of probabilities over all of the possible state paths. –Unfortunately, a brute force calculation of this problem is computationally unfeasible, except in the case of very short sequences. –Two good alternatives are: to calculate the sum over all paths inductively using the forward algorithm, or to calculate the most probable path through the model using the Viterbi algorithm.
Another HMM model: The Insert, Match, and Delete states can be labeled with their position number in the model, M1, D1 etc. –Because the number of insertion states is greater than the number of match or delete states, there is an extra insertion state at the beginning of the model, labeled I0. – several state paths through the model are possible for this sequence. I0 I1I2I3 M1M2M3
Viterbi Algorithm The most likely path through the model is computed with the Viterbi algorithm. –The algorithm employs a matrix. The rows of the matrix are indexed by the states in the model, The columns indexed by the sequence. Deletion states are not shown, since, by definition, they have a zero probability of emitting an amino acid. –The elements of the matrix are initialized to zero and then computed. –1. The probability that the amino acid A was generated by state I0 is computed and entered as the first element of the matrix. –2. The probabilities that C is emitted in state M1 (multiplied by the probability of the most likely transition to state M1 from state I0) and in state I1 (multiplied by the most likely transition to state I1 from state I0) are entered into the matrix element indexed by C and I1/M1. –3. The maximum probability, max(I1, M1), is calculated. –4. A pointer is set from the winner back to state I0. –5. Steps 2-4 are repeated until the matrix is filled.
Prob(A in state I0) = 0.4*0.3=0.12 Prob(C in state I1) = 0.05*0.06*0.5 =.015 Prob(C in state M1) = 0.46*0.01 = 0.005 Prob(C in state M2) = 0.46*0.5 = 0.23 Prob(Y in state I3) = 0.015*0.73*0.01 =.0001 Prob(Y in state M3) = 0.97*0.23 = 0.22 The most likely path through the model can now be found by following the back-pointers.
Once the most probable path through the model is known, the probability of a sequence given the model can be computed by multiplying all probabilities along the path. In dishonest casino case, based on the sequences of rolls the Viterbi algorithm can recover the state sequence (which die casino was using) quite well. The spirit is similar to what we calculated last time: what’s the possibility the casino used a loaded die what three “6” showed up.
Projects Scoring matrix for alignment (insertion/deletion) HMM (contact prediction) Phylogenetic tree Genetic circuit (regulation based expression prediction) Microarray (clustering analysis)