Lecture 9 Hidden Markov Models BioE 480 Sept 21, 2004.

–Delete states (circles): silent or null states. They do not match any residues; they exist so that it is possible to jump over one or more columns, i.e., for modeling cases where only a few of the sequences have a "-" at a position. Example:

Pseudo-counts
It is dangerous to estimate a probability distribution from just a few observed amino acids.
–If there are two sequences, both with Leu at a position: P(Leu) = 1, but P = 0 for every other residue at this position. Yet we know that Val often substitutes for Leu. The probability of a whole sequence can easily become 0 if a single Leu is substituted by a Val; equivalently, the log-odds score is minus infinity.
How do we avoid "over-fitting" (strong conclusions drawn from very little evidence)? Use pseudocounts:
–Pretend to have more counts than those observed in the data.
–A. Add 1 to all the counts: Leu: 3/23, other a.a.: 1/23.

Adding 1 to all counts amounts to assuming a priori that all amino acids are equally likely.
–B. Another approach: use the background amino acid composition as pseudocounts.
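
A minimal Python sketch of the two pseudocount schemes (A: add one to every count; B: background-composition pseudocounts). The background frequencies are rough illustrative values, and only the 20 amino acids are counted here, so the exact fractions differ slightly from the 3/23 figure on the slide.

```python
# Pseudocount estimation for the two-sequence Leu column discussed above.

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")

def add_one(column):
    """Scheme A: add 1 to every count (uniform pseudocounts)."""
    counts = {aa: 1 for aa in AMINO_ACIDS}
    for aa in column:
        counts[aa] += 1
    total = sum(counts.values())
    return {aa: c / total for aa, c in counts.items()}

def background_pseudocounts(column, background, weight=20.0):
    """Scheme B: pseudocounts proportional to the background composition."""
    counts = {aa: weight * background[aa] for aa in AMINO_ACIDS}
    for aa in column:
        counts[aa] += 1
    total = sum(counts.values())
    return {aa: c / total for aa, c in counts.items()}

# Rough, illustrative background frequencies (not values from the lecture).
background = {aa: 0.04 for aa in AMINO_ACIDS}
background.update({"L": 0.10, "A": 0.09, "G": 0.09, "V": 0.08})

column = ["L", "L"]      # two sequences, both Leu at this position
print(add_one(column)["L"], add_one(column)["V"])
print(background_pseudocounts(column, background)["L"],
      background_pseudocounts(column, background)["V"])
```

With either scheme, Leu no longer gets all of the probability, so a Val at this position no longer drives the sequence probability to zero.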

Searching a database with an HMM
We know how to calculate the probability of a sequence that is in the alignment:
–Multiply all the probabilities (or add the log-odds scores) in the model along the path followed by that sequence.
For sequences not in the alignment, we do not know the path.
–Find a path through the model where the new sequence fits well; we can then score it as before.
We need to "align" the sequence to the model:
–Assign a state to each residue in the sequence.
–A given sequence can have many alignments.

E.g., a protein has amino acids A1, A2, A3, …; the HMM has states M1, M2, M3, … (match states) and I1, I2, I3, … (insertion states). One possible alignment:
–A1 matches M1, A2 and A3 match I1, A4 matches M2, A5 matches M6 (after passing through three delete states).
For each alignment, we can calculate the probability of the sequence, or its log-odds score, and so we can find the best alignment. A dynamic programming method can be used:
–The Viterbi algorithm.
–It also gives the probability of the sequence for that alignment, and thus a score.
The log-odds score found can be used to search databases for members of the same family.

Model estimation
Profile HMMs can be viewed as a generalization of the weight matrix that incorporates insertions and deletions.
–We can also estimate the model: determine all the probability parameters from unaligned sequences.
–A multiple alignment is produced as a by-product.
This is done iteratively:
–Start with a model: with random probabilities, or from a reasonable alignment of some of the sequences.
–Once all the sequences are aligned to the model, use the alignment to improve the probabilities in the model, which leads to a slightly different alignment.
–Repeat this process, improving the probabilities again, until convergence: things no longer improve.
–The final model yields a multiple alignment.
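
A toy sketch of the re-estimation step in this loop, under assumed inputs: given the current alignment of the sequences to the model's match columns, recount the residues in each column (with +1 pseudocounts) to obtain updated emission probabilities. The alignment below is made up; a full implementation would then re-align the sequences to the updated model and repeat until convergence.

```python
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")

def reestimate_emissions(aligned_seqs):
    """One re-estimation step: new match-state emission probabilities
    from the current alignment (gap characters '-' are simply skipped here)."""
    n_cols = len(aligned_seqs[0])
    emissions = []
    for col in range(n_cols):
        counts = {aa: 1 for aa in AMINO_ACIDS}      # +1 pseudocount per residue
        for seq in aligned_seqs:
            if seq[col] != "-":
                counts[seq[col]] += 1
        total = sum(counts.values())
        emissions.append({aa: c / total for aa, c in counts.items()})
    return emissions

# A made-up current alignment; each column corresponds to one match state.
alignment = ["ACDE",
             "ACD-",
             "SCDE"]
model = reestimate_emissions(alignment)
print(round(model[0]["A"], 3), round(model[0]["S"], 3))
# In the full procedure, the sequences would now be re-aligned to this updated
# model and the two steps repeated until the alignment stops changing.
```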

Problems:
–How do we choose the length of the model? This affects the number of insert states in the final alignment.
–The iterative procedure can converge to a local minimum: there is no guarantee that it will find the optimal multiple alignment.

An example of an HMM: note the self-loops on the diamond-shaped insertion states. The self-loops allow insertions of any length, so sequences of any length can fit the model, regardless of the length of other sequences in the family.
A possible hidden Markov model for the protein ACCY. The protein is represented as a sequence of probabilities. The numbers in the boxes show the probability that an amino acid occurs in a particular state, and the numbers next to the directed arcs show the transition probabilities that connect the states. The probability of ACCY is shown as a highlighted path through the model.

Scoring:
Any sequence can be represented by a path through the model. The probability of a sequence, given the model, is computed by multiplying the emission and transition probabilities along that path. For ACCY, only the probabilities of the emitted amino acids are given here.
–The probability of A being emitted in position 1 is 0.3, and the probability of C being emitted in position 2 is 0.6.
The probability of ACCY along this path is: 0.4 * 0.3 * 0.46 * 0.6 * 0.97 * 0.5 * 0.015 * 0.73 * 0.01 * 1 = 1.76 x 10^-6.
A simplification is to transform the probabilities to logs, so that addition can replace multiplication.
–The resulting number is the raw score of the sequence, given the HMM.
–The score of ACCY along the path shown is: ln(0.4) + ln(0.3) + ln(0.46) + ln(0.6) + ln(0.97) + ln(0.5) + ln(0.015) + ln(0.73) + ln(0.01) + ln(1) ≈ -13.25.
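
A quick Python check of this arithmetic; the list of factors is exactly the emission and transition probabilities of the highlighted path as given above.

```python
import math

# Emission and transition probabilities along the highlighted ACCY path,
# in the order they are multiplied on the slide.
path_probs = [0.4, 0.3, 0.46, 0.6, 0.97, 0.5, 0.015, 0.73, 0.01, 1.0]

prob = math.prod(path_probs)                        # multiply along the path
log_score = sum(math.log(p) for p in path_probs)    # equivalent additive (raw) score

print(f"P(ACCY | path) = {prob:.3g}")        # ~1.76e-06
print(f"raw log score  = {log_score:.2f}")   # ~-13.25
```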

The calculation is easy if the exact state path is known. In a real model, however, many different state paths through the model can generate the same sequence.
–Therefore, the correct probability of a sequence is the sum of the probabilities over all possible state paths.
–Unfortunately, a brute-force enumeration of all paths is computationally infeasible, except for very short sequences.
–Two good alternatives are: calculate the sum over all paths inductively using the forward algorithm, or calculate the most probable path through the model using the Viterbi algorithm.
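
A minimal sketch of the forward algorithm on a generic two-state HMM (not the profile model in the figure); the states, transition matrix, and emission table below are illustrative assumptions.

```python
# Forward algorithm: P(sequence | model) summed over all state paths.
# Toy two-state HMM with made-up parameters, just to show the recursion.

states = ["S1", "S2"]
start = {"S1": 0.5, "S2": 0.5}                       # initial state probabilities
trans = {"S1": {"S1": 0.9, "S2": 0.1},
         "S2": {"S1": 0.2, "S2": 0.8}}               # trans[s][t] = P(t | s)
emit  = {"S1": {"A": 0.4, "C": 0.6},
         "S2": {"A": 0.7, "C": 0.3}}                 # emit[s][x] = P(x | s)

def forward(seq):
    # f[s] = P(x_1..x_i, state_i = s), updated column by column
    f = {s: start[s] * emit[s][seq[0]] for s in states}
    for x in seq[1:]:
        f = {t: emit[t][x] * sum(f[s] * trans[s][t] for s in states) for t in states}
    return sum(f.values())                           # sum over the final state

print(forward("ACCA"))
```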

Another HMM: the Insert, Match, and Delete states can be labeled with their position number in the model, e.g. M1, D1, etc.
–Because the number of insertion states is one greater than the number of match or delete states, there is an extra insertion state at the beginning of the model, labeled I0.
–Several state paths through the model are possible for this sequence.
(Model states: I0, I1, I2, I3; M1, M2, M3.)

Viterbi Algorithm
The most likely path through the model is computed with the Viterbi algorithm.
–The algorithm employs a matrix. The rows of the matrix are indexed by the states in the model, and the columns by the positions in the sequence. Deletion states are not shown, since, by definition, they have zero probability of emitting an amino acid.
–The elements of the matrix are initialized to zero and then computed as follows:
–1. The probability that the amino acid A was generated by state I0 is computed and entered as the first element of the matrix.
–2. The probabilities that C is emitted in state M1 (multiplied by the probability of the most likely transition to state M1 from state I0) and in state I1 (multiplied by the probability of the most likely transition to state I1 from state I0) are entered into the matrix elements indexed by C and by M1 and I1, respectively.
–3. The maximum probability, max(I1, M1), is determined.
–4. A pointer is set from the winner back to state I0.
–5. Steps 2-4 are repeated until the matrix is filled.

Prob(A in state I0) = 0.4 * 0.3 = 0.12
Prob(C in state I1) = 0.05 * 0.6 * 0.5 = 0.015
Prob(C in state M1) = 0.46 * 0.01 = 0.0046
Prob(C in state M2) = 0.46 * 0.5 = 0.23
Prob(Y in state I3) = 0.015 * 0.73 * 0.01 ≈ 0.0001
Prob(Y in state M3) = 0.97 * 0.23 ≈ 0.22
The most likely path through the model can now be found by following the back-pointers.

Once the most probable path through the model is known, the probability of the sequence given the model can be computed by multiplying all the probabilities along that path. In the dishonest-casino case, the Viterbi algorithm can recover the state sequence (which die the casino was using) quite well from the sequence of rolls alone. The spirit is similar to what we calculated last time: what is the probability that the casino used a loaded die when three "6"s showed up in a row?
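
A minimal Viterbi sketch for the dishonest-casino HMM mentioned above, following the fill-and-trace-back steps from the previous slide; the transition and emission probabilities are assumed, textbook-style values, not numbers given in the lecture.

```python
import math

# Dishonest casino: hidden states Fair (F) and Loaded (L); observations are die rolls 1-6.
# All parameters below are assumed for illustration.
states = ["F", "L"]
start  = {"F": 0.5, "L": 0.5}
trans  = {"F": {"F": 0.95, "L": 0.05},
          "L": {"F": 0.10, "L": 0.90}}
emit   = {"F": {r: 1/6 for r in "123456"},
          "L": {**{r: 0.1 for r in "12345"}, "6": 0.5}}   # loaded die favors 6

def viterbi(rolls):
    # v[s] = log-probability of the best path ending in state s; back-pointers recover the path
    v = {s: math.log(start[s]) + math.log(emit[s][rolls[0]]) for s in states}
    back = []
    for x in rolls[1:]:
        ptr, new_v = {}, {}
        for t in states:
            best_prev = max(states, key=lambda s: v[s] + math.log(trans[s][t]))
            ptr[t] = best_prev
            new_v[t] = v[best_prev] + math.log(trans[best_prev][t]) + math.log(emit[t][x])
        back.append(ptr)
        v = new_v
    # Trace back from the best final state by following the pointers.
    last = max(states, key=lambda s: v[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return "".join(reversed(path))

print(viterbi("315662636656466561345316"))   # runs of 6s should decode mostly as 'L'
```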

Projects
–Scoring matrix for alignment (insertion/deletion)
–HMM (contact prediction)
–Phylogenetic tree
–Genetic circuit (regulation-based expression prediction)
–Microarray (clustering analysis)